Posters
Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB/ECCB 2015.
To confirm your poster, find the poster acceptance email, which contains a confirmation link.
Click on it and follow the instructions.
If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.
Category M - 'Proteomics'
M01 - Griss J, Foster JM, Hermjakob H, Vizcaíno JA. PRIDE Cluster: building a consensus of proteomics data.
Short Abstract: If genomics deciphers the blueprint of life, proteomics technologies based on mass spectrometry determine how nature and nurture have put the plans into practice. Funders and journals strongly mandate the deposition of proteomics data in public repositories, resulting in rapid data growth in public repositories like PRIDE at the EBI. However, data reuse in new contexts is still limited, partially due to the high heterogeneity of the data present in repositories and the inflation of false positive identifications when data sets are combined.
We present a spectral clustering algorithm that is able to select reliable identifications in heterogeneous datasets. We also show how incorrect annotations found in data submitted to PRIDE can be corrected using this approach. Finally, the PRIDE Cluster approach provides spectral libraries, which are compatible with current spectral search algorithms, turning quality controlled public data into a valuable resource for the next generation of experiments.
M02 - Inferring protein-protein interaction complexes from immunoprecipitation data
Short Abstract: Protein-protein interactions in cells are widely explored using small-scale experiments. However, the search for protein complexes and their interactions in data from high-throughput experiments such as immunoprecipitation (IP) is still a challenge. We present "4N", a novel method for detecting protein complexes in such data. Our method is a heuristic algorithm written in R. It is faster than model-based methods and has only a small number of tuning parameters.
During an IP experiment, an antibody binds to its target antigen in the cell sample. The antigen and the proteins bound to it can be isolated from the sample via interaction with the antibody, then identified and quantified directly by mass spectrometry. IP experiments using various antibodies on the same sample result in different, but possibly partly overlapping, sets of identified proteins that have different abundances in each experiment.
Proteins of the same complex are expected to have similar abundances across different IP experiments. They can be separated by applying a clustering algorithm to the IP dataset. 4N is such a clustering method, specifically developed to detect protein complexes as well as complex-complex interaction networks.
Our poster gives an overview of how 4N works and explains its application to a medium-scale immunoprecipitation dataset. It further explains how protein complexes and complex-complex interactions can be predicted and visualized.
M03 - The Single Amino acid Polymorphism Sequence Heterogeneity On Demand Transmutator (SAPshodt) and its application to in-depth bladder cancer proteome analysis
Short Abstract: An algorithm was developed to introduce single amino acid polymorphism (SAP) information from Ensembl into an existing protein sequence reference database (Human proteome set with isoforms from UniProt). After ID conversion and removal of redundancy as well as of entries lacking source information, 2,729,120 SAPs belonging to 77,047 proteins were retrieved. To introduce all SAPs into a specific tryptic peptide of a given protein, all corresponding residues were substituted with one of the permutations at a time, and the novel tryptic peptides were appended successively to the original protein sequence entry together with their two flanking tryptic peptides, using the artificial amino acid “J” as a separator [Schandorf, Nature Meth, 2007], generating the “SAPshodt” sequence database. Identification of SAPs using SAPshodt was tested with data sets of a bladder cancer project [Mukherjee, Annu. Conf. German Soc. Urol., 2013]. Extracted proteins were analysed using a nanoLC-IMS-MSE mass spectrometer with ProteinLynx Global SERVER™ (Waters Corp.). Peak lists were searched with Mascot (taxonomy: human, enzyme: trypsin, missed cleavages: 1, modifications: carbamidomethyl C/oxidation M, peptide tolerance: 5 ppm, MS/MS tolerance: 0.6 Da) against (i) the database SWALL (SwissProt/TrEMBL with isoforms) and (ii) the SAPshodt database. In the second search, 27 additional precursor masses were assigned to peptides comprising SAPs that otherwise would have been overlooked. For example, unique peptides assigned to SAP V93L in the Ig kappa chain C region (P01834), SAPs A14D, W16G, and G17R in Hemoglobin subunit beta (P68871), and SAP G592S in the Collagen alpha-2 (I) chain (P08123) were found.
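The peptide-appending step described in this abstract can be illustrated with a minimal Python sketch. It is not the SAPshodt implementation: the function names are hypothetical, the trypsin rule (cleave after K or R unless followed by P) is a common simplification, only one substitution is introduced at a time, and the toy sequence is invented.

```python
def tryptic_peptides(sequence):
    """Split a protein into tryptic peptides (cleave after K/R, not before P)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def append_sap_variant(sequence, position, new_residue, separator="J"):
    """Append the variant peptide plus its two flanking peptides to the entry.

    `position` is 1-based. Simplification (assumption): the variant peptide is
    not re-digested, even if the substitution creates or removes a cleavage site.
    """
    peptides = tryptic_peptides(sequence)
    offset = 0
    for index, peptide in enumerate(peptides):
        if offset < position <= offset + len(peptide):
            local = position - offset - 1
            variant = peptide[:local] + new_residue + peptide[local + 1:]
            left = peptides[index - 1] if index > 0 else ""
            right = peptides[index + 1] if index + 1 < len(peptides) else ""
            return sequence + separator + left + variant + right
        offset += len(peptide)
    raise ValueError("position lies outside the sequence")


toy_protein = "AAKCCDRPEEKGGG"  # invented sequence for illustration
extended_entry = append_sap_variant(toy_protein, 5, "W")
```

Because "J" does not occur in real protein sequences, a search engine cannot match a peptide that spans the boundary between the original entry and the appended variant block.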
M04 - The mzQuantML Standard for Quantitative Proteomics and Supporting Software
Short Abstract: The Proteomics Standards Initiative (PSI) has recently released the mzQuantML standard (version 1.0) to capture, archive and exchange quantitative proteomic data, derived from mass spectrometry (MS). The standard can represent quantitative data about regions in two-dimensional retention time versus mass/charge space (called features), peptides, proteins and protein groups. The format has structures for representing replicate MS runs, grouping of replicates, for example as study variables, and for capturing the parameters used by software packages. The format can reference other standards such as mzML and mzIdentML and thus the evidence trail for the MS workflow as a whole can be described. Specific semantic rules have been developed to ensure that data from particular techniques are encoded consistently, including label-free, MS1 label-based, MS2 tagging and spectral counting techniques. Rules for SRM are currently under development and will be released shortly. All project resources are available in the public domain from http://www.psidev.info/mzquantml.
We have developed a Java API (Application Programming Interface) for mzQuantML, called jmzQuantML, providing a bidirectional mapping from XML to Java objects, with methods for reading and writing valid files (available from http://code.google.com/p/jmzquantml/). The API is used in a number of software packages, developed by our group and others, including the mzQuantML validator (http://code.google.com/p/mzquantml-validator/). The validator can check i) if an mzQuantML file is valid against the XML Schema, ii) if required CV terms have been used appropriately, and iii) if the additional semantic validation rules used for describing particular experimental techniques have been fulfilled correctly.
M05 - Comparative analysis of statistical methods used for detecting differential expression in label-free mass spectrometry proteomics
Short Abstract: For semi-quantitative, label-free mass spectrometry proteomics, several methods have been proposed and used to detect the differential expression of proteins from spectral count data. These methods consist of basic statistical methods, as well as those used in microarray analyses and those designed specifically for spectral count data. Here we have assessed seven methods used across the literature for detecting differential expression: Student’s t-test, significance analysis of microarrays (SAM), normalized spectral abundance factor (NSAF), normalized spectral abundance factor-power law global error model (NSAF-PLGEM), spectral index (SpI), Qspec and Qspec/Qprot. We used 500 simulated data sets to assess the ability of these methods to detect differential expression at effect sizes from 20% to 200%, which are representative of the effect sizes found in real proteomics studies. The sensitivity and specificity of each method, calculated at a 5% FDR threshold, varied across the effect sizes. For the lowest effect size, 20%, none of the seven methods was able to accurately detect differential expression, most likely due to the low signal-to-noise ratio. NSAF-PLGEM outperformed the other methods at the remaining four effect sizes and showed the highest sensitivity, 0.78, for the effect size of 200%. Qspec/Qprot had the second highest sensitivity, 0.51, and the greatest area under the curve (AUC, 0.98) for the same effect size. These results suggest that label-free mass spectrometry proteomics requires statistical methods that are specifically designed to handle discrete, count-based data.
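As background for the NSAF-based methods named above, the normalized spectral abundance factor is commonly defined as a protein's spectral count divided by its length, normalized by the sum of that ratio over all proteins in the run. The Python sketch below illustrates this standard definition only; the function name and the example values are hypothetical and are not taken from the study.

```python
def nsaf(spectral_counts, lengths):
    """Return NSAF per protein: (SpC / L) divided by the sum of SpC / L over all proteins."""
    saf = {protein: spectral_counts[protein] / lengths[protein]
           for protein in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectral counts observed in this run")
    return {protein: value / total for protein, value in saf.items()}


# Invented example: protein B has more spectra but is four times longer.
values = nsaf({"A": 10, "B": 20}, {"A": 100, "B": 400})
```

Dividing by length corrects for the tendency of longer proteins to produce more spectra, and the normalization makes values comparable between runs.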
M06 - An Assessment of Correlation Between Instrument Platforms in Label-free Proteomics
Short Abstract: There is an underlying assumption that label-free proteomics studies should be broadly comparable even if they are conducted on different instrument platforms. While it is clearly impossible to verify this assumption across all labs and instrument platforms, this study aims to conduct an assessment of the correlation between the protein and peptide level abundances reported by analysis of the same samples on two separate instrument platforms. To achieve this, a set of time-course samples (3 biological replicates from each time-point) was run in parallel on two separate instrument platforms, the Thermo Scientific Orbitrap Velos and the Waters Synapt G2, under identical chromatographic conditions. The resulting files were processed and peptide/protein abundance values calculated using Progenesis LC-MS (Nonlinear Dynamics). This was done using as comparable a method as was practicable, with only high-quality peptide identifications being imported from Mascot (Matrix Science) and PLGS (Waters) respectively for the two instruments. Downstream analysis was conducted using in-house Java code to produce correlation statistics and plots (generated in R) for both protein and peptide data. It was observed that particular groups of peptides/proteins correlate well on different platforms, but others appear less reproducible.
Further analysis is on-going to determine the factors affecting the results, and to determine whether post-processing can be applied to data sets to improve overall reliability.
M07 - PTM MarkerFinder: On detecting and validating spectra from peptides carrying post-translational modifications
Short Abstract: Mass spectrometry (MS) analysis of peptides carrying post-translational modifications (PTMs) is challenging due to the instability of some modifications during MS analysis. However, glycopeptides as well as acetylated, methylated and other modified peptides release specific fragment ions during CID (Collision-Induced Dissociation) and HCD (Higher-Energy Collisional Dissociation) fragmentation. These fragment ions can be used to validate the presence of the PTM on the peptide.
Here we present PTM MarkerFinder, a software tool that takes advantage of such marker ions. PTM MarkerFinder screens the MS/MS spectra in the output of a database search (e.g., Mascot) for marker ions specific for selected PTMs. Moreover, it reports and annotates the HCD and the corresponding Electron Transfer Dissociation (ETD) spectrum (when present), and summarizes information on the type, number, and ratios of marker ions found in the data set.
In the present work, a sample containing enriched N-Acetylhexosamine (HexNAc) glycopeptides from yeast was analysed by liquid chromatography-mass spectrometry on an LTQ Orbitrap Velos using both HCD and ETD fragmentation techniques. The identification result (Mascot .dat file) was submitted as input to PTM MarkerFinder and screened for HexNAc oxonium ions. The software output was used for high-throughput validation of the identification results. PTM MarkerFinder is included in the R package protViz (≥ 0.1.40), http://cran.r-project.org/web/packages/protViz/.
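The idea of screening a spectrum for marker ions can be sketched in a few lines of Python. This is not the protViz API: the function name, the 10 ppm tolerance and the example peak list are assumptions, and the marker list uses the commonly cited HexNAc oxonium ion at m/z 204.0867 together with its two water-loss fragments.

```python
# Commonly cited HexNAc oxonium ion and its successive water-loss fragments (m/z).
HEXNAC_MARKERS = (204.0867, 186.0761, 168.0655)


def find_marker_ions(peaks, markers=HEXNAC_MARKERS, tol_ppm=10.0):
    """Return {marker m/z: highest matching intensity} for markers present in a spectrum.

    `peaks` is a list of (m/z, intensity) pairs; the tolerance is an assumed default.
    """
    found = {}
    for marker in markers:
        window = marker * tol_ppm * 1e-6
        matched = [intensity for mz, intensity in peaks if abs(mz - marker) <= window]
        if matched:
            found[marker] = max(matched)
    return found


# Invented spectrum: only the 204.0867 marker lies within tolerance.
spectrum = [(204.0869, 1000.0), (300.1, 50.0), (186.09, 10.0)]
hits = find_marker_ions(spectrum)
```

A spectrum with no entries in the returned dictionary would offer no marker-ion support for the reported modification under these assumed settings.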
M08 - High performance computational analysis of large-scale proteome datasets to assess incremental contribution to coverage of the human genome
Short Abstract: Computational analysis of shotgun proteomics data can now be performed in a completely automated and statistically rigorous way, as exemplified by the freely available MaxQuant environment. The sophisticated algorithms involved and the sheer amount of data translate into very high computational demands. Here we describe parallelization and memory optimization of the MaxQuant software with the aim of executing it on a large computer cluster. We analyze and mitigate bottlenecks in overall performance and find that the most time consuming algorithms are those detecting peptide features in the MS1 data as well as the fragment spectrum search. These tasks scale with the number of raw files and can readily be distributed over many CPUs as long as memory access is properly managed. Here we compared the performance of a parallelized version of MaxQuant running on a standard desktop, an I/O performance optimized desktop computer (‘game computer’), and a cluster environment. The modified gaming computer and the cluster vastly outperformed a standard desktop computer when analyzing more than 1000 raw files. We apply our high performance platform to investigate incremental coverage of the human proteome by high resolution MS data originating from in-depth cell line and cancer tissue proteome measurements.
M09 - ProteomicsDB: In-memory computing platform enables rapid meta analysis of thousands of mass spectrometry data sets
Short Abstract: In-memory databases utilize main memory as the primary data storage. This reduces disk seek when querying data and, ultimately, results in faster data retrieval. Implementing common mass spectrometry algorithms within the database furthermore exploits the fast access to main memory and allows rapid meta-analysis on big data.
ProteomicsDB was built upon the SAP HANA in-memory computing platform. Two nodes, each with 1 TB of main memory and 80 processing units, are connected to one 50 TB storage unit. Besides offering support for relational data in column and row stores, the platform also supports graph and text processing within the same system. Furthermore, R, C++ and L integration allows data manipulation within the database. Currently, commonly performed calculations, such as FDR estimation, theoretical peptide fragmentation patterns or sequence coverage, are implemented within the database and thus allow rapid, online processing of large data sets.
As of writing, ProteomicsDB contains more than 10,000 LC-MS/MS runs from around 700 experiments. Despite this amount of data, common tasks such as calculating the sequence coverage of a protein across the entire database or a selected subset of experiments, or matching a user-defined sequence against an observed spectrum, usually take only seconds or less. This allows direct user interaction without significant latency. Algorithms like k-means clustering and an efficient R interface will allow the generation of spectral libraries, model training and statistical testing within the database. We anticipate high availability and performance of the database, which will facilitate completion of the human proteome.
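Sequence coverage, one of the calculations mentioned above, is the fraction of a protein's residues covered by at least one identified peptide. The following Python sketch shows the calculation in isolation; it is an illustration with a hypothetical function name and invented sequences, not the in-database implementation used by ProteomicsDB.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Invented example: two overlapping peptides cover residues 1-5 of 10.
coverage = sequence_coverage("MKTAYIAKQR", ["MKT", "TAY", "WWW"])
```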
M10 - Representation of Protein Interactions in Complexes as Multilayer Graphs
Short Abstract: Protein complexes (PCs) are the basic machineries in all forms of life. Some PCs are found in all organisms (e.g., replisome, ribosome), while other PCs are specific to a phylogenetic branch (e.g., photosystem II) or to specific cells (e.g., exosome). Protein-protein interaction technologies have allowed the cataloguing of stable (coined obligatory) PCs. The number of PCs increases with organism complexity, with 180, 500 and 600 PCs found in E. coli, yeast and human, respectively. PCs act as machines that execute functions such as activation of gene expression, electron transfer, degradation and more. Herein, we transform each obligatory PC into a graph where the nodes are the proteins and the edges represent the physical interactions between them. However, PCs as ‘cellular machines’ may carry an intrinsic dynamic component. A transformation of a static graph into a series of dynamic representations captures the ability of the PC proteins to vary (e.g., coding SNP, phosphorylation). Eventually, variability on every node in the graph leads to an exponential number of graph variants. In our work, we model such multilayered graphs of PCs and classify them according to their dynamic robustness as an internal measure among PC graphs. Using the known outcome of variation in the PCs, we aim to rank the key nodes in any obligatory PC. Moreover, a gain in PC complexity along evolution will be tested by comparing the properties of the PC multilayered graphs. We illustrate our model on the proteasome and the transcription complex of RNA polymerase II.
M11 - Utilizing Unidentified Tandem Mass Spectral Libraries for Biological Sample Fingerprinting
Short Abstract: Shotgun proteomics, the dominant proteomics technology for the large-scale study of proteins in a biological sample, still depends on the availability of a protein sequence database, which limits its application to model organisms with already sequenced genomes.
Here we present a novel approach that extends the traditional spectral library for peptide identification to an MS/MS spectral library clustered from both identified and unidentified MS/MS spectra, without the need for a reliable protein sequence database. More importantly, the resulting library can function as a complete record of experimental data, allowing better data analysis and integration. We first adapted the unidentified tandem mass spectral library to the problem of identifying the source of the last blood meal of hematophagous arthropods, even 6 months after the tick had fed (Önder, Ö. et al. Identifying sources of tick blood meals using unidentified tandem mass spectral libraries. Nat. Commun. in press (2013)), which provides a useful tool for public health programs as well as for the study of the ecology of infectious diseases in nature. We further adapted the unidentified spectral library building and searching strategy to address the problem of general biological sample fingerprinting. It is shown that this strategy can potentially provide useful biomarkers for species classification, cell type classification, and cell state (e.g., normal versus diseased) classification.
In conclusion, this biological fingerprinting methodology is sensitive, fast, cost-effective and can potentially be adapted for other biological and medical applications when existing genome-based methods are impractical or ineffective.
M12 - Improved label-free quantitative analytical pipeline by accurate peptide map alignment and unrestrictive post-translational modification search for unquantifiable proteins
Short Abstract: Label-based and label-free protein quantitation are among the most popular strategies for determining relative protein abundance in samples. In the former, peptides are labeled with an isotopic reagent before MS(/MS) analysis for comparative proteomic studies. However, it has some technical limitations, and the label-free approach is a good alternative strategy that avoids those problems. Current instruments and IT techniques can handle large volumes of high-quality mass spectrum data. With these advances, high-throughput analysis has become common in proteomic studies, and the need for sophisticated protein analysis algorithms is greater than ever. Here, we developed an analysis pipeline that calculates protein abundance ratios between different samples using peptide map alignment and peak intensity. First, after the MS/MS data set is acquired, all peptides are mapped along the m/z and retention time axes. At the same time, peptides and proteins are identified by a sequence database search engine. All identified peptides are used as landmarks to correct the peptide map alignment. For identified peptides that remain unmatched across samples, we use mass spectrum quality assessment and an unrestrictive PTM search to identify additional missing proteins. This step helps to recover information on proteins that are unquantifiable by data-dependent quantitative approaches such as spectral counting. Protein abundance values are then calculated, and differentially expressed proteins are analyzed by an ANOVA test. This solution helps to find additional unquantifiable proteins that would be ignored by data-dependent approaches, and provides automatic, accurate label-free protein quantitation without limits on sample number or data size.
M13 - myProMS, a Bioinformatics Solution for Collaborative Processing and Analysis of Mass Spectrometry-based Proteomics Data
Short Abstract: Proteomic Mass Spectrometry (MS) generates complex data that require multiple steps of computational and manual processing to be converted into meaningful biological information. To be successful, this process requires the skills of MS specialists, bioinformaticians and biologists. To facilitate such collaboration, we have developed myProMS[1], a bioinformatics environment that rationalizes this data processing workflow while allowing multiple users to interact with the data according to their expertise level.
Typically, outputs from search engines such as Mascot are imported into myProMS within a defined experimental context. Spectrum interpretations, protein attributions and variable modification (e.g. phosphorylation) positions must be validated by MS specialists either automatically through dedicated algorithms or manually. Only curated data become accessible to biologists for further investigation. Different quantification and differential analysis methods are provided through intuitive interfaces. Results are displayed as interactive graphics to help users visualize and mine their data. Further biological interpretation is possible using customized Gene Ontology analyses and extensive linking to external resources.
myProMS is an efficient and user-friendly solution for proteomic MS collaborative projects. It is used by multiple MS laboratories and benefits from user feedback for continuous improvement. The software can be evaluated and downloaded freely at http://myproms-demo.curie.fr. Contact: myproms@curie.fr
[1] Poullet P., Carpentier S., Barillot E. Proteomics 2007; 7(15):2553-6.
M14 - Indications of Instability and Other Non-typical Characteristics of Human Alcohol Dehydrogenase Class V
Short Abstract: Human alcohol dehydrogenase class V (ADH5) has been successfully expressed as a fusion protein with green fluorescent protein, and also with glutathione-S-transferase. However, it has never been isolated as a native protein, nor shown any activity towards the traditional alcohol dehydrogenase substrates. We have used computational methods to study structure and properties of this protein. The structure was generated using homology modelling based on multiple ADH structures, and properties were examined using molecular dynamics.
The ADH5 behaviour was analysed using a novel method where we generated models of other protein family members (ADH1C and ADH3) and compared these trajectories with the trajectories of the ADH5 models. This analysis implied that the regions involved in dimer interactions behave in a different way in ADH5 than the corresponding regions in other ADH enzymes, mainly causing increased structural variability in the central β sheets. The dimer formation is known to be important for the stability and function of other ADH enzymes. The increased structural variability implies that while the protein is expressed at the transcript level, the stability of the ADH5 dimer is compromised, which in turn would explain the lack of activity and that a dimeric ADH5 has never been isolated.
Modelled ADH5 structures with sequence segments modified into those from other ADH enzymes decreased the structural variability, but not down to the level of the other ADH enzymes, implying that the instability is focused to more than one part of the sequence.
M15 - Determining the subcellular location of new proteins using local features
Short Abstract: Automatically determining the subcellular location of proteins from fluorescence microscopy is both an important problem in its own right and a test problem for pattern recognition in bioimages more generally. In previous work in the field, methods have been evaluated on datasets where each location class is represented by images of different cells displaying the same marker protein (or other molecule). In this setting, very high accuracies have been reported for automated methods. However, it is impossible to distinguish methods that identify a protein (perhaps even based on the staining method) from methods that identify a location class.
In this work, we model a more realistic setting. We define several location classes and, for each of them, image different proteins which display the location class of interest. We were thus able to quantitatively demonstrate that widely-used methods, which obtain very high accuracies when recognizing the same protein, do not obtain the same high accuracies when trained and tested on different proteins (which share location classes, but were tagged and imaged independently).
We also demonstrate that the use of local features (in particular, Speeded-Up Robust Features, or SURF) can achieve better results than the traditional whole-image features used in subcellular location analysis. Their use was additionally validated on several publicly-available benchmarks.
M16 - Efficient Interpretation of Tandem Mass Tags in Top-Down Proteomics
Short Abstract: Mass spectrometry is increasingly used as the experimental method of choice for the identification of proteins in biological samples. In most settings, proteins are first digested into peptides, which are then used to infer the presence of their containing proteins in a bottom-up manner. Alternatively, intact proteins can be directly subjected to mass-spectral analysis. While this top-down approach is typically less sensitive than the bottom-up variant, it has several distinctive advantages, such as the sequence coverage or the potential study of post-translational modifications.
Recently, Hung and Tholey have shown in a pilot study that the popular tandem mass tag (TMT) labelling technology, which is often used for quantification in bottom-up studies, can be applied in top-down proteomics as well. This, however, leads to a complex interpretation problem, where we want to annotate peaks with their respective generating protein, the number of charges, and the number of TMT groups acquired by this protein. In this work, we give an algorithm for the efficient enumeration of all valid annotations that fulfil available experimental constraints. Applying the algorithm to real-world data, we show that the annotation problem can indeed be efficiently solved in realistic situations, but that further experimental constraints will be required to go beyond the proof-of-concept stage.
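To make the annotation problem concrete, the Python sketch below enumerates every (protein, charge, number of TMT groups) triple whose theoretical m/z matches an observed peak. It is a brute-force illustration, not the efficient algorithm of the poster. The function name, the tolerance, the charge limit, the example protein and the tag mass (229.162932 Da, the monoisotopic mass commonly listed for a TMT6plex label) are all assumptions.

```python
PROTON_MASS = 1.007276  # Da


def enumerate_annotations(peak_mz, proteins, tag_mass, max_charge=30, tol_ppm=20.0):
    """List all (name, charge, tags) triples consistent with an observed peak.

    `proteins` maps a name to (neutral mass, maximum number of tag groups).
    Theoretical m/z = (mass + tags * tag_mass) / charge + proton mass.
    """
    hits = []
    for name, (mass, max_tags) in proteins.items():
        for charge in range(1, max_charge + 1):
            for tags in range(max_tags + 1):
                theoretical = (mass + tags * tag_mass) / charge + PROTON_MASS
                if abs(theoretical - peak_mz) <= peak_mz * tol_ppm * 1e-6:
                    hits.append((name, charge, tags))
    return hits


TMT_MASS = 229.162932  # assumed tag mass; other TMT reagents differ
candidates = {"P1": (10000.0, 3)}  # invented protein: 10 kDa, up to 3 tags
observed = (10000.0 + 2 * TMT_MASS) / 10 + PROTON_MASS
annotations = enumerate_annotations(observed, candidates, TMT_MASS)
```

With many candidate proteins and wide tolerances, several triples can match one peak, which is why additional experimental constraints help to disambiguate the annotation.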
M17 - An integrated Pipeline to enhance Bacterial Genome Annotation using Mass Spectrometry
Short Abstract: In bottom-up liquid chromatography mass spectrometry (LC-MS), the most common high-throughput proteomics approach, the generated tandem MS (MS/MS) spectra are usually matched to peptide sequences from protein databases. These databases contain protein sequences of varying quality: only a minor part of the sequences are experimentally validated, some are predicted by homology to other species, and a considerable part are based only on predicted open reading frames. Protein prediction algorithms are very advanced, but generally have weaknesses in the prediction of small proteins, introns and translation start sites, and inevitably have sensitivity and specificity below 1.0, e.g. due to species-dependent genome patterns.
As bacterial genomes are comparatively short and easily sequenced, it is possible to create a pseudo-protein database by translating all six reading frames of the genome. Such a database can be used to identify MS/MS spectra that give no identification in conventional database searches.
We present a freely available pipeline that allows for enhanced genome annotation by finding novel protein-coding regions. Starting from the bacterial genome, a six-frame translation is performed. MS data can be searched with major search engines against this new pseudo-protein database and a common protein database. After filtering for a given false discovery rate (FDR), high-quality matches are mapped back to the genome, and novel protein-coding regions are reported and visualized. The pipeline was tested with MS experiments and the genome of the cyanobacterium Synechocystis sp. PCC 6803 (3.95 Mb), and the results are presented.
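The six-frame translation at the start of such a pipeline can be sketched as follows. This is a minimal Python illustration, not the pipeline's code: the function names and the frame labels (+1 to +3, -1 to -3) are assumptions. It uses the standard codon assignments, which the bacterial genetic code shares (the two differ only in permitted start codons), and translates unknown codons to "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the conventional TCAG ordering; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(genome):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    genome = genome.upper()
    frames = {}
    for strand, sequence in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")  # invented 9-base example
```

To build a searchable pseudo-protein database, each frame would typically be split at stop codons ("*") and the resulting fragments written out as separate entries with their genome coordinates.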
M18 - NMR-based Metabolite Identification by Iteratively Updating Evolved Bayesian Networks
Short Abstract: The interpretation of nuclear magnetic resonance (NMR) experimental results for metabolomics studies requires intensive signal processing and multivariate data analysis techniques. A critical process in the typical workflow is the identification of significant metabolites, typically compiled post hoc. Current techniques require repeated manual tuning and are built on databases of pure compound samples, where the experimental conditions are simulated in the laboratory. We developed a novel metabolite identification algorithm utilizing a Bayes network with genetic algorithm feature selection that iteratively adjusts identification probabilities based on positive or negative identification. The algorithm is built upon empirical spectroscopic data, avoiding biases inherent in libraries of pure compounds. This technique captures the inherent variability in experimental data, while greatly reducing the need to build databases of pure compounds. The ability to annotate spectra by learning patterns within empirical data allows the metabolomics community to utilize existing datasets to improve and extend our method. Specifically, we have bundled our novel identification algorithm with cloud-based cyberinfrastructure to build a crowd-sourced database of metabolite identifications and spectral annotations. The feasibility and accuracy of our algorithm are shown by measuring the specificity (>0.75) and sensitivity (>0.65) on 1H urine-derived spectroscopic data. The genetic algorithm successfully removes redundant information and identifies networks of influence that represent annotation dependencies. More than 60% of the data is identified as redundant or irrelevant and removed without sacrificing accuracy. We demonstrate our technique by identifying metabolites in an NMR-based spectroscopic study of metabolite profiles in corals exposed to a combination of stressors.
M19 - Structural coverage using X-ray crystallization for a current snapshot of the protein universe
Short Abstract: We designed an accurate and fast method for determination of protein crystallization propensity, fDETECT. Empirical validation shows that it is competitive with existing predictors, PPCpred, XtalPred, SVMCrys, CRYSTALP2, OBScore, and ParCrys, achieving accuracy of 71.6% and MCC of 0.388. Interestingly, fDETECT generates crystallization propensity scores that are correlated with resolution of resulting crystal structure, as we demonstrate using chains from 44671 PDB structures. fDETECT provides accurate means to select easier to crystallize targets, which is vitally important for current structural genomics pipelines.
The time-efficient design enabled us to perform a first-of-its-kind large-scale analysis of crystallization propensity for 877 fully sequenced proteomes (64 archaea, 553 bacteria, 201 eukaryotes and 59 viruses). We analyzed coverage of structures and functions (structural coverage of proteins annotated with specific GO annotations) that could be obtained via X-ray crystallization for specific domains of life and functional types. Our study shows that proteomes differ in their difficulty for structural determination, with archaeal and bacterial species being easier. Using current X-ray crystallization protocols and a cut-off equal to the median score for PDB structures, we estimate that structures can be obtained for around 1/3 of all considered proteins, with varying success rates across domains: 14% for eukaryotes, 27% for viruses, 47% for bacteria, and 59% for archaea. However, the coverage of known GO functions (including molecular functions, biological processes and cellular components) defined by solving representative proteins from each functional annotation is higher and equals 40% in eukaryotes, 57% in viruses, 85% in bacteria and 97% in archaea.
M20 - Utility of prior information such as RNASeq and GPMdb protein observation frequency for improving MS/MS based protein identifications
Short Abstract: Tandem mass spectrometry (MS/MS) based shotgun proteomics has become the method of choice for protein identification in most studies. The method employs spectral matching algorithms and statistical models to identify the proteins present in the sample based on the MS/MS spectra generated. However, these methods do not, in general, take into account any prior information available about the sample in their protein inference step. Since for most biological systems there is often a wealth of prior information available, algorithms that can incorporate and utilize such information could help to improve the sensitivity of protein identification from shotgun proteomics data.
In this study, we have explored the utility of RNASeq abundance values and GPMdb protein observation frequencies for improving MS/MS based protein identification. We have developed a statistical method for adjusting the identification probabilities of proteins, initially computed by the Trans-Proteomic Pipeline analysis suite, to account for the GPMdb and RNASeq information. By utilizing these adjusted probabilities we were able to confidently identify proteins that would have otherwise fallen below the confidence threshold. In moderate- and low-depth datasets (< 2000-3000 proteins), the method allowed improvements of 2-10% in the number of protein identifications at 1% FDR.
M21 - Unbiased analysis of high throughput LC-IMS-MSE data for large scale biomarker discovery programs
Short Abstract: Modern approaches in mass spectrometry utilising LC-ion mobility-MSE provide a large amount of data that is characterised in multiple dimensions: mass-to-charge ratio (m/z); shape (drift); retention time; and intensity. In generalised analytical approaches, spectral information is searched against protein databases with fixed criteria (i.e. multiply charged tryptic peptides, minimal modifications). This approach for biomarker discovery is therefore inherently biased by such databases, and any non-identified ions will be discarded, even though they likely represent important chemical entities in the system under investigation. To maximally extract information from spectra provided by LC-IMS-MSE instruments in large high-throughput studies, we have developed an unbiased approach to discover biomarkers from spectral datasets. Integral to our method is the alignment of high-dimensional spectral data from across experiments, where, as an initial step, internal standards are used to monitor, model, and correct for experimental variation. Hierarchical clustering approaches are used to align ions from different spectra, utilising an overlapping bin strategy followed by a post-alignment duplicate removal to avoid boundary effects. Post alignment, peak model information obtained from commercial software pre-processing is used to ensure isotope peaks are well aligned. Current findings show that our approach can identify more discriminant ions than database-dependent approaches. Moreover, we have been able to identify potential peptide conformers and detector shadows, which have led to abundance mis-estimation using a database searching approach. Continuing work involves combining the alignment with a pipeline of classification approaches for large-scale clinical sample analysis in the UBIOPRED severe asthma project.
M22 - N-terminal domains in two-domain proteins are biased to be shorter and predicted to fold faster than their C-terminal counterparts
Short Abstract: Computational analysis of proteomes in all kingdoms of life reveals a strong tendency for N-terminal domains in two-domain proteins to have shorter sequences than their neighboring C-terminal domains. Given that folding rates are affected by chain length, we asked whether the tendency for N-terminal domains to be shorter than their neighboring C-terminal domains reflects selection for faster folding N-terminal domains. Calculations of absolute contact order, another predictor of folding rate, provide additional evidence that N-terminal domains tend to fold faster than their C-terminal neighboring domains. A possible explanation for this bias, which is more pronounced in prokaryotes than in eukaryotes, is that faster folding of N-terminal domains reduces the risk of protein aggregation during folding by preventing formation of non-native interdomain interactions. This explanation is supported by our finding that two-domain proteins with a shorter N-terminal domain are much more abundant than those with a shorter C-terminal domain.
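Absolute contact order is conventionally defined as the mean sequence separation between residues that are in contact in the native structure, and relative contact order divides that value by the chain length. The following is a minimal sketch of that definition only, assuming contacts have already been extracted as pairs of residue indices; the function names and the toy contact lists are illustrative assumptions, not data or code from this poster.

```python
def absolute_contact_order(contacts):
    """Mean sequence separation |i - j| over a list of (i, j) residue contacts."""
    if not contacts:
        raise ValueError("need at least one contact")
    return sum(abs(i - j) for i, j in contacts) / len(contacts)


def relative_contact_order(contacts, chain_length):
    """Absolute contact order normalised by the number of residues in the chain."""
    return absolute_contact_order(contacts) / chain_length


# Hypothetical toy contact lists for two domains (invented for illustration).
n_domain_contacts = [(1, 4), (2, 6), (3, 9)]       # mostly local contacts
c_domain_contacts = [(1, 20), (5, 40), (10, 60)]   # mostly long-range contacts

aco_n = absolute_contact_order(n_domain_contacts)
aco_c = absolute_contact_order(c_domain_contacts)
```

A lower absolute contact order is generally taken to predict faster folding, so in this toy case the first domain would be the predicted faster folder.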
M23 - Analysis of High Accuracy, Quantitative Proteomics Data in the MaxQB Database
Short Abstract: Mass spectrometry (MS)-based proteomics generates rapidly increasing amounts of precise and quantitative information. Analysis of individual proteomic experiments has made great strides, but the crucial ability to compare and store information across different proteome measurements still presents many challenges. For example, it has been difficult to avoid contamination of databases with low-quality peptide identifications, to control for the inflation in false positive identifications when combining datasets, and to integrate quantitative data. While, for example, the contamination with low-quality identifications has been addressed by joint analysis of deposited raw data in some public repositories, we reasoned that there should be a role for a database specifically designed for high-resolution and quantitative data. Here we describe a novel database termed MaxQB that stores and displays collections of large proteomics projects and allows joint analysis and comparison. We demonstrate the analysis tools of MaxQB using proteome data of 11 different human cell lines and 28 mouse tissues. The database-wide False Discovery Rate is controlled by adjusting the project-specific cut-off scores for the combined datasets. The 11 cell line proteomes together identify proteins expressed from more than half of all human genes. For each protein of interest, expression levels estimated by label-free quantification can be visualized across the cell lines. Similarly, the expression rank order and estimated amount of each protein within each proteome are plotted. The information contained in MaxQB, including high-resolution fragment spectra, is accessible to the community via a user-friendly web interface at http://maxqb.biochem.mpg.de.
M24 - Quantification of Cell-to-cell Variability in Protein Spatial Spread from Fluorescence Microscopy of Unsynchronized Budding Yeast
Short Abstract: Protein abundance and stochasticity in abundance have been systematically characterized in budding yeast using fluorescently tagged proteins. Subcellular location can also be systematically uncovered using supervised machine learning approaches that have been trained to recognize predefined image classes based on statistical features. As an alternative, we capture cell stage dependence of protein spatial expression within automatically identified cells. We use the identified bud area as a cell-stage indicator. We show that similarities between the inferred expression patterns contain more information about protein function than can be explained by a previous manual categorization of subcellular localization. Further analysis reveals that such a characterization allows us to identify 12% of the 4004 proteins by finding the protein that is closest in expression pattern in a replicate experiment. This characterization includes stochasticity levels in measurement, which are correlated with previous reports in the case of stochasticity in protein abundance. Other stochasticity levels, such as in compactness for protein expression, are shown to be reproducible. Changes in cell morphology due to the alpha factor mating pheromone or changes of fluorescent markers required for segmentation also have a limited impact on the measured variability levels. Our results suggest that quantitative cell-stage dependent representations of protein spread discriminate protein spatial expressions without requiring predefined subcellular location classes. We show that some major quantified deviations, such as high spatial variability, are systematically detected under a spectrum of experimental conditions.
M25 - Building a deep phenotyping reference map of the immune system by density-based tracing of mass cytometry datasets
Short Abstract: The immune system consists of hundreds of discrete immune cell types, each characterized by distinct functions and a unique quantitative pattern of marker expression. Despite morphological/functional diversity, the majority of immune cell types share a common origin and act to maintain immunologic integrity. Although an outline of the immune hierarchy exists, there is no unified quantitative map of the immune system.
Highly parameterized mass cytometry allows routine measures of up to 45 protein markers from single cells. We use mass cytometry to systematically map immune structure by applying comprehensive phenotyping panels of mass labeled surface markers and intracellular network response states. We created a deterministic mapping algorithm that can delineate cell types that reflect “natural” biological features of the cells, can infer phylogenetic and functional relationships between cell types, and embodies previously identified biological phenotypes. For this, a mass cytometry dataset is first decomposed into individual populations by means of random tessellation density-based clustering. Next, individual populations are subjected to directional density tracing to identify cue points marking local differentiation routes. Third, cue points are connected to link related populations, thus forming a map that can be represented in 2D. The resulting population map is automatically annotated using Cell Ontology database definitions. Based on preliminary analyses, we expect this will allow automated, faithful reconstruction of cell population maps that recapitulate known phenotypes and demonstrate new marker combinations that call out previously known as well as novel cell subsets.
M26 - Peak detection by Gaussian mixture modeling of MALDI-ToF spectra for proteome profiling
Short Abstract: MALDI-ToF mass spectrometry allows characterization of the human proteome, giving the possibility to identify protein signatures that can distinguish between profiles of two different biological conditions. Important fragments of the spectrum are peaks corresponding to the proteins contained in the sample. An issue is to extract significant information from a limited number of samples and to avoid selecting peaks influenced mostly by non-disease-related artefacts.
We model a mass spectrum as a mixture of Gaussian distributions, where a single component has a similar meaning to the signal peak. We use the EM algorithm, in which a crucial element is setting up the initial conditions. We propose dividing the spectrum into segments and modeling separate segments with different numbers of components. The model parameters obtained are then used for modeling the whole spectrum. Because of the risk of modeling noise, post-processing of spectral components is needed. We merge similar Gaussians and filter out noise components with criteria based on coefficients of variation and weights of the components.
We checked the ability to reconstruct proteins in the sample using simulated spectra with known peak locations and spike-in experiment data, and compared it to two known peak detection algorithms. By filtering biologically inconsistent components of the model, we reduced the number of false positive biomarker discoveries. The proposed algorithm detects more low-mass and low-signal peaks, which are particularly important when searching for protein biomarkers. Post-processing of Gaussian spectral components increases the validity of protein profiling.
This work was funded by the National Science Centre (2011/01/N/NZ2/04813) and a scholarship from the ‘Doktoris’ program.
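The post-processing step described in this abstract (merging similar Gaussians, then filtering noise components by weight and coefficient of variation) could look roughly like the sketch below. This is not the authors' implementation: the merge rule (moment matching of neighbouring components), the definition of the coefficient of variation as sd/mean, and all thresholds and example values are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    weight: float  # mixture weight
    mean: float    # peak position (m/z)
    sd: float      # peak width


def merge_pair(a, b):
    """Moment-matched merge: keeps total weight, mean and variance of the pair."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    second = (a.weight * (a.sd ** 2 + a.mean ** 2)
              + b.weight * (b.sd ** 2 + b.mean ** 2)) / w
    return Component(w, mean, math.sqrt(max(second - mean ** 2, 0.0)))


def postprocess(components, merge_dist=0.5, min_weight=0.01, max_cv=0.01):
    """Merge components with close means, then drop light or overly broad ones.

    merge_dist, min_weight and max_cv are hypothetical thresholds.
    """
    merged = []
    for c in sorted(components, key=lambda c: c.mean):
        if merged and c.mean - merged[-1].mean <= merge_dist:
            merged[-1] = merge_pair(merged[-1], c)
        else:
            merged.append(c)
    return [c for c in merged
            if c.weight >= min_weight and c.sd / c.mean <= max_cv]


# Invented example: two near-duplicate peaks, one real peak,
# one very light component and one very broad component.
fitted = [
    Component(0.30, 1000.0, 2.0),
    Component(0.25, 1000.3, 2.0),
    Component(0.42, 1500.0, 3.0),
    Component(0.005, 1200.0, 1.0),
    Component(0.025, 1800.0, 400.0),
]
peaks = postprocess(fitted)
```

In this toy input the two components near m/z 1000 collapse into one peak, and the light and broad components are discarded, leaving two peaks.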
M27 - Succinct Multibit Tree for Large-Scale Chemical Fingerprint Searches
Short Abstract: Similarity searches in databases of chemical fingerprints are a fundamental task in discovering novel drug-like molecules. Multibit trees are a data structure that enables fast similarity searches of chemical fingerprints (Kristensen et al., WABI’09). A standard pointer-based representation of multibit trees consumes a large amount of memory to index large-scale fingerprint databases. To make matters worse, original fingerprint databases need to be stored in memory to filter out false positives. A succinct data structure is compact and enables fast operations. Many succinct data structures have been proposed thus far, and have been applied to many fields such as full text indexing and genome mapping. We present compact representations of both multibit trees and fingerprint databases by applying these data structures. Experiments revealed that memory usage in our representations was much smaller than that of the standard pointer-based representation. Moreover, our representations enabled us to efficiently perform PubChem-scale similarity searches.
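For readers unfamiliar with fingerprint similarity search, the sketch below shows the underlying query: find all fingerprints whose Tanimoto similarity to a query exceeds a threshold. It is a plain linear scan with a simple bit-count bound for pruning, not a multibit tree and not the succinct representation presented in this poster; the fingerprints, threshold and function names are invented for illustration.

```python
def popcount(x: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity |a AND b| / |a OR b| of two bit fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def search(query: int, database, threshold: float):
    """Return (index, similarity) for fingerprints with similarity >= threshold."""
    q = popcount(query)
    hits = []
    for idx, fp in enumerate(database):
        n = popcount(fp)
        hi, lo = max(q, n), min(q, n)
        # Similarity can never exceed min(q, n) / max(q, n), so skip early.
        if hi and lo / hi < threshold:
            continue
        s = tanimoto(query, fp)
        if s >= threshold:
            hits.append((idx, s))
    return hits


# Hypothetical 8-bit fingerprints.
database = [0b11110000, 0b11111111, 0b00000001, 0b11100000]
hits = search(0b11110000, database, threshold=0.7)
```

Tree-based indexes such as multibit trees aim to avoid visiting every fingerprint at all; this scan only illustrates the similarity measure and the kind of bound that makes pruning possible.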
M28 - Measuring and managing ratio compression for accurate iTRAQ/TMT quantification
Short Abstract: Isobaric mass tagging (e.g. TMT and iTRAQ) is a precise and sensitive multiplexed quantification technique in mass spectrometry. However, accurate quantification of complex (chemo-)proteomic samples is impaired by co-isolation and co-fragmentation of peptides, thus leading to ‘ratio compression’. In contrast, label-free quantification strategies cannot be multiplexed and are less precise but do not suffer from such a systematic accuracy bias. We compared protein quantification results obtained with these methods for a chemoproteomic competition binding experiment and evaluated the utility of measures for spectrum purity in survey spectra for predicting co-fragmentation induced ratio compression in TMT quantification. Stringent filters for spectra with low interference and high precursor signal-to-noise ratios enabled substantially more accurate TMT quantification but 30%-60% fewer proteins were quantified. However, when a signal-to-interference ratio based fold change correction algorithm was applied in conjunction with soft spectrum filters, quantification accuracy was comparable to that obtained with stringent spectrum filters at a negligible loss in coverage (< 10%). The fold change correction algorithm enabled accurate determination of inhibitor binding potencies in complex chemoproteomics samples, thus avoiding the need for extensive sample fractionation.
M29 - Sequence Determinants Govern the Translation Efficiency of the Secretory Proteome
Short Abstract: Translation must be tightly controlled for coping with the cell's demand and its limited resources. Energetically, translation is the most expensive operation in dividing cells. We applied a measure of tRNA adaptation index (tAI) as an indirect proxy for the translation rate. We tested the possibility that sequence determinants are encoded along the transcripts to govern translational efficiency. The secretory proteome comprises about 30% of the proteins in human and other multi-cellular model systems. Many of these proteins contain at their N-terminus a segment called a signal peptide (SP), which directs translocation to the ER. Indeed, all SP-proteins are translated by ER-membrane bound ribosomes. We anticipated that proteins translated by free or bound ribosomes differ with respect to their overall translation speed. We demonstrate that clusters of poorly adapted codons followed by abundant codons specify the N-terminus of secreted and SP-membranous proteins. The phenomenon is generalized to the proteomes of yeast, fly and worm despite a poor correlation among their codon tAI values. We propose that translation determinants have evolved to match the cellular needs for translational rate. The codons’ arrangement along transcripts is crucial for management of synaptic sites and poorly folded protein translation. The appearance of low tAI codons at the N-terminus of SP proteins attenuates the elongation rate. We conclude that processes such as translocation through the ER membrane, processing, maturation and folding are dependent on a specific codon arrangement that dictates a delay in translational elongation.
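The tRNA adaptation index of a coding sequence is conventionally computed as the geometric mean of per-codon adaptiveness weights. The sketch below applies that definition in a sliding window to expose a slow stretch followed by well-adapted codons. The codon weights, the sequence and the window size are invented for illustration and are not values from this study.

```python
import math


def tai(codons, weights):
    """Geometric mean of the per-codon adaptiveness weights."""
    logs = [math.log(weights[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


def windowed_tai(seq, weights, window=3):
    """tAI in each sliding window of `window` codons along a coding sequence."""
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [tai(codons[i:i + window], weights)
            for i in range(len(codons) - window + 1)]


# Hypothetical weights: CTA is poorly adapted, CTG and GCC are well adapted.
weights = {"CTA": 0.1, "CTG": 0.9, "GCC": 0.8}
profile = windowed_tai("CTACTACTGCTGGCC", weights, window=2)
```

A profile that starts low and then rises corresponds to the pattern the abstract describes at the start of SP proteins: poorly adapted codons followed by abundant ones.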