A modular software platform integrating the processing and bioinformatic analysis of proteomics data

Soeren Schandorff1, Hans Jespersen2, C. H. Ahrens, M. Damsbo, S. Larsen, B. K. Ramsgaard, E. T. Nielsen, G. Thorvil, J. P. Kristensen, K. P. Budin, J. Matthiesen, P. Venø, J. C. Brønd, T. Topaloglou, P. T. Ruhoff
1schandorff@mdsdenmark.com, MDS Denmark; 2hjespersen@mdsdenmark.com, MDS Denmark

S. Schandorff, H. M. Jespersen, C. H. Ahrens, M. Damsbo, S. Larsen, B. K. Ramsgaard, E. T. Nielsen, G. Thorvil, J. P. Kristensen, K. P. Budin, J. Matthiesen, P. Venø, J. C. Brønd, T. Topaloglou*, P. T. Ruhoff
MDS Denmark, Stærmosegårdsvej 6, DK-5230 Odense M, Denmark
* MDS Proteomics, 251 Attwell Drive, Toronto, ON M9W7H4, Canada
E-mail: pruhoff@mdsdenmark.com

We have developed an integrated software platform that addresses common problems encountered when processing and analyzing large proteomics datasets. The platform is designed as a three-tier application. The central component is a proteomics data model backed by an RDBMS, which acts as a data warehouse. The highly modular design, implemented as J2EE components, results in a robust, well-integrated and extensible system. Tools have been developed that enable semi-automatic data handling and verification. An automated data acquisition component encapsulates expert knowledge and provides one common interface to several proteomics technology platforms. An experimental sample module tracks the identification of proteins and their close sequence neighbors in individual and separate experiments. Closely related proteins are clustered based on pre-computed sequence similarity. Quality scores are assigned, enabling statistical analysis and filtering of processed data. Experimental data is subsequently integrated with data from public and proprietary databases contained within the bioinformatics data warehouse. This protein-centric data warehouse consolidates protein sequence information, annotations, and literature. It furthermore integrates genomic and transcript data, along with other data sets such as protein-protein interaction data, patents, and proprietary data. In addition, these data are complemented by pre-computed results of various bioinformatic analyses (conserved domains, signal peptides, transmembrane helices, close sequence neighbors, subcellular localization, etc.), which are also run automatically on new sequences. The integration of the computationally enriched data warehouse information with the experimental data enables protein isoform and splice variant distinction, protein-protein interaction analysis, pathway analysis, and text mining. The flexible modular setup of this system has been applied to support the target discovery and validation needs of MDS Proteomics.
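
The abstract does not say how closely related proteins are clustered from the pre-computed sequence similarities. As an illustration only, the sketch below assumes single-linkage clustering over a list of pre-computed pairwise identities, implemented with a union-find structure. The function name `cluster_proteins`, the `(id_a, id_b, identity)` tuple layout, and the `min_identity` threshold are hypothetical and are not taken from the platform.

```python
from collections import defaultdict


def cluster_proteins(protein_ids, similar_pairs, min_identity=0.9):
    """Group proteins into single-linkage clusters (illustrative sketch).

    protein_ids   -- iterable of protein identifiers (hypothetical IDs)
    similar_pairs -- iterable of (id_a, id_b, identity) tuples, assumed to be
                     pre-computed; both IDs must appear in protein_ids
    min_identity  -- assumed cut-off; pairs below it are ignored
    """
    protein_ids = list(protein_ids)
    # Each protein starts as the root of its own cluster.
    parent = {p: p for p in protein_ids}

    def find(p):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every pair that is similar enough.
    for a, b, identity in similar_pairs:
        if identity >= min_identity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect members under their cluster root.
    clusters = defaultdict(set)
    for p in protein_ids:
        clusters[find(p)].add(p)

    # Return a deterministic, sorted list of sorted member lists.
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_proteins(
    ["P1", "P2", "P3", "P4"],
    [("P1", "P2", 0.95), ("P2", "P3", 0.92), ("P3", "P4", 0.50)],
)
print(clusters)
```

With single linkage, P1 and P3 end up in the same cluster through P2 even though no direct pair links them, which is one plausible reading of "close sequence neighbors"; a stricter scheme (for example complete linkage) would behave differently.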