Using proteomics to mine genome sequences

Jonathan W Arthur (1), Marc R Wilkins (2)
(1) jonathan.arthur@proteomesystems.com, Proteome Systems Ltd; (2) marc.wilkins@proteomesystems.com, Proteome Systems Ltd

We present a hypothesis-independent method for mining genome sequences with proteomic data. The method identifies the region of a genome that codes for a protein, using information from mass spectrometric analysis of proteins and peptides. The raw genome sequence of an organism is theoretically cleaved and translated into a series of virtual proteins, and each virtual protein is then subjected to a theoretical enzymatic digestion. In parallel, standard proteomic sample preparation methods are used to separate, array, and digest the proteins expressed in a sample. The masses of the resulting peptides are measured by mass spectrometry and compared with the theoretical peptide masses of the virtual proteins using peptide mass fingerprinting. The region of the genome coding for a particular protein, and thus the sequence of the protein, can then be identified. The method has a distinct advantage over existing methods for annotating genome sequences: no assumptions are made about the location of a protein in a particular gene sequence or about the positions of start and stop codons. Further, the method can be applied to entire assembled genomes as well as to unassembled sequences, including short or long contigs, and it can be used to identify novel genes with non-typical start codons. To illustrate the approach, all 773 proteins of Pseudomonas aeruginosa contained in SWISS-PROT were used to test the method theoretically and to optimise its parameters. Increasing the size of the virtual proteins improves the overall ability to detect the coding region, at the cost of decreased sensitivity for smaller proteins. Increasing the number of matching peptides required in the peptide mass fingerprinting search improves the ability to detect coding regions, as does increasing the signal-to-noise ratio of the simulated mass spectrum. Adjusting the error tolerance of the peptide mass fingerprinting search has little effect.
The method is also demonstrated on experimental data from Mycobacterium tuberculosis and is shown to work with eukaryotic organisms as well (e.g., Homo sapiens).
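
The workflow described above (cleave the genome, translate, digest in silico, match peptide masses) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: fixed-size, non-overlapping windows in all six reading frames; trypsin as the enzyme (cleavage after K or R, except before P); stop codons treated as peptide boundaries; neutral monoisotopic masses; and a simple count of matched masses as the score. All function and parameter names (virtual_proteins, rank_regions, window_nt, min_matches, tol) are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses (Da); one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def virtual_proteins(genome, window_nt=300):
    """Yield (strand, start, protein) for fixed-size windows in six frames.

    For the '-' strand, start is a coordinate on the reverse complement.
    No start or stop codon is required (assumed windowing scheme).
    """
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            for start in range(frame, len(seq) - window_nt + 1, window_nt):
                yield strand, start, translate(seq[start:start + window_nt])


def tryptic_peptides(protein, min_len=4):
    """In silico tryptic digest; stop codons ('*') act as peptide boundaries."""
    peptides = []
    for segment in protein.split("*"):
        for pep in re.split(r"(?<=[KR])(?!P)", segment):
            if len(pep) >= min_len and all(aa in RESIDUE_MASS for aa in pep):
                peptides.append(pep)
    return peptides


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, theoretical, tol=0.2):
    """Number of observed masses within tol (Da) of any theoretical mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def rank_regions(genome, observed, window_nt=300, min_matches=3, tol=0.2):
    """Return (matches, strand, start) for windows with enough matches."""
    hits = []
    for strand, start, protein in virtual_proteins(genome, window_nt):
        masses = [peptide_mass(p) for p in tryptic_peptides(protein)]
        n = count_matches(observed, masses, tol)
        if n >= min_matches:
            hits.append((n, strand, start))
    return sorted(hits, key=lambda h: -h[0])
```

In this sketch, window_nt corresponds to the size of the virtual proteins, min_matches to the number of matching peptides required, and tol to the error tolerance of the search, which are the parameters whose effects are discussed above. Measured spectra usually report protonated ions, so real peak lists would need converting to neutral masses before being passed to rank_regions.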