cSAGE and the Serial Analysis of Gene Expression in Arabidopsis thaliana

Christopher T Lewis1, Stephen Robinson2, Tony Kusalik, Isobel AP Parkin
1LewisCT@agr.gc.ca, Agriculture and Agri-food Canada; 2RobinsonS@agr.gc.ca, Agriculture and Agri-food Canada

The Serial Analysis of Gene Expression (SAGE) is based on the premise that a short signature sequence derived from the 3’ UTR of a transcript is sufficient to uniquely identify a gene within a sequenced organism. The protocol employs a series of restriction digests and ligations to acquire the specific fragments (SAGE tags). The efficiency of the protocol is enhanced because pairs of SAGE tags are ligated to form ditags, which are amplified and concatenated into ditag chains before they are cloned and sequenced. The resulting sequence reads contain 400-750 bases consisting of 16-28 ditags separated by the anchoring enzyme's recognition sequence (e.g. 'CATG' for NlaIII). Valid ditags are extracted from each sequence read and the frequency of individual SAGE tags within the library is determined, which provides an accurate quantitative estimate of the transcriptome. Valid ditags have a defined length (24-26 bases) and may not contain identical tags. Duplicate ditags may be formed legitimately from highly expressed genes, but they are excluded from further analysis because they may also result from biased PCR amplification. Errors within the analysis might occur due to infidelity of DNA replication during the PCR or sequencing reactions. cSAGE provides an efficient mechanism for extracting SAGE tags and matching them with virtual SAGE tags derived from DNA sequence databases. SAGE tags are extracted from the sequence reads in linear time using a state machine and stored in a 5-ary tree whose nodes represent the bases {A,C,T,G,N}. This tree enables rapid detection of duplicate ditags and efficient tag-to-gene matching. Virtual SAGE tags extracted from the DNA databases are used to search the tree for matches. Known vector and linker tags can be excluded from analysis by placing them in an "exclude" file. Sequence reads may be in either FASTA or PHD format (output from Phred), and DNA database sequences must be in FASTA format.
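The extraction and storage steps described above can be illustrated with a minimal Python sketch. It is not the cSAGE source (which is written in C), and several details are assumptions rather than statements from this abstract: a tag length of 10 bases, a ditag length of 24-26 bases that includes one copy of the CATG site (so 20-22 bases between sites), keeping the first copy of a duplicated ditag, and treating a ditag and its reverse complement as the same ditag. All function and class names are hypothetical.

```python
ANCHOR = "CATG"            # NlaIII recognition site
TAG_LEN = 10               # assumption: tag length is not stated in the abstract
MIN_BODY, MAX_BODY = 20, 22  # assumption: 24-26 bases minus one CATG site
BASES = "ACTGN"
INDEX = {b: i for i, b in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a sequence over A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def split_on_anchor(read):
    """Split a read at every anchor site in one left-to-right pass.

    `state` is the number of anchor characters matched so far. The only
    fallback CATG needs after a mismatch is to restart on a 'C'.
    """
    segments, state, start = [], 0, 0
    for i, base in enumerate(read):
        if base == ANCHOR[state]:
            state += 1
        else:
            state = 1 if base == ANCHOR[0] else 0
        if state == len(ANCHOR):
            segments.append(read[start:i - len(ANCHOR) + 1])
            start, state = i + 1, 0
    segments.append(read[start:])
    return segments


class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * len(BASES)  # one slot per base: 5-ary
        self.count = 0


class Trie:
    """5-ary tree keyed on bases; the count lives on the final node."""

    def __init__(self):
        self.root = Node()

    def add(self, seq):
        """Insert seq and return how many times it has now been seen."""
        node = self.root
        for base in seq:
            i = INDEX[base]
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.count += 1
        return node.count

    def count(self, seq):
        """Return the stored count for seq, or 0 if it is absent."""
        node = self.root
        for base in seq:
            node = node.children[INDEX[base]]
            if node is None:
                return 0
        return node.count


def count_tags(reads):
    """Return a Trie of tag counts taken from valid, non-duplicate ditags."""
    ditags, tags = Trie(), Trie()
    for read in reads:
        # Only segments flanked by an anchor site on both sides are candidates.
        for body in split_on_anchor(read.upper())[1:-1]:
            if not MIN_BODY <= len(body) <= MAX_BODY:
                continue
            left, right = body[:TAG_LEN], revcomp(body[-TAG_LEN:])
            if left == right:
                continue  # a valid ditag may not contain identical tags
            if ditags.add(min(body, revcomp(body))) > 1:
                continue  # duplicate ditag: keep the first copy only
            tags.add(left)
            tags.add(right)
    return tags
```

Matching a virtual tag derived from a database sequence is then a single walk down the tree, for example `count_tags(reads).count("AAAAAAAAAA")`, which costs time proportional to the tag length regardless of library size. The sketch assumes reads contain only the characters A, C, G, T and N.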
PHD-format sequence reads allow the experimental SAGE tags to be screened for sequence quality prior to analysis. Highlights from a cold acclimation experiment in A. thaliana using the SAGE protocol and cSAGE include 92,290 valid ditags containing 184,580 SAGE tags, of which 146,178 had an average Phred quality greater than 20. Removal of polyA and linker tags provided a final set of 145,170 tags. This set contained 29,663 unique tags (20.4% of the set), of which 16,664 (11.5% of the set) were present as singletons. Of the unique tags, 89% matched a gene: 46% matched the canonical (3' most) recognition site and 43% matched a non-canonical site. Non-canonical matches can be explained in four ways: incomplete digestion of the mRNA, alternative splicing of the gene, misannotation of the gene, or anti-sense transcription of the gene. Alternative splicing has been confirmed for a small number of the non-canonical matches. cSAGE is an open-source, freely available application written in C. It is intended to fit into a larger analysis pipeline; for instance, a Perl script is used to compare two cSAGE reports and display tags with a significant change in expression. A modular design facilitates the extension of cSAGE for new applications. For more information on cSAGE see http://homepage.usask.ca/~ctl271/csage.
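The quality screen applied to PHD-format reads can be sketched as follows. This is an illustrative Python example, not the cSAGE code: it assumes the usual Phred PHD layout in which each line between BEGIN_DNA and END_DNA holds a base call, its quality value and a trace position, and it assumes a tag passes when the mean quality of its bases is strictly greater than the threshold of 20 quoted above. The function names are hypothetical.

```python
def read_phd(text):
    """Return (sequence, qualities) from the DNA block of a PHD file.

    Assumes each line between BEGIN_DNA and END_DNA starts with
    "<base> <quality>"; any further fields are ignored.
    """
    bases, quals = [], []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()
            bases.append(fields[0].upper())
            quals.append(int(fields[1]))
    return "".join(bases), quals


def tag_passes(quals, start, length, threshold=20):
    """True if the tag's bases have a mean quality above the threshold."""
    window = quals[start:start + length]
    if len(window) < length:
        return False  # tag runs off the end of the read
    return sum(window) / length > threshold
```

A caller would locate each tag in the read first and then pass its start offset and length to `tag_passes`, discarding tags for which it returns `False`.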