Target selection for the custom oligonucleotide array by clustering experimentally determined and computationally predicted transcript sets in mouse

Serge Batalov1
1batalov@gnf.org, GNF

A custom oligonucleotide array design aims to interrogate the largest possible non-redundant set of transcripts within the physical limits of currently available array sizes and feature densities. The target set should balance the best attempt at including "novel" transcripts (i.e. those computationally predicted from a draft genomic sequence) with the mandatory presence of functionally characterized, known transcripts. Generating this target set is an important first stage of a successful array design; effective feature (probe) selection is the next stage. The same task arises when assembling a set of cDNA clones for a large-scale screen, where the physical size constraint is replaced by the cost associated with the collection size.

To identify a non-redundant target set of publicly available and proprietary sequences, we surveyed the following sources. Experimentally defined: MGC (~6955 clone sequences), RIKEN (FANTOM 1+2, 60770 sequences), RefSeq (~11461), UniGene with mRNA annotation (~17723). Computationally predicted: Ensembl (27923), Celera (46250), RIKEN Representative set (RTPS, 36830). The RefSeq and MGC sequences are almost entirely present within the UniGene set. The other sets overlapped to varying degrees (a Venn diagram will be presented at the poster).

Any clustering is only as good as its distance metric. BLAST is the obvious candidate for producing all of the pairwise distances; anything else is too slow, since a compendium of N = 200,000 sequences requires N(N-1)/2, or about 20 billion, pairwise sequence comparisons. However, BLAST produces local alignments, and a simple combination of BLAST high-scoring pairs (HSPs) is a poor substitute for a global (full-length) sequence alignment. BLAST also slows down in the presence of simple and interspersed repeats: most of the time is then spent in the second (extension) stage, which does not run in parallel, so the overall time is dominated by the most repeat-rich sequence chunk.
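The comparison count quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `pairwise_comparisons` is introduced here for illustration and is not part of any tool mentioned in this abstract.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_sequences = 200_000  # compendium size quoted in the text
total_pairs = pairwise_comparisons(n_sequences)

# Prints: 19,999,900,000 pairs (~20 billion)
print(f"{total_pairs:,} pairs (~{total_pairs / 1e9:.0f} billion)")
```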

Faster software or hardware-accelerated versions of BLAST (MegaBLAST, TimeLogic's TeraBLAST, or Paracel's GigaBLAST) may help to cut the time by an order of magnitude.

A commonly used post-processing tool for sequence pairs preliminarily identified by BLAST is sim4 [Florea et al. 1998], which was written to minimize the manual editing previously required when aligning cDNA and genomic sequences. sim4 also does very well at aligning homologous sequences, which is somewhat outside its intended scope [Florea et al. 1998]. BLAT [Kent 2002] is a newer client-server application that extends the same approach. Both tools build on a BLAST-like first-stage search, but BLAT is more oriented towards massively parallel searches because it keeps an index of all non-overlapping words in the database. Just as sim4 does a fair job of aligning alternative transcripts, the database for BLAT can be a collection of all possible transcripts instead of genomic contigs. BLAT has been shown to define candidate exon boundaries more precisely than sim4 [Kent 2002], is faster than sim4, and does not require a BLAST precomputing stage.
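The idea of indexing non-overlapping database words can be illustrated with a toy Python sketch. It is not BLAT's implementation: the function names, the word length `k`, and the in-memory dictionary are all assumptions made for illustration, and the extension and alignment stages are omitted entirely.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Index the non-overlapping k-letter words of every database sequence.

    database: dict mapping sequence name -> sequence string (hypothetical layout).
    Returns a dict mapping word -> list of (sequence name, database position).
    """
    index = defaultdict(list)
    for name, seq in database.items():
        # Step by k so that indexed words do not overlap.
        for pos in range(0, len(seq) - k + 1, k):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_hits(query, index, k):
    """Scan every overlapping k-letter word of the query against the index.

    Returns a list of (sequence name, database position, query position).
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((name, dpos, qpos))
    return hits


# Toy usage with a deliberately short word length.
transcripts = {"t1": "ACGTACGTAAGG"}
word_index = build_word_index(transcripts, k=4)
print(find_seed_hits("TTACGTAA", word_index, k=4))
```

Because only non-overlapping database words are stored, the index stays roughly k times smaller than one built over every position, while the query is still scanned at every offset so that no word match is missed.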

After the pairwise scores were determined with the BLAT engine, single-linkage clustering was performed using a set of overlap rules that were optimized and validated by inspecting the largest non-artifact clusters and the representative clusters for which careful inspection is available. If L1 and other repeats are not dealt with before the alignment stage, the largest clusters reach a thousand members, which is clearly an artifact. Once those repeats are masked with RepeatMasker and the clustering criteria are tuned, the largest clusters typically have 90 members or fewer. These contained previously known multilocus genes (e.g. Gapd, RpL21, RpL29, Hmgb1, Rps2, RpL7a), as well as low-copy repeats, which were subsequently masked.
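Single-link clustering over accepted sequence pairs amounts to finding connected components, which a union-find structure handles compactly. The Python sketch below assumes the pairs have already passed the overlap rules; the rule shown in `passes_overlap_rule`, including its identity and coverage thresholds, is purely hypothetical, since the actual rules are not given in this abstract.

```python
def passes_overlap_rule(aligned_len, len_a, len_b, identity,
                        min_identity=0.95, min_coverage=0.5):
    """Hypothetical overlap rule: thresholds are illustrative assumptions."""
    coverage = aligned_len / min(len_a, len_b)
    return identity >= min_identity and coverage >= min_coverage


def single_linkage_clusters(ids, linked_pairs):
    """Group ids into clusters; two ids share a cluster if any chain of
    accepted pairs connects them (single linkage)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, since those are the ones worth inspecting.
    return sorted(clusters.values(), key=len, reverse=True)


# Toy usage: a-b and b-c are linked, so a, b and c end up together.
print(single_linkage_clusters(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
))
```

The chaining behaviour visible here (a and c are joined only through b) is also why a single unmasked repeat shared by many sequences can merge unrelated transcripts into one oversized cluster.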

The resulting 66,024 clusters were further triaged to a non-redundant set of 40,000 targets with the highest confidence of computational prediction. Additional considerations included a "normal" pattern of genomic splicing (single-exon computational predictions were disfavored) and functional inference (e.g. from InterPro analysis), among others.
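One way to picture such a triage is as a ranking followed by a cut-off. The Python sketch below is an assumption-laden illustration, not the procedure actually used: the `Cluster` fields, the strict priority order (experimental support, then multi-exon structure, then functional inference), and the function name `triage` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: str
    has_experimental_member: bool   # e.g. contains a known, characterized transcript
    exon_count: int                 # exons in the representative genomic alignment
    has_functional_inference: bool  # e.g. a protein-domain hit


def triage(clusters, limit):
    """Rank clusters by a hypothetical priority tuple and keep the top `limit`."""
    def priority(c):
        # Tuples compare left to right, so earlier criteria dominate later ones.
        return (c.has_experimental_member,
                c.exon_count > 1,
                c.has_functional_inference)

    ranked = sorted(clusters, key=priority, reverse=True)
    return ranked[:limit]


# Toy usage: keep the three best of four clusters.
candidates = [
    Cluster("c1", False, 1, False),
    Cluster("c2", True, 1, False),
    Cluster("c3", False, 3, True),
    Cluster("c4", False, 3, False),
]
print([c.cluster_id for c in triage(candidates, limit=3)])
```

In this toy case the single-exon, unannotated prediction `c1` is the one dropped, which mirrors the stated preference against single-exon computational predictions.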

The final set, along with the cluster association information, was submitted to the Affymetrix design pipeline. The custom chip was subsequently used extensively to profile expression in 70+ different tissue samples and cell lines, and the results will be made available to the public shortly.