In silico prediction of UTR repeats using clustered EST data

Stefan Rensing1, Daniel Lang2, Ralf Reski
1stefan.rensing@biologie.uni-freiburg.de, University of Freiburg, Plant Biotechnology; 2daniel.lang@biologie.uni-freiburg.de, University of Freiburg, Plant Biotechnology

Clustering of EST (expressed sequence tag) data is a method for the non-redundant representation of an organism's transcriptome. During pre-processing of the raw sequence data, contaminants such as vector or linker sequences and bacterial genes are removed (clipping). In the same step, it is essential to mask repetitive elements, because these sequence stretches would otherwise cause unrelated ESTs to be clustered together.
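The masking step can be illustrated with a minimal sketch. It assumes that the repeat library is a plain list of sequences and that exact string matching is sufficient; real pre-processing pipelines typically use alignment-based masking. The function name `mask_repeats` and the mask character `N` are hypothetical choices for illustration only.

```python
def mask_repeats(seq, repeats, mask_char="N"):
    """Replace every exact occurrence of each repeat in seq with mask_char.

    Simplifying assumption: exact, case-insensitive matching only;
    no mismatches or gaps are tolerated.
    """
    seq_upper = seq.upper()
    masked = list(seq_upper)
    for rep in repeats:
        rep = rep.upper()
        if not rep:
            continue  # skip empty entries in the repeat library
        start = seq_upper.find(rep)
        while start != -1:
            # Overwrite the matched stretch so it cannot seed a cluster.
            masked[start:start + len(rep)] = mask_char * len(rep)
            # Continue one base further to catch overlapping occurrences.
            start = seq_upper.find(rep, start + 1)
    return "".join(masked)
```

Masked stretches keep their length, so coordinates in the EST remain valid after masking.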
We present three approaches for the in silico detection of putative repetitive elements in untranslated regions of protein-encoding genes (UTR repeats). (I) The REPuter approach searches for direct repeats in singlet regions and then checks whether these repeats also occur within contigs. (II) The HASTE BLAST approach runs BLAST with the same parameters as the HASTE algorithm, which is used for the initial clustering, to identify regions of erroneous clustering. (III) Finally, the “tiresome” approach aims to find repeats missed by the first two approaches by identifying falsely clustered stretches using BLAST and manual inspection.
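The idea behind approach (I) can be sketched as follows. This is not REPuter, which is a dedicated tool with its own algorithm; it is a naive k-mer index that only finds exact direct repeats of a fixed length. The function names, the default `k=8`, and the dictionary of contigs are assumptions made for illustration.

```python
from collections import defaultdict


def find_direct_repeats(seq, k=8):
    """Return {kmer: [start positions]} for k-mers seen more than once.

    Only exact repeats on the same strand (direct repeats) are reported.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def occurs_in_contigs(repeat, contigs):
    """Return the names of contigs that contain the repeat as a substring.

    contigs is assumed to be a dict mapping contig name to sequence.
    """
    return [name for name, contig_seq in contigs.items() if repeat in contig_seq]
```

A candidate found in a singlet would then be checked against the contigs, for example `occurs_in_contigs("ACGTACGG", {"c1": "GGACGTACGGAA"})`.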
All three approaches yielded several putative UTR repeat sequences. Including these sequences in the pre-processing of the EST data allowed many repetitive regions to be masked. In addition, seven predicted repeats were tested in the wet lab for their presence in the genome and were detected with copy numbers between 5 and 17, confirming their repetitive nature.