ScriptSure: A Non-Redundant View of the Human Transcriptome

Jarret Glasscock1, Warren Gish2
1jglassco@sapiens.wustl.edu, Washington University; 2gish@watson.wustl.edu, Washington University

The goal of the ScriptSure project is to create a database that gives an accurate, comprehensive representation of the human transcriptome. Such a resource will aid in the identification and characterization of genes and gene features, and will help delineate untranslated regions (UTRs). Current approaches to searching transcript data suffer from the redundancy present in today's EST databases. In addition, EST clustering procedures that use transcript data alone fail to take advantage of the information inherent in the genomic sequence. As a result, chimeric transcripts bring unrelated genes together into one cluster, and paralogues collapse into a single cluster. Conversely, transcripts that do belong together according to the genomic sequence (proximity, splice sites, etc.) are left unclustered. Searching the genomic sequence directly, on the other hand, is too expensive to do in depth on a routine basis and gives rise to false positives. Together, these limitations point to the need for a new resource. ScriptSure makes it easier to retrieve the transcript data for a given region, provides a non-redundant representation of that data, provides high-quality (genomic) sequence, and alleviates the problems associated with chimeras and paralogues. The fundamental technique behind ScriptSure is stringent, anchored alignment of the transcript data to the genomic data, followed by extraction of the genomic segment associated with each transcript. Each such segment constitutes an entry in the ScriptSure database. Current work on ScriptSure focuses on characterizing features of the transcriptome. Additional levels of abstraction from the data are also being investigated. http://sapiens.wustl.edu/ScriptSure
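The align-then-extract idea described above can be illustrated with a minimal sketch. This is not the ScriptSure implementation: the Alignment record, the extract_entries function, and the 0.98 identity threshold are all hypothetical, and the alignments are assumed to have been produced already by some external aligner. The sketch keeps only stringent alignments, reads the sequence of each entry from the genome rather than from the EST, and merges transcripts that map to the same genomic blocks into one entry.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one transcript-to-genome alignment."""
    transcript_id: str
    chrom: str
    strand: str                          # "+" or "-"
    blocks: Tuple[Tuple[int, int], ...]  # 0-based, half-open genomic intervals
    identity: float                      # fraction of identical aligned bases


def extract_entries(
    alignments: Iterable[Alignment],
    genome: Dict[str, str],
    min_identity: float = 0.98,          # assumed stringency cutoff
) -> Dict[tuple, dict]:
    """Build non-redundant entries keyed by genomic location."""
    entries: Dict[tuple, dict] = {}
    for aln in alignments:
        # Stringency filter: weak matches (e.g. to a paralogue) are dropped.
        if aln.identity < min_identity:
            continue
        key = (aln.chrom, aln.strand, aln.blocks)
        if key not in entries:
            # The sequence comes from the genome, not from the transcript.
            # It is reported in forward genomic orientation for simplicity.
            seq = "".join(genome[aln.chrom][s:e] for s, e in aln.blocks)
            entries[key] = {"sequence": seq, "transcripts": []}
        # Transcripts sharing the same blocks collapse into one entry.
        entries[key]["transcripts"].append(aln.transcript_id)
    return entries
```

In this sketch, two ESTs aligned to identical genomic blocks yield a single entry listing both transcript identifiers, while an EST below the identity cutoff contributes nothing. A fuller treatment would also merge overlapping or partially matching block structures, which is deliberately left out here.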