From Tag to Gene: Linking the Pieces in Sequence Analysis

Christian Iseli¹, C. Victor Jongeneel¹ and Philipp Bucher², Swiss Institute of Bioinformatics and ¹Office of Information Technology, Ludwig Institute for Cancer Research, ²Swiss Institute for Experimental Cancer Research, 1066 Epalinges, Switzerland

Synopsis

Current databases contain large amounts of sequence data, but much of it is in raw or poorly annotated form. Researchers often face the problem of gathering extensive information about a gene starting from a partial sequence (e.g. a SAGE tag, a peptide tag from an MS/MS instrument, or an EST-type sequence). While many tools exist to assist in this task, they have not usually been gathered and presented in the framework of a single, coherent strategy.

Goal

To provide participants with a set of clearly defined tools and recipes for interrelating sequence information obtained from ESTs, genomic sequences, and proteins, and teach them to use these tools to obtain a comprehensive picture of gene structure, function and polymorphism.

Content

The tutorial will be based on a series of practical examples and step-by-step instructions for answering the following questions, starting from an experimentally determined nucleotide or protein sequence:

Does this sequence come from a known, well-characterized gene? Can this gene be unambiguously identified?
If not, can the full sequence of the gene or the corresponding mRNA be deduced from existing database entries?
Can the intron/exon structure and coding region of the gene be reconstructed? How do we use a combination of gene prediction tools, EST data and database similarities to do this reconstruction?
Where is the gene located on the genome physical and genetic maps?
What methods are available to predict the function of the protein encoded by a predicted or reconstructed gene?
Can one find SNP-type polymorphisms in the gene or the surrounding area of the genome?
What data are publicly available about the expression patterns of the gene, from SAGE, cDNA library analysis, or DNA chip data?

Details

The tutorial will be based on practical problems faced by researchers using EST, SAGE or MS data to explore mammalian transcriptomes or proteomes. Examples will be taken from human or mouse sequences.

1. Gene identification. Methods will be described for matching short, potentially error-containing sequence tags to sequence databases. In particular, the following issues will be examined: (i) Can errors in the tags and/or in the databases be flagged and corrected by exploiting the redundancy of the data? Statistical approaches to the determination of tag accuracy and its effects on subsequent database searches will be presented. (ii) What methods are most appropriate for tag matching? Potential methods include BLAST and Smith-Waterman searches, pattern searches, and profile or HMM-based searches. The advantages and potential pitfalls of each method will be discussed. (iii) If the tag matches multiple entries in the databases, what criteria can be used to decide whether this reflects a redundancy in the databases or an ambiguity in the assignment of the tag to a specific gene? Common problems include the databasing of a single gene sequence in different forms (genomic, mRNA, EST, GSS), the presence in the databases of closely related gene families, the presence of pseudogenes, and true redundancy in the databases. (iv) To what extent are various efforts at creating unique gene indices useful in tag mapping? An overview of current gene indices and their value and limitations will be presented.
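
As a minimal illustration of the tag-matching problem, the sketch below scans a set of database entries for a short tag while tolerating a small number of substitutions. It is not one of the methods named in this section (BLAST, Smith-Waterman, pattern or profile searches), only a brute-force Hamming-distance scan. The function names, the toy entries and the `max_mismatches` threshold are assumptions made for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(tag, sequence, max_mismatches=1):
    """Return (position, mismatches) for every window of `sequence` that
    matches `tag` with at most `max_mismatches` substitutions."""
    tag = tag.upper()
    sequence = sequence.upper()
    hits = []
    for i in range(len(sequence) - len(tag) + 1):
        d = hamming(tag, sequence[i:i + len(tag)])
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


def match_tag(tag, database, max_mismatches=1):
    """Map each entry name to its hits; entries without a hit are omitted."""
    result = {}
    for name, seq in database.items():
        hits = find_tag(tag, seq, max_mismatches)
        if hits:
            result[name] = hits
    return result


# Toy entries (hypothetical, not real database records).
toy_db = {
    "entryA": "GGCATGACCTGATTCAGG",
    "entryB": "GGCATGACCTGTTTCAGG",  # differs from entryA at one base
    "entryC": "TTTTTTTTTTTTTTTTTT",
}
matches = match_tag("ACCTGATTCA", toy_db, max_mismatches=1)
print(matches)  # {'entryA': [(6, 0)], 'entryB': [(6, 1)]}
```

Here the tag hits two entries, one exactly and one with a single mismatch. Deciding whether this reflects a sequencing error, database redundancy or two related genes is the kind of ambiguity raised in point (iii).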

2. mRNA sequence reconstruction. In cases where tag hits do not map to a well-defined gene whose mRNA sequence and intron/exon structure have been experimentally verified, EST and genome sequence data can sometimes be used to reconstruct this information. Emphasis in this section will be on methods for clustering EST data, and for generating contigs from these data and associated trace files (when available). Contig assembly and verification tools, both automatic and manual, will be demonstrated and compared. The use of genome sequence data to complement EST sequences will also be demonstrated.
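
To make the idea of EST clustering concrete, the following sketch groups reads that share a minimum number of exact k-mers, using single-linkage clustering with a union-find structure. It is a deliberately simplified stand-in for real clustering and assembly tools: it ignores reverse-complement reads, sequencing errors and trace quality, and the parameters `k` and `min_shared` as well as the toy reads are assumptions chosen for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=12, min_shared=3):
    """Single-linkage clustering: two reads are joined when they share
    at least `min_shared` distinct k-mers (same strand only)."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Index: k-mer -> names of the reads that contain it.
    index = defaultdict(set)
    for n in names:
        for km in kmers(seqs[n].upper(), k):
            index[km].add(n)

    # Count shared k-mers for every pair of reads that co-occur in the index.
    shared = defaultdict(int)
    for members in index.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for n in names:
        clusters[find(n)].append(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy data: two overlapping fragments of one made-up transcript, plus an unrelated read.
transcript = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTT"
reads = {
    "est1": transcript[:25],
    "est2": transcript[15:],
    "est3": "TTTTTTTTTTCCCCCCCCCCAAAAAAAAAA",
}
clusters = cluster_by_shared_kmers(reads, k=8, min_shared=2)
print(clusters)  # [['est1', 'est2'], ['est3']]
```

Each resulting cluster would then be passed to a contig assembler; single-linkage grouping of this kind is prone to merging unrelated genes through repeats or chimeric reads, which is one reason verification of the contigs matters.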

3. Gene structure determination. In this section, we will examine methods for documenting the positions of exons in genome sequences. Specific problems include the generation of alignments between ESTs or EST contigs and genome sequences, the estimation of the accuracy of various gene prediction programs and the correction of their results with EST data, and the use of protein sequence similarity to provide an independent assessment of coding exon positions. The use of a variety of public domain software, including gene prediction programs (GENSCAN, GRAIL, the Solovyev program suite, etc.), profile alignment tools (especially pftools and HMMER), and Ewan Birney's Wise Tools, will be demonstrated.

4. Coding region detection and correction in EST sequences and contigs. In this section, we will discuss the theoretical underpinnings and the practical use of coding region detection algorithms. Emphasis will be on using our own coding region detection and correction program, ESTScan. The potential and limitations of various approaches will be discussed and illustrated.

5. Gene mapping. There are many sources of information for placing a sequence on the physical and genetic maps of a genome. We will examine the value and correlation of these various sources: database entry annotation, links to locus databases (e.g. LocusLink), presence of radiation hybrid tags, presence of polymorphic marker loci, and information related to physical maps (contig tiling). Examples will be given where ambiguities may arise, or where the precise map location may only be determined indirectly.

6. Protein function prediction. The prediction of protein function based on its sequence has been widely discussed in previous ISMB tutorials. We will give a rapid overview of the subject, and present a Web-based protein domain search and exploration environment developed at our Institute. The main emphasis will be on the use of this environment for protein function prediction.

7. SNP discovery. Because many human genes have been sequenced several times from different sources (cDNA and genomic clone libraries), the current sequence databases document in an implicit fashion many polymorphisms that exist between or within individuals. Methodologies will be presented to generate multiple alignments including genome sequences, full-length RNA sequences and ESTs, and to flag differences between them that may reflect true polymorphisms. The experimental verification of these candidate polymorphisms will also be examined.
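
The flagging step can be sketched as a column scan over an existing multiple alignment. The example below reports a column as a candidate polymorphism only when at least two different bases are each supported by a minimum number of sequences, so that a difference seen in a single read (more likely a sequencing error) is not reported. The input format, the function name and the `min_support` threshold are assumptions made for illustration; building the alignment itself is not shown.

```python
from collections import Counter


def candidate_snps(alignment, min_support=2):
    """alignment: dict mapping a name to an aligned sequence (all the same
    length, '-' for gaps, 'N' for unknown bases).
    Returns (column, base counts) for every column in which at least two
    different bases are each seen in `min_support` or more sequences."""
    rows = [s.upper() for s in alignment.values()]
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    candidates = []
    for col in range(length):
        # Gaps and ambiguous bases are ignored when counting.
        counts = Counter(r[col] for r in rows if r[col] in "ACGT")
        supported = [b for b, c in counts.items() if c >= min_support]
        if len(supported) >= 2:
            candidates.append((col, dict(counts)))
    return candidates


# Toy alignment (made-up sequences).
aln = {
    "genomic": "ACGTACGTAC",
    "mRNA":    "ACGTACGTAC",
    "est1":    "ACGTGCGTAC",
    "est2":    "ACGTGCGTAC",
    "est3":    "ACGAACGTAC",  # lone difference at column 3, not reported
    "est4":    "--GTACGTAC",
}
snps = candidate_snps(aln)
print(snps)  # [(4, {'A': 4, 'G': 2})]
```

Columns are 0-based. A count-based filter of this kind only proposes candidates; it cannot distinguish a true polymorphism from a recurrent artifact or a paralogous sequence, which is why experimental verification remains necessary.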

8. Expression analysis. An overview of Web-accessible data sources for transcriptome and proteome expression will be presented, as well as procedures for searching them for the expression patterns of a specific gene. We will specifically address the very practical problem of matching expression data for the same uncharacterized gene from diverse data sets in which genes are identified in different ways (e.g. by SAGE tags or database accession numbers). Difficulties in comparing data obtained using different methodologies (cDNA library representation, cDNA chip, Affymetrix-type chip, SAGE, 2D gel) will be highlighted.
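
One way to relate SAGE-based data to accession-based data is to derive the expected tag from each transcript sequence and index transcripts by tag. The sketch below assumes the classic SAGE layout, in which the tag is the 10 bases immediately 3' of the 3'-most CATG (NlaIII) site; the function names, the `tag_length` default and the accession-like identifiers are hypothetical and used only for the example.

```python
from collections import defaultdict


def sage_tag(mrna, anchor="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchor site, or None if the
    sequence has no anchor site or the site is too close to the 3' end."""
    seq = mrna.upper().replace("U", "T")
    pos = seq.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def build_tag_index(transcripts):
    """Map each predicted tag to the list of transcript identifiers that yield it."""
    index = defaultdict(list)
    for name, seq in transcripts.items():
        tag = sage_tag(seq)
        if tag is not None:
            index[tag].append(name)
    return dict(index)


# Made-up transcripts with hypothetical identifiers.
transcripts = {
    "HYP0001": "GGGCATGAAACCCGGGTTTACGCATGTTGACCAGTCAAAAAAAA",
    "HYP0002": "TTTCATGTTGACCAGTCGGAAAA",
    "HYP0003": "GGGGCCCCAAAA",  # no CATG site, so no tag
}
tag_index = build_tag_index(transcripts)
print(tag_index)  # {'TTGACCAGTC': ['HYP0001', 'HYP0002']}
```

In this toy case two transcripts yield the same tag, so a SAGE count for that tag cannot be assigned to one of them without further evidence. Incomplete 3' ends in database entries will also produce wrong predicted tags, which the sketch does not attempt to detect.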

The program of this tutorial is an ambitious one, and unfortunately the time allotted will not permit a detailed exploration of all of the issues raised above. It is the organizers' hope that participants will come away with enough information to point them in the right direction and make them aware of problems and possibilities.