GeneAnnot: Annotation of high-density oligonucleotide arrays and their linking with GeneCards.

Vered Chalifa-Caspi1, Itai Yanai2, Ron Ophir, Michael Shmoish, Hila Benjamin-Rodrig, Naomi Rosen, Pavel Kats, Marilyn Safran, Orit Shmueli and Doron Lancet.
1vered.caspi@weizmann.ac.il, Weizmann Institute of Science; 2Iyanai@wisemail.weizmann.ac.il, Weizmann Institute of Science

Affymetrix GeneChip® expression array sets are designed to contain representations of the entire gene complement of an organism (Liu et al. 2003). Many of the probe sets in these arrays are derived from ESTs, which are not always reliable indicators of mRNA identity. Even for probe sets derived from known mRNA sequences, specificity is not always guaranteed. The GeneAnnot software tool strives to revise and improve the annotation of these arrays; to provide a qualitative assessment of the various probe sets; and to link the probe sets (in the human arrays) to GeneCards, a tool that integrates gene-related information from a wide range of databases (Safran et al. 2002, 2003). In conjunction with experimental expression data, this effort significantly increases the ability to functionally decipher genes belonging to the "terra incognita" of an organism's genome, i.e. the genes for which little information is available. Many GeneAnnot concepts are also applicable to spotted array data.

The GeneAnnot program was initially applied to the human HG-U95A-E array set, comprising 62,839 probe sets. Data were analyzed and stored in a MySQL relational database. Using the Blat software (Kent 2002), all probe sequences (16 25-mer probes per probe set) were compared to the major sets of publicly available human mRNA sequences: RefSeq (NCBI), Ensembl (EBI/Sanger) and all human mRNA sequences from GenBank's "primate" division. All cases in which a probe sequence was identical to an mRNA sequence, or differed from it by no more than one mismatch (substitution, deletion or insertion of one nucleotide), and in which the alignment was in the same orientation, were stored in the database. Whenever possible, we identified the mRNAs as belonging to GeneCards genes (having a GeneCards ID and symbol). The association between mRNAs and GeneCards genes was made using data and algorithms from GeneLoc (Rosen et al. 2003), utilizing both ID associations and genomic positional information. Each probe-set-to-gene pairing received a score indicating the sensitivity and specificity of the relation, and GeneCards annotation was assigned only if the score was above a certain cutoff. When the gene to which the mRNA belonged was not known, the probe set was annotated with the mRNA GenBank accession number itself. Probe sets with no mRNA match were annotated based on their corresponding UniGene cluster ID (namely, the cluster that contains the EST accession from which the probe set was derived; only clusters with a descriptive title were considered). The remaining probe sets received Affymetrix annotation (at least an EST ID and title). The GeneAnnot program is currently also being applied to the newer human HG-U133 set and to the mouse MG-U74 set.
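
The annotation fallback order described above can be sketched in Python as follows. This is only an illustration and not the GeneAnnot implementation: the ProbeSetEvidence record, its field names, the annotate function and the cutoff value of 0.5 are all hypothetical, and the handling of a known gene whose score falls below the cutoff (falling back to the mRNA accession) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProbeSetEvidence:
    """Hypothetical summary of the evidence collected for one probe set."""
    gene_symbol: Optional[str] = None      # GeneCards symbol of the matched gene, if known
    gene_score: float = 0.0                # score of the probe-set-to-gene pairing
    mrna_accession: Optional[str] = None   # GenBank accession of a matched mRNA, if any
    unigene_cluster: Optional[str] = None  # UniGene cluster holding the source EST
    unigene_title: Optional[str] = None    # descriptive cluster title, if the cluster has one
    affymetrix_annotation: str = ""        # original Affymetrix annotation (EST ID and title)


def annotate(ev: ProbeSetEvidence, score_cutoff: float = 0.5) -> Tuple[str, str]:
    """Return (annotation source, annotation) following the fallback order in the text.

    score_cutoff is a placeholder; the actual cutoff is not given in the abstract.
    """
    # 1. GeneCards gene, only if the pairing score is above the cutoff.
    if ev.gene_symbol is not None and ev.gene_score > score_cutoff:
        return ("GeneCards", ev.gene_symbol)
    # 2. mRNA match without a usable gene: annotate with the accession itself.
    #    (Assumption: a low-scoring gene pairing also ends up here.)
    if ev.mrna_accession is not None:
        return ("mRNA", ev.mrna_accession)
    # 3. No mRNA match: UniGene cluster, only if it has a descriptive title.
    if ev.unigene_cluster is not None and ev.unigene_title:
        return ("UniGene", ev.unigene_cluster)
    # 4. Otherwise keep the Affymetrix annotation.
    return ("Affymetrix", ev.affymetrix_annotation)
```

For example, a record carrying only a UniGene cluster without a title would fall through to the Affymetrix annotation, matching the restriction to clusters with a descriptive title.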

GeneAnnot results were integrated with our GeneNote (Gene Normal Tissue Expression) database, an in-house experimental profiling of human gene expression using the Affymetrix HG-U95A-E array set. GeneAnnot enables gene-based clustering and analysis of GeneNote data. In addition, a tissue vector for each gene was built by weighted averaging of the intensities of the various probe sets that match this gene, based on their sensitivity and specificity scores. The GeneNote web site merges the annotation and expression data for each gene, and enables searches by different gene and sequence attributes, including their IDs in the various databases that were used during construction of GeneAnnot. The GeneNote web site is linked to our GeneCards and GeneLoc (UDB) databases, enabling one to present combined GeneNote and GeneAnnot results in a summarized form, and to search the GeneNote site via the GeneCards and GeneLoc search engines.
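
The weighted averaging used to build a gene's tissue vector might look like the minimal Python sketch below. The abstract does not state how the sensitivity and specificity scores are combined into a weight, so the product used here is purely an assumption, and the function name gene_tissue_vector and its input layout are hypothetical.

```python
from typing import List, Sequence, Tuple

# One entry per probe set matching the gene:
# (intensities across tissues, sensitivity score, specificity score)
ProbeSetProfile = Tuple[Sequence[float], float, float]


def gene_tissue_vector(probe_sets: Sequence[ProbeSetProfile]) -> List[float]:
    """Weighted average of probe-set intensities, one value per tissue.

    Assumption: each probe set is weighted by sensitivity * specificity.
    """
    if not probe_sets:
        raise ValueError("at least one probe set is required")
    n_tissues = len(probe_sets[0][0])
    if any(len(intensities) != n_tissues for intensities, _, _ in probe_sets):
        raise ValueError("all probe sets must cover the same tissues")

    weights = [sens * spec for _, sens, spec in probe_sets]
    total = sum(weights)
    if total == 0:
        raise ValueError("all probe-set weights are zero")

    # For each tissue, sum the weighted intensities and normalize by total weight.
    return [
        sum(w * ps[0][t] for w, ps in zip(weights, probe_sets)) / total
        for t in range(n_tissues)
    ]
```

With this weighting, a probe set with low specificity contributes proportionally less to the gene's profile than a highly specific one, which is the general intent described above.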

References

Kent WJ, Genome Res. 2002 Apr;12(4):656-64.
Liu G et al., Nucleic Acids Res. 2003 Jan 1;31(1):82-6.
Rosen N. et al., Bioinformatics Vol. 19 Suppl. 1 2003 pp. i222-i224 (ISMB 2003, in Press).
Safran M. et al., Bioinformatics. 2002 Nov;18(11):1542-3.
Safran M. et al., Nucleic Acids Res. 2003 Jan 1;31(1):142-6.