Statistical Analysis of Arabidopsis T-DNA Flanking Sequences

Hyung Seok Choi1
1gnie@lycos.co.kr, Seoul National University

Agrobacterium-mediated T-DNA transformation of plant genomes has been instrumental in tagging and subsequently isolating genes important in the plant life cycle. The underlying hypothesis of using T-DNA in plant functional genomics is that T-DNA insertion occurs randomly in plant genomes. To test this hypothesis, we analyzed the T-DNA flanking sequences isolated by, and available from, the Salk Institute Genomic Analysis Laboratory (SIGnAL) database. We examined data for 29,084 Arabidopsis genes, comprising insertion information from approximately 120,000 T-DNA insertion lines together with the functional annotations of the genes. This showed that 70% of the 29,084 genes have at least one T-DNA insert, whereas 8,760 genes remain without an insert. This is quite surprising, because statistically the 120,000 lines should cover more than 95% of the genes with an insertion. This result led us to analyze further whether T-DNA insertion events truly take place randomly in the genome. The Arabidopsis genome consists of 56.8 Mb (48.4%) of genic regions and 60.5 Mb (51.6%) of intergenic regions. Mapping the T-DNA insertion lines to the genome showed that 55% of the 120,000 T-DNA lines landed in genic regions and 45% in intergenic regions. Nonparametric correlation tests indeed indicated that T-DNA insertion events occurred preferentially in the genic regions. Interestingly, the gene At2g25610, annotated as a putative vacuolar ATP synthase proteolipid subunit, carries as many as 622 T-DNA inserts, again suggesting a preferred T-DNA insertion region in the genome. To elucidate further whether T-DNA insertion events left certain regions of the genome untouched because of an innate biological property of the Agrobacterium-mediated T-DNA transformation method, we are functionally and bioinformatically cataloging the genes with no T-DNA insert and comparing this catalog with that of all Arabidopsis genes. Any differences between the two catalogs may show whether the T-DNA tagging method fails to insert into certain classes of genes.
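
The genic versus intergenic comparison can be illustrated with a rough calculation. The sketch below is not the nonparametric correlation test used in this work; it substitutes a simple one-sample z-test on a proportion. It assumes that each of the approximately 120,000 lines contributes one independent insertion and that, under random insertion, the chance of landing in a genic region equals the genic share of the genome. The function name genic_bias_z is hypothetical, and the inputs are the rounded figures quoted in the abstract.

```python
import math

# Rounded figures quoted in the abstract.
GENIC_MB = 56.8                  # genic portion of the genome (Mb)
INTERGENIC_MB = 60.5             # intergenic portion of the genome (Mb)
N_LINES = 120_000                # approximate number of T-DNA lines
OBSERVED_GENIC_FRACTION = 0.55   # fraction of lines mapped to genic regions


def genic_bias_z(genic_mb, intergenic_mb, n_lines, observed_fraction):
    """Return (expected genic fraction, z statistic).

    Assumption: under random insertion, each line lands in a genic
    region with probability equal to the genic share of the genome,
    and lines are independent of one another.
    """
    expected = genic_mb / (genic_mb + intergenic_mb)
    # Standard error of a proportion under the null hypothesis.
    standard_error = math.sqrt(expected * (1.0 - expected) / n_lines)
    z = (observed_fraction - expected) / standard_error
    return expected, z


expected, z = genic_bias_z(
    GENIC_MB, INTERGENIC_MB, N_LINES, OBSERVED_GENIC_FRACTION
)

# Two-sided p-value from the normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(f"expected genic fraction: {expected:.3f}")
print(f"observed genic fraction: {OBSERVED_GENIC_FRACTION:.3f}")
print(f"z statistic: {z:.1f}")
print(f"two-sided p-value: {p_value:.3g}")
```

Under these assumptions, the observed 55% lies far above the length-proportional expectation of about 48.4%, which is in line with the preference for genic regions reported above. Because the line count and percentages are rounded, the output should be read as an order-of-magnitude check only.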