Poster Title: Linguistic profiling of genome sequences. The Sequence identifier position end set cardinal: an estimate of linear sequence complexity: algorithms and genome profiling applications.

Christophe Lefevre1, Victorian Bioinformatics Consortium

The complexity of a symbol sequence has been defined by several methods: algorithmic complexity, Shannon entropy, complexity vector and, linguistic complexity. The complexity of a genome sequence is traditionally viewed as the amount of uniqueness in the sequence (eg. in DNA renaturation kinetic experiments). Genomic regions with high number of repetitions are often called low complexity zones. A number of complexity measures have been applied to genome sequences and shown to be useful to map regularities and irregularities on different scales that may help distinguish functional or regulatory regions. Here, the sequence identifier end set cardinal is proposed as a new estimator of linear sequence linguistic complexity. A linear time scanning window algorithm to compute this value in a scanning window profile is presented and results obtained with genomic sequences are discussed. Following a linguistic approach based on previous work on word recognition in biological sequences (Lefevre & Ikeda, 1993), we consider the word composition of a sequence. Several sub-word classes can be identified: the set of repeats R, the set of non repeat U, the set of position identifiers (PIS, defintion 1) and, the position identifier end set (PIES, defintion 2). - defintion 1: PIS ={ word of length k = w(k) = a(1)..a(k) ? U such that w(k-1) = a(1)..a(k-1) ? R}. - defintion 2: PIES = { w ? PIS such that no sub word is a proper suffix of another}. The number of position identifiers in any sequences is always equal to the length of the sequence. The content of the other word sets varies with the amount of regularities (e.g. repetitions) in the sequence. In particular, the number of words contained in the position identifier end set grows between 0 and the length of the sequence as less regularities are present in the sequence. This provides a simple linguistic measure of linear sequence complexity. Using a position end set tree data structure described previously , an algorithm was implemented to draw a profile of the number of words in each category over a variable window size using artificial (simple, complex, mixed or randomly generated sequences) and natural genomic sequences. Such profiles capture variations and irregularities in local linguistic complexity at different scales. Examples will be presented and the interpretation of complexity profiles for the study of genome composition and evolution will be discussed. reference: Christophe Lefèvre & Joh-E Ikeda. "A small automaton for word recognition in DNA sequences and its application to consensus ana;ysis of regulatory elements in DNA regions controlling gene expression". First International Conference on Intelligent Systems for Molecular Biology (ISMB I), Washington DC, USA, 1993.