Frequency enumeration of DNA subsequences from large-scale sequences using linear codes

Yoichi Takenaka1, Hideo Matsuda2
1takenaka@ist.osaka-u.ac.jp, Department of Bioinformatic Engineering, Osaka Univ. Japan; 2matsuda@ist.osaka-u.ac.jp, Department of Bioinformatic Engineering, Osaka Univ. Japan

Frequency enumeration of the DNA subsequences is one of the classical but the important techniques in probe design, finding unique and repeat subsequences, and so on. The enumeration algorithms are simple, but they take enormous memory space when the DNA sequences become large and consequently the length of the subsequence becomes large number. To cope with the memory issue, many approximation methods, such as randomized algorithms, stochastic methods and a method based on the Shannon entropy are proposed. These methods can be useful to design probes and to find several unique sequences. But they are unfit for gaining the comprehensive property of the whole DNA sequences, such as repeat sequences.
We propose an enumerate method that uses linear codes to reduce the memory space. Firstly, our method maps each subsequence to the representative subsequence which is the same as or a single nucleotide mutation of the original subsequence by use of a linear code. Then it enumerates the frequency of the representative subsequences.
In the presentation, we prove the decoding process of the linear and perfect codes is equivalent to the calculation of the representative subsequences. Then we show the method to calculate representative subsequences with the check matrix and the syndrome of the 2-ary (31, 26) Hamming code, and that the method requires about 1/60 of the usual memory space.