An Effective Query Method for DNA Sequence

Jiyuan An1, Yi-Ping Phoebe Chen2
1j.an@qut.edu.au, Queensland University of Technology; 2p.chen@qut.edu.au, Queensland University of Technology

In the field of Bio-informatics, similarity query is importance algorithm to analyse biological data sequences. A DNA sequence is represented by a string of four alphabets: A (Adenine) C (Cytosine) T (Thymine) G (Guanine). It is primary used to find a sequence in a database that parts of it are similar to a query sequence. There are many similarity search algorithms, such as BLASTN, are based on dynamic programming (DP). It is difficult to compare with a whole database of millions of sequences. Practically, heuristic approach is used to approximate a proper edit distance. In this paper, we use the time warping distance to measure dissimilarity between two DNA sequences. Time warping distance does not be influenced by the distortions of sequence and reflects the property of sequence well. For searing similar DNA sequences from massive database, we use 2 bits to represent an alphabet of DNA sequence. The sub-sequence which consists of the same alphabet is reduced to one alphabet. For example, ACCTTTGC -> 00 01 10 11 01 (10 bits). We can use only 10 bits to represent above DNA sequence. Moreover, when the number of repeat alphabet increases, we have more compact structure to represent a DNA sequence. Fortunately we can find that reducing the repeat alphabet will not effect the time warping distance. Although the compute cost is added to cope with bit operation, because of memory-based query, the number of page access is gotten. In conclusion, we introduced time warping distance to measure dissimilarity between two DNA sequences. We proposed an effective query method by coding DNA sequence. Because the query process is on memory, the time to access data from storage is saved.