Long-range correlation in protein sequences and its implication
Kazuhito Shida1, Makoto Ikeda2, Atsuo Kasuya
email@example.com, CIR Tohoku University; firstname.lastname@example.org, CIR Tohoku University
It is well known that there is a certain level of
long-range correlation in natural amino-acid sequences.
One example is the dipeptide substitution matrix proposed by Gonnet et al. (1994),
which assumes that
such correlations are at least partially preserved over
the process of evolution.
We would like to extend this idea to the regime of
long-range correlations.
We scanned the amino-acid (AA) sequences from GenBank and
observed weak correlations between
pairs of letters separated by more than one position.
In some cases, clear long-range patterns were found.
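As an illustration of the kind of scan described above, the following is a minimal sketch, not the procedure actually used in this work. It assumes that the correlation between two letters separated by a given distance is measured as the log-ratio of the observed pair count to the count expected if the two positions were independent. The function name gapped_pair_log_odds and the parameter gap are hypothetical.

```python
from collections import Counter
from math import log2

def gapped_pair_log_odds(sequences, gap):
    """Return {(a, b): log2(observed / expected)} for letters a and b
    separated by `gap` positions (gap=1 means adjacent letters).

    `expected` is the pair count predicted from the single-letter
    counts alone, i.e. assuming the two positions are independent.
    This measure is an assumption made for illustration."""
    pair_counts = Counter()
    left_counts = Counter()   # letter counts at the first position of a pair
    right_counts = Counter()  # letter counts at the second position of a pair
    for seq in sequences:
        for i in range(len(seq) - gap):
            a, b = seq[i], seq[i + gap]
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
            right_counts[b] += 1
    total = sum(pair_counts.values())
    log_odds = {}
    for (a, b), observed in pair_counts.items():
        expected = left_counts[a] * right_counts[b] / total
        log_odds[(a, b)] = log2(observed / expected)
    return log_odds
```

A value near zero means the pair occurs about as often as independence predicts; positive values mark pairs that are over-represented at that distance. Pairs that never occur are simply absent from the result.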
Such correlations may affect
the similarity assessment of distant homologues,
because that assessment is usually performed by
taking randomized sequences as the null hypothesis.
The distribution of the score
under the null hypothesis might change
when the correct correlation is introduced into
the randomized sequences.
The same effect is expected
for phylogenetic analysis.
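For reference, the conventional randomized-sequence null hypothesis mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: the similarity score is a simple count of identical aligned positions (not a real alignment score), and the names identity_score, shuffled_null_scores and empirical_p_value are hypothetical. Note that plain shuffling preserves composition but destroys every positional correlation, which is exactly the property questioned here.

```python
import random

def identity_score(a, b):
    """Toy similarity score (an assumption): number of aligned
    positions at which the two sequences carry the same letter."""
    return sum(x == y for x, y in zip(a, b))

def shuffled_null_scores(query, target, n_shuffles=1000, seed=0):
    """Score `query` against shuffled copies of `target`.

    Shuffling keeps the letter composition of `target` but removes
    all correlation between positions."""
    rng = random.Random(seed)
    letters = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(identity_score(query, "".join(letters)))
    return scores

def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as `observed`,
    with the usual +1 correction so the estimate is never zero."""
    hits = sum(s >= observed for s in null_scores)
    return (hits + 1) / (len(null_scores) + 1)
```

A null model that retains the observed correlations would replace the rng.shuffle step with a sampler that respects them; the rest of the procedure would stay the same.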
Database searches based on gapped n-grams,
for example PatternHunter by Ma et al. (2002),
could also be improved by using the long-range correlation data.
The expected abundance of each n-gram can be used to define
its weight, which enables more efficient use of the index structure.
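One possible form of such a weight is sketched below. It is an assumption for illustration, not a scheme specified in this abstract or in PatternHunter: the empirical frequency of each gapped n-gram stands in for its expected abundance, and the weight is the negative log of that frequency, so that rarer n-grams weigh more. The mask notation ('1' for a kept position, '0' for a skipped one) and the names gapped_ngrams and ngram_weights are hypothetical.

```python
from collections import Counter
from math import log2

def gapped_ngrams(seq, mask):
    """Yield the gapped n-grams of `seq` under `mask`, where '1'
    marks a position that is kept and '0' a position that is skipped."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(seq) - len(mask) + 1):
        yield "".join(seq[start + i] for i in keep)

def ngram_weights(sequences, mask):
    """Return {ngram: -log2(frequency)} over all sequences.

    The empirical frequency is used here as a stand-in for the
    expected abundance; this choice is an assumption."""
    counts = Counter()
    for seq in sequences:
        counts.update(gapped_ngrams(seq, mask))
    total = sum(counts.values())
    return {gram: -log2(count / total) for gram, count in counts.items()}
```

For example, with the mask "101" the sequence "AAAAC" yields the gapped n-grams "AA", "AA" and "AC", and "AC" receives the larger weight because it is the rarer of the two.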