Higher-order Markov chains are usually required for DNA sequences, but for
simplicity we consider only the first-order case. Note that a Markov chain (MC) of any order
can be vectorized to give a first-order MC.
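To illustrate the vectorization remark, here is a minimal sketch for a second-order chain on the four bases: the pair of the last two bases is taken as the state, so that the pair (a, b) can only move to a pair (b, c). The function name `vectorize_second_order` and the uniform table `p2` are hypothetical, chosen only for this illustration.

```python
from itertools import product

ALPHABET = "ACGT"

def vectorize_second_order(p2):
    """Turn second-order transitions p2[(a, b)][c] = P(c | a, b) into a
    first-order transition table on pair states, (a, b) -> (b, c)."""
    states = list(product(ALPHABET, repeat=2))
    p1 = {s: {t: 0.0 for t in states} for s in states}
    for (a, b) in states:
        for c in ALPHABET:
            # The pair (a, b) can only move to a pair that starts with b.
            p1[(a, b)][(b, c)] = p2[(a, b)][c]
    return p1

# Hypothetical second-order chain: the next base is uniform on ACGT.
p2 = {s: {c: 0.25 for c in ALPHABET} for s in product(ALPHABET, repeat=2)}
p1 = vectorize_second_order(p2)
```

The resulting table has 16 states, and each row has at most four nonzero entries; the same idea extends to order m with states of length m.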
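Recall the notation $p_{ij} = P(X_{t+1} = j \mid X_t = i)$ for
stationary transition probabilities.
Under such a model, we can write, for a sequence $S = s_1 s_2 \cdots s_N$,

$$P(S) = P(X_1 = s_1) \prod_{t=1}^{N-1} p_{s_t s_{t+1}}.$$
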
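As a small sketch of the first-order model, the probability of a sequence is the probability of its first base times one transition probability for each adjacent pair of bases, so it depends on the sequence only through its first base and its dinucleotide counts. The code below computes the log-probability both ways. The initial distribution `init` and the transition table `trans` are made-up values, not estimates from any data.

```python
import math
from collections import Counter

def log_prob_direct(seq, init, trans):
    """log P(S): initial term plus one term per adjacent pair."""
    lp = math.log(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[a][b])
    return lp

def log_prob_from_counts(seq, init, trans):
    """Same quantity, with the transitions grouped by dinucleotide count."""
    counts = Counter(zip(seq, seq[1:]))
    return math.log(init[seq[0]]) + sum(
        n * math.log(trans[a][b]) for (a, b), n in counts.items()
    )

# Hypothetical parameters: uniform start, mild preference for repeats.
init = {b: 0.25 for b in "ACGT"}
trans = {a: {b: (0.4 if a == b else 0.2) for b in "ACGT"} for a in "ACGT"}

seq = "ACGTTGCAAC"
lp1 = log_prob_direct(seq, init, trans)
lp2 = log_prob_from_counts(seq, init, trans)
```

The two values agree up to rounding, which is why the dinucleotide counts play the central role in what follows.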
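Let $n_{ij}$ be the dinucleotide counts in
a string of length N, i.e. $n_{AA}$ is the number of occurrences of AA, $n_{AC}$ the number of occurrences of AC,
etc. Write $n_{i+} = \sum_j n_{ij}$ and $n_{+i} = \sum_j n_{ji}$.
If the sequence starts at s and finishes at f, with $s \neq f$, the following holds:
$n_{s+} - n_{+s} = 1$;
$n_{f+} - n_{+f} = -1$;
$n_{i+} = n_{+i}$ if $i \neq s, f$.
(If s = f, all row and column sums agree.)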
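The constraints on the dinucleotide counts can be checked directly. The sketch below counts overlapping dinucleotides in a made-up sequence and, for each base, subtracts the number of dinucleotides ending in that base (column sum) from the number starting with it (row sum). The helper names are hypothetical.

```python
from collections import Counter

def dinucleotide_counts(seq):
    """Counts of overlapping adjacent pairs, keyed by (first, second)."""
    return Counter(zip(seq, seq[1:]))

def row_minus_column(seq, alphabet="ACGT"):
    """For each base i: (# pairs starting with i) - (# pairs ending with i)."""
    n = dinucleotide_counts(seq)
    diff = {}
    for i in alphabet:
        row = sum(n[(i, j)] for j in alphabet)
        col = sum(n[(j, i)] for j in alphabet)
        diff[i] = row - col
    return diff

seq = "ACGTTGCAAG"  # starts at A, finishes at G
diff = row_minus_column(seq)
print(diff)  # {'A': 1, 'C': 0, 'G': -1, 'T': 0}
```

The starting base has one more outgoing than incoming transition, the final base has one fewer, and every other base is balanced.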
All sequences S with given base counts $n_A, n_C, n_G, n_T$ are
equally probable in the independent case,
and there are $N!/(n_A!\, n_C!\, n_G!\, n_T!)$ of them. It is known that a similar expression holds for the
Markovian model.

Proof: See Whittle (1955).
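As a sanity check, the count can be verified by brute force on short sequences. The sketch below assumes the form in which Whittle's formula is commonly quoted: with transition counts n_ij and row sums n_i+, the number of sequences from s to f is (prod_i n_i+!) / (prod_ij n_ij!) times the (f, s) cofactor of the matrix with entries delta_ij - n_ij / n_i+, where a row with n_i+ = 0 is replaced by the corresponding row of the identity. The function names and the three-letter alphabet are chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial, prod

def det(m):
    """Exact determinant by cofactor expansion (fine for tiny matrices)."""
    if not m:
        return Fraction(1)
    total = Fraction(0)
    for j, x in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * x * det(minor)
    return total

def whittle_count(n, s, f, alphabet):
    """Number of sequences from s to f with transition counts n[(i, j)]."""
    rows = [sum(n.get((i, j), 0) for j in alphabet) for i in alphabet]
    m = []
    for a, i in enumerate(alphabet):
        row = []
        for b, j in enumerate(alphabet):
            delta = Fraction(int(a == b))
            share = Fraction(n.get((i, j), 0), rows[a]) if rows[a] else Fraction(0)
            row.append(delta - share)
        m.append(row)
    # (f, s) cofactor: delete the row of f and the column of s.
    r, c = alphabet.index(f), alphabet.index(s)
    minor = [row[:c] + row[c + 1:] for a, row in enumerate(m) if a != r]
    cofactor = (-1) ** (r + c) * det(minor)
    multinomial = Fraction(prod(factorial(x) for x in rows),
                           prod(factorial(x) for x in n.values()))
    return multinomial * cofactor

def brute_force_counts(length, alphabet):
    """Tally all sequences of a given length by (start, end, transition counts)."""
    tally = Counter()
    for seq in product(alphabet, repeat=length):
        n = Counter(zip(seq, seq[1:]))
        tally[(seq[0], seq[-1], frozenset(n.items()))] += 1
    return tally

alphabet = "ACG"
ok = all(
    whittle_count(dict(key_n), s, f, alphabet) == count
    for (s, f, key_n), count in brute_force_counts(6, alphabet).items()
)
```

Exact rational arithmetic is used so that the comparison with the enumerated counts is not affected by rounding; the enumeration is only feasible for very short sequences and small alphabets.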
Here is an application of this to DNA sequences.

Proof: We follow the line of proof for the independence model. Let k be the length of w.

Here
where
is
,
with the dinucleotide counts in w subtracted and
added.
Observing that
we get the desired result.
An expression for
was recently calculated by
Prum et al. (1995) (see also Schbath (1995)),
and used to standardize.
Leung et al. (1996) found that
words of the same length (
)
could be ranked fairly accurately
using
the simplest standardization:
.
However, the more elaborate formulae are to be preferred
for particular words of interest.