It is possible to model DNA sequence with an HMM. Recall that to specify an HMM
we have to define a Markov chain with its transition probabilities, and output
distributions for each state in the chain. A very basic model is presented in figure
3 on page
- here we have a non-coding state which
will have some kind of output distribution over
and states for being
in the first, second and third codon positions. Note that this doesn't allow for
any insertions or deletions in the sequence (since the first codon state must
be followed by the second codon position, and so on. For an output distribution on
this Markov chain, we could use the information such as in table 1 on
page
.
Figure 3: A simple HMM for modeling prokaryotic genes, using frame dependent base
compositions. Each state has a multinomial distribution over
associated
with it.
The problem of deciding what values to use for all the parameters of the HMM (parameters include both the output distributions and the transition probabilities of the Markov chain) is known as the training problem. This is usually dealt with by making estimates from data for which we already know the answer. Alternatively, the standard approaches to estimating parameters in HMMs can be used.
Then the object is to analyse a given sequence of DNA and decide if and where genes exist in it. The Viterbi algorithm is just what is needed here, it figures out the most likely path through the Markov chain, given the data, and so will give an estimate of which parts of the sequence are codons. The forward-backward algorithm can be used to evaluate the likelihood of the given sequence according to the current model being used.
Note that it is very easy to substitute other HMMs for states in the original HMM. For example, the state for the non-coding region could be replaced with a more sophisticated HMM if it is appropriate.
We have only looked at the simplest of HMMs, but in reality there are other features
that should be incorporated. Figure 4 (page
incorporates
the fact that coding must start with ATG and end with one of the three stop codons TAA, TAG,
TGA. The states labeled ``Basic Codon Model'' and ``Inter-Genic Model'' can themselves be
HMMs.
Figure 4: A more sophisticated HMM for modeling prokaryotic genes. Note that the states
``Basic Codon Model'' and ``Inter-Genic Model'' are themselves HMMs
Some possibilities for the underlying basic codon model are:
)
assuming the underlying sequence to be iid.
, this is
the underlying method in [6])
order - i. e. based
on hexamer frequencies)
Exercise 1: show that in a
order Markov chain, if
is defined as the frequency of the sequence
,
then the probability of a sequence of states
is given by
