next up previous
Next: Using HMM's to Up: Finding Genes Previous: A first pass

Using HMM's to model genetic sequence: prokaryotes

It is possible to model DNA sequence with an HMM. Recall that to specify an HMM we have to define a Markov chain with its transition probabilities, and output distributions for each state in the chain. A very basic model is presented in figure 3 on page gif - here we have a non-coding state which will have some kind of output distribution over and states for being in the first, second and third codon positions. Note that this doesn't allow for any insertions or deletions in the sequence (since the first codon state must be followed by the second codon position, and so on. For an output distribution on this Markov chain, we could use the information such as in table 1 on page gif.

  
Figure 3: A simple HMM for modeling prokaryotic genes, using frame dependent base compositions. Each state has a multinomial distribution over associated with it.

The problem of deciding what values to use for all the parameters of the HMM (parameters include both the output distributions and the transition probabilities of the Markov chain) is known as the training problem. This is usually dealt with by making estimates from data for which we already know the answer. Alternatively, the standard approaches to estimating parameters in HMMs can be used.

Then the object is to analyse a given sequence of DNA and decide if and where genes exist in it. The Viterbi algorithm is just what is needed here, it figures out the most likely path through the Markov chain, given the data, and so will give an estimate of which parts of the sequence are codons. The forward-backward algorithm can be used to evaluate the likelihood of the given sequence according to the current model being used.

Note that it is very easy to substitute other HMMs for states in the original HMM. For example, the state for the non-coding region could be replaced with a more sophisticated HMM if it is appropriate.

We have only looked at the simplest of HMMs, but in reality there are other features that should be incorporated. Figure 4 (page gif incorporates the fact that coding must start with ATG and end with one of the three stop codons TAA, TAG, TGA. The states labeled ``Basic Codon Model'' and ``Inter-Genic Model'' can themselves be HMMs.

  
Figure 4: A more sophisticated HMM for modeling prokaryotic genes. Note that the states ``Basic Codon Model'' and ``Inter-Genic Model'' are themselves HMMs

Some possibilities for the underlying basic codon model are:

  1. Frame dependent base composition (as in table 1, page gif) assuming the underlying sequence to be iid.
  2. Independently distributed codon model, with the assumption that codons are independent (as in table 2, page gif, this is the underlying method in [6])
  3. Stationary Markov chain (typically order - i. e. based on hexamer frequencies)
  4. Nonstationary Markov chain, usually with frame-dependent transition probabilities --- for further details see [3] and the references therein.

Exercise 1: show that in a order Markov chain, if is defined as the frequency of the sequence , then the probability of a sequence of states is given by



next up previous
Next: Using HMM's to Up: Finding Genes Previous: A first pass



Simon Cawley
Fri May 1 15:50:13 PDT 1998