next up previous
Next: Dynamic Programming Up: Stat 260: Statistics Previous: Introduction

Global pairwise alignment

A global alignment is a path from top left to bottom right of the dot matrix where the possible moves are , and , k>0. These correspond to the alignment moving along the sequence with no gaps or with a gap of length k in sequence a or b respectively. We might look for the global alignment maximizing the number of corresponding proteins, which can be thought of as maximizing the sum over the path of a function taking value one for matching proteins pairs and zero for non-matching proteins and not penalizing gaps. Although there are a large number of possible paths, we can find a best path with computational effort by working backwards through the dot matrix and using the property that the maximum total sum collected from to the end of the alignment is the maximum of the score for added to the maximum score for all legal completions of the alignment path from . This method is illustrated in Figure 2 and discussed in more detail in Section 3.

  
Figure 2: Recursive calculation of the highest scoring path. The boxed value is 5 because it is at a matching CC location and the highest scoring completion of the alignment path has value 4. If the highest scoring path goes through the boxed position then it will also go through the position with value 4. Stars correspond to matching sites in the unfilled part of the table.

The simple binary scoring method can be modified by using scores that reflect the amount of similarity between the pairs of amino acids on the alignment path. The matrix of the scores for matching amino acid u in sequence a with v in sequence b is called a substitution matrix. Substitution matrices may be defined by considering the biochemical properties of the amino acids, but all have an interpretation in terms of log likelihoods of independent identically distributed transitions for . The PAM (point accepted mutation) and BLOSUM matrices are examples of substitution matrices determined by empirical calculation of these transition probabilities in proteins where the true alignment is known. They are standardized so that, for example, one `PAM' corresponds to an expected change of of the amino acids in the sequence. Any required PAM matrix can be calculated from the originally published PAM-250 of Dayhoff et al. (1978) by taking the appropriate matrix power.

In the case of DNA sequence alignment, if we ignore for the moment gaps that may occur when one or more bases get inserted into one of the sequences, the Jukes/Cantor model (see notes to week 13) leads to a simple substitution matrix. Set as the chance of a nucleotide substitution under evolution, q=1-p, n the total number of bases to be aligned and d the number of sites which do not match in the alignment. Then the likelihood ratio of the probability under the evolution model to that under a null model of iid random bases is

So taking and for we see that the total score of a path is the log likelihood ratio. Note also that and for . In this case, with no gaps, we do not have any freedom to decide what the alignment is but the score can be used for a likelihood ratio test of the hypothesis that there is a genuine alignment against an iid alternative.

Any substitution matrix corresponds to a likelihood ratio under a similar model (Altschul, 1997) by writing it as

where is the background proportion of u, the are target frequencies in aligned regions that sum to 1, and is a positive constant.

In practice it is also important to assign gap penalties for each gap of length k>0 in the proposed alignment. There are currently no probability models or biochemical models suggesting a particular gap penalty, and a common practical solution is to try gap penalties of the form and determine and by trial and error.



next up previous
Next: Dynamic Programming Up: Stat 260: Statistics Previous: Introduction



Simon Cawley
Fri May 8 16:20:53 PDT 1998