A global alignment is a path from top left to bottom right of the dot
matrix where the possible moves are
,
and
,
k>0. These correspond to the alignment moving along the sequence
with no gaps or with a gap of length k in sequence a or b
respectively. We might look for the global alignment maximizing the
number of corresponding proteins, which can be thought of as
maximizing the sum over the path of a function taking value one for
matching proteins pairs
and zero for non-matching proteins
and not penalizing gaps. Although there are a large
number of possible paths, we can find a best path with
computational effort by working backwards through the dot matrix and
using the property that the maximum total sum collected from
to the end of the alignment is the maximum of the score for
added to the maximum score for all legal completions of the alignment
path from
. This method is illustrated in Figure 2
and discussed in more detail in Section 3.
Figure 2: Recursive calculation of the highest scoring path. The boxed
value is 5 because it is at a matching CC location and the highest
scoring completion of the alignment path has value 4. If the highest
scoring path goes through the boxed position then it will also go
through the position with value 4. Stars correspond to matching sites
in the unfilled part of the table.
The simple binary scoring method can be modified by using scores that
reflect the amount of similarity between the pairs of amino acids on
the alignment path. The matrix of the scores
for matching
amino acid u in sequence a with v in sequence b is called a
substitution matrix. Substitution matrices may be defined by
considering the biochemical properties of the amino acids, but all
have an interpretation in terms of log likelihoods of independent
identically distributed transitions for
.
The PAM
(point accepted mutation) and BLOSUM matrices are examples of
substitution matrices determined by empirical calculation of these
transition probabilities in proteins where the true alignment is
known. They are standardized so that, for example, one `PAM'
corresponds to an expected change of
of the amino acids in the
sequence. Any required PAM matrix can be calculated from the
originally published PAM-250 of Dayhoff et al. (1978) by taking
the appropriate matrix power.
In the case of DNA sequence alignment, if we ignore for the moment
gaps that may occur when one
or more bases get inserted into one of the sequences, the Jukes/Cantor model
(see notes to week 13) leads to a
simple substitution matrix. Set
as the chance of a nucleotide substitution under evolution, q=1-p, n the
total number of bases to be aligned and d
the number of sites which do not match in the alignment. Then the
likelihood ratio of the probability under the evolution model to that
under a null model of iid random bases is

So taking
and
for
we see that the total score of a
path is the log likelihood ratio. Note also that
and
for
. In this case, with no gaps, we do
not have any freedom to decide what the alignment is but the score can
be used for a likelihood ratio test of the hypothesis that there is a
genuine alignment against an iid alternative.
Any substitution matrix corresponds to a likelihood ratio under a similar model (Altschul, 1997) by writing it as

where
is the background proportion of u, the
are
target frequencies in aligned regions that sum to 1, and
is
a positive constant.
In practice it is also important to assign gap penalties
for
each gap of length k>0 in the proposed alignment. There are currently
no probability models or biochemical models suggesting a particular gap
penalty, and a common practical solution is to try gap penalties of the
form
and determine
and
by trial
and error.