The comparison of DNA and protein sequences is important in studying evolution and molecular biology. Evolutionary biologists can infer phylogenies from comparing aligned protein and DNA sequences (see notes to week 13) . The evolutionarily conserved and non-conserved regions give clues about the gene structure. Molecular biologists look for clues about structure and function from similarities in protein sequences, since sequencing is now routine and determining the 3D structure and function of a protein can be much harder.
Sequence alignment may be split into global and local, depending on whether or not the similarity is expected to continue right along the sequences being investigated. Another split is between pairwise (two sequence) and multiple (more than two sequence) comparisons. One may also split the problems into those where a relationship is known and where the relationship between the sequences is unknown and the statistical significance of any similarity detected should be checked.
Historically, global pairwise alignment was considered first, in Needleman and Wunsch (1970), and we will conserve this order. Their method involves a dynamic programming type algorithm. In Section 3 we will describe dynamic programming at a more abstract level, and in Section 3.1 we describe the Smith and Waterman (1981) algorithm for local pairwise alignment. The Needleman and Wunsch algorithm can then be seen to be a special case of this.
We will only look at pairwise comparison methods here. Local multiple comparison is an important problem in searching for characteristic protein motifs, however there is not yet a generally acceptable and accepted method, partly as the computational burden can easily become excessive. Heuristic extensions of pairwise methods are a popular way to make some progress with multiple alignment.
Identical techniques are appropriate to the protein, DNA and RNA cases. The sequences will be written as

where
and
come from an alphabet
or
or
.
Figure 1: A dot matrix of Schizosaccharomyces pombe Ste6 (68)
protein (y-axis) against Saccharomyces cerevisiae Cdc25 (69)
protein, showing two homologous domains. Taken from Altschul
(1997).
An exploratory tool to visualize the data is the dot matrix of Gibbs
and McIntyre (1970). An
plot is made with a point marked on
at
if
. This is demonstrated in
Figure 1. Although the dot matrix loses the information
that some substitutions are more compatible with sequence alignment
than others, it is a good way to look for features such as sequence
reversals and secondary matches that may be overlooked by a sequence
aligning algorithm.