next up previous
Next: Global pairwise alignment Up: Stat 260: Statistics Previous: Stat 260: Statistics

Introduction

 

The comparison of DNA and protein sequences is important in studying evolution and molecular biology. Evolutionary biologists can infer phylogenies from comparing aligned protein and DNA sequences (see notes to week 13) . The evolutionarily conserved and non-conserved regions give clues about the gene structure. Molecular biologists look for clues about structure and function from similarities in protein sequences, since sequencing is now routine and determining the 3D structure and function of a protein can be much harder.

Sequence alignment may be split into global and local, depending on whether or not the similarity is expected to continue right along the sequences being investigated. Another split is between pairwise (two sequence) and multiple (more than two sequence) comparisons. One may also split the problems into those where a relationship is known and where the relationship between the sequences is unknown and the statistical significance of any similarity detected should be checked.

Historically, global pairwise alignment was considered first, in Needleman and Wunsch (1970), and we will conserve this order. Their method involves a dynamic programming type algorithm. In Section 3 we will describe dynamic programming at a more abstract level, and in Section 3.1 we describe the Smith and Waterman (1981) algorithm for local pairwise alignment. The Needleman and Wunsch algorithm can then be seen to be a special case of this.

We will only look at pairwise comparison methods here. Local multiple comparison is an important problem in searching for characteristic protein motifs, however there is not yet a generally acceptable and accepted method, partly as the computational burden can easily become excessive. Heuristic extensions of pairwise methods are a popular way to make some progress with multiple alignment.

Identical techniques are appropriate to the protein, DNA and RNA cases. The sequences will be written as

where and come from an alphabet or or .

  
Figure 1: A dot matrix of Schizosaccharomyces pombe Ste6 (68) protein (y-axis) against Saccharomyces cerevisiae Cdc25 (69) protein, showing two homologous domains. Taken from Altschul (1997).

An exploratory tool to visualize the data is the dot matrix of Gibbs and McIntyre (1970). An plot is made with a point marked on at if . This is demonstrated in Figure 1. Although the dot matrix loses the information that some substitutions are more compatible with sequence alignment than others, it is a good way to look for features such as sequence reversals and secondary matches that may be overlooked by a sequence aligning algorithm.



next up previous
Next: Global pairwise alignment Up: Stat 260: Statistics Previous: Stat 260: Statistics



Simon Cawley
Fri May 8 16:20:53 PDT 1998