Likelihood methods for phylogenies were first introduced by Edwards and Cavalli-Sforza (1964) for gene frequency data. Neyman (1971) applied likelihood to molecular sequences and this work was extended by Kashyap and Subas (1974). Felsenstein (1973,1981) brought the maximum likelihood framework to nucleotide-based phylogenetic inference. As an example, we will compute the likelihood for a given tree using DNA sequences, but these procedures can be used for other characters as well. This example is borrowed from Felsenstein (1981).
Figure 2. An example tree for Maximum Likelihood
We have a set of aligned sequences with m sites. To compute
the probability of a tree, we must have a phylogeny with
branch lengths and an evolutionary model giving the probabilities
of change along the tree. This model will allow us to compute
the transition probabilities
, the probability that state j
will exist at the end of a branch of length t, if the state
at the start of the branch is i. We must remember the t
is a measure of branch length, not time. There are
two central assumptions: (1) evolution at different sites
on the tree is independent (and identically distributed)
and (2) evolution in different lineages is independent.
This gives us the Markov property down trees (cf. pedigrees).
The first assumption allows us to express the probability as a product, with one term for each site:

where
is the data at the i-th site. Therefore,
we need to know how to compute the probability at a single
site. If we have the tree and the data at a site, the probability
is the sum, over all possible nucleotides that may have
existed at the interior nodes of the tree, of the probabilities
of each scenario.
)

Based on the Markov property (2) above, we can decompose the probability on the right side of the equation into a product:

If we assume that evolution has been proceeding for a long time according to the model of substitution that we are using, then the Pr(x) is the equilibrium probability of base x under that model.
However, there are still a large number of terms in the previous equation, such that
for each site we sum
terms. The number of terms
will rise exponentially with the number of species. On a tree
with n species, there are n-1 interior nodes, each of which
can have one of 4 states. If we need
terms, then for
10 taxa, this would be 262,144 terms!
In order to make the computation more realistic, Felsenstein (1973, 1981) introduced a "pruning" method, which is a version of the "peeling algorithm" introduced by Hilden (1970), Elston and Stewart (1971) and Heuch and Li (1972). As we saw in a previous week, these in turn are analogous to the HMM forward-backward algorithms, and all are special cases of the idea of dynamic programming over a directed acyclic graph (of Week 14). The method can be derived simply by trying to move the summation signs as far right as possible and enclose them in parentheses where possible.

The pattern of parentheses and terms for tips in this expression is

which corresponds exactly to the structure of the tree. The flow of computations is from the inside of the innermost parentheses outwards, with a flow of information down the tree).
We will utilize the conditional probability of a subtree,
, which is the probability of everything observed from
node k on the tree on up, at site i, conditional on node k
having state s (cf. HMM backward probabilities). In the previous
equation, the term

is one of the quantities, i.e. it is the probability of everything
seen above that node, given that the node has base w. There
will be four such quantities, one for each of the values of w.
The key to the pruning algorithm is that once the four numbers
are computed, they don't need to to be recomputed again.
The algorithm is a recursion that computes
at each
node of the tree from the same quantities in the immediate
descendent node. The algorithm is applied starting at the node
which has all of its immediate descendents being tips.
Then it is applied successively further down the tree, not
applying it to any node until all of its descendents have
been processed. The result is
for the bottom node.
The evaluation of likelihood is then completed for this
site by making a weighted average of these over all four bases,
weighted by their prior probabilities under the model:

Estimation of branch lengths (
) can be done by a form of
the EM algorithm adapted to trees in the same way as the HMM
estimation algorithm is adapted to (linear) Markov chains. Forward
(above) and backward (below) probabilities can be combined to
give joint probabilities of states at adjacent nodes given the data,
and these become the E-terms in the EM iteration. Details can be
found in Felsenstein (1981) or worked out independently by analogy
with the HMM case.