next up previous
Next: References Up: STS-content mapping Previous: STS-content mapping

Distance model: Maximum likelihood approach

In this section we present a maximum likelihood approach based on pairwise STS coretention patterns to order the STSs and estimate the distances between STSs.

Imagine the genome as a long interval of size G on the line. Suppose there are m STSs of interest, , independently and uniformly distributed along the genome. Since each STS is relatively short compared to the genome (typically 1-200 bp versus megabases of the genome), we can model each STS as a point on the line. And suppose there are n uniformly and independently distributed clones of constant, known length L. For a pair of STSs () with distance D apart, we can count the number of clones retaining both STSs (), only (), only (), and neither of the two STSs (). Based on the above assumptions on the processes of STSs and clones, the coretention data = () can then be modelled as a multinomial random variable with cell probabilities . Denote the normalized clone length by l, and the normalized STS pairwise distance by d. The cell probabilities can then be rewritten as . Note that when , i.e., the distance between is larger than the clone size, no clone can retain both STSs simultaneously. So , and has degenerate cell probabilities (0, l, l, 1-2l). Hence the data gives no information about d. The log-likelihood function of d for (, ) under the above model, given the data = (), is

Define the likelihood for each ordering of STSs as the sum of maximized log-likelihoods of consecutive pairs, i.e., given a ordering of the STSs (), say, (),

The idea is that the true order is the maximizer of Lik(). Thus the STS ordering problem is equivalent to a Traveling-Salesman-Problem (TSP) with the STSs as vertices and the pairwise likelihoods as the distances.

Proposition 1. Suppose that we have clone-STS incidence data, where n clones and m STSs are uniformly and independently distributed along the genome. Assume further that clones are of constant, known length L, and m is large enough so that no consecutive STSs are more than L apart. Suppose that for each order of STSs, we fit the data by maximum likelihood under the above model. Then with probability 1 for n sufficiently large, the maximized likelihood will be largest for the true order.



next up previous
Next: References Up: STS-content mapping Previous: STS-content mapping



Simon Cawley
Thu Apr 30 03:30:28 PDT 1998