In this section we present a maximum likelihood approach based on pairwise STS coretention patterns to order the STSs and estimate the distances between STSs.
Imagine the genome as a long interval of size G on the line.
Suppose there are m STSs of interest,
, independently
and uniformly
distributed along the genome. Since each STS is relatively
short compared to the genome (typically 1-200 bp versus megabases of
the genome),
we can model each STS as a point on the line.
And suppose there are n uniformly
and independently distributed clones of constant, known length L.
For a pair of STSs (
) with distance D apart, we can count the number of clones
retaining both STSs (
),
only (
),
only
(
), and neither of
the two STSs (
). Based on the above assumptions
on the processes of STSs and clones, the coretention
data
= (
) can then be
modelled as a multinomial random variable with cell probabilities
.
Denote the normalized clone length
by l, and the normalized
STS pairwise distance
by d. The cell probabilities
can then be rewritten as
. Note that when
, i.e., the distance
between
is larger than the clone size, no clone can retain
both STSs simultaneously. So
, and
has degenerate cell probabilities (0, l, l, 1-2l).
Hence the data gives no information about d.
The log-likelihood function of d for
(
,
) under the above model, given the data
= (
), is

Define the likelihood for each ordering of STSs as the sum of
maximized log-likelihoods of consecutive pairs, i.e., given a ordering
of the STSs (
), say,
(
),

The idea is that the true order is the maximizer of
Lik(
). Thus the STS ordering problem is equivalent to a
Traveling-Salesman-Problem (TSP) with the STSs as vertices and the
pairwise likelihoods as the distances.
Proposition 1. Suppose that we have clone-STS incidence data, where
n clones and m STSs are uniformly and independently distributed
along the genome. Assume
further that clones are of constant, known length L, and m is large
enough so that no consecutive STSs are more than L apart. Suppose
that for each order
of STSs, we fit the data by maximum likelihood under the above model.
Then with probability 1 for n sufficiently large, the maximized
likelihood will be largest for the true order.
