First let us note that if we put our data into a $2 \times 2$ table of counts
as above, then it is not necessary to know the order of the three
loci. In the above table, the smallest entry is for the double recombinant
category because the best order is in fact w-m-f. But if we had labelled
our rows by recombination or not across w-f (say), leaving the columns the
same, the entries would simply be permuted.
Exercise 4: Check this last assertion.
Some time early in the study of such three-point test-crosses (as the above cross is known in the genetic literature), it was realized that the best order is that which assigns the smallest of the four cell entries to the double recombinant category. (Here and below, I assume that we have four counts with a unique smallest and largest. When there are ties, the details are a bit messier, and the conclusion less neat.)
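The rule can be sketched in a few lines of Python. This is only an illustration: the function name, the cell labels and the counts are hypothetical, and the counts are assumed to be indexed by (rec. across 1-2, rec. across 2-3), with 1 meaning recombinant. Ties, and the case where the non-recombinant class is smallest, are simply rejected.

```python
def order_from_counts(n11, n10, n01, n00):
    """Return the locus order picked out by the smallest-count rule.

    Assumed labelling: first index = rec. (1) or not (0) across 1-2,
    second index = rec. (1) or not (0) across 2-3.
    """
    cells = {"11": n11, "10": n10, "01": n01, "00": n00}
    smallest = min(cells.values())
    winners = [k for k, v in cells.items() if v == smallest]
    if len(winners) != 1:
        raise ValueError("tie for the smallest count; the rule is not clear-cut")
    if winners[0] == "00":
        raise ValueError("non-recombinant class is smallest; no order fits")
    # The order under which each class is the double recombinant class:
    # "10" recombines across 1-2 and hence across 1-3, so locus 1 is in the
    # middle; "01" recombines across 2-3 and hence across 1-3, so locus 3 is.
    order_for_double_rec = {"11": "1-2-3", "10": "2-1-3", "01": "1-3-2"}
    return order_for_double_rec[winners[0]]


# Hypothetical counts, for illustration only.
print(order_from_counts(n11=4, n10=35, n01=60, n00=301))
```

Permuting the same four counts among the three recombinant cells changes which locus is placed in the middle, which is the permutation property noted above.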
My aim in this section is to prove that the above rule for determining order in fact gives the ML order, under the assumption of No Chromatid Interference (NCI). The same holds under the assumption of independence of recombinations across disjoint intervals, but this stronger assumption is not necessary. Indeed, the fact that it holds under an assumption that is often contradicted by the data is of interest in itself.
To get the proof going, we need some notation and some preliminary facts.
First, suppose that we have three loci whose true order is 1-2-3,
and denote the intervals 1-2 and 2-3 by $I_1$ and $I_2$. Define the joint
recombination probabilities $p_{ij}$ as follows:

$p_{11}$ = pr( rec. across $I_1$ & rec. across $I_2$ )
$p_{10}$ = pr( rec. across $I_1$ & no rec. across $I_2$ )

and similarly for $p_{01}$ and $p_{00}$. These probabilities can be put into a
$2 \times 2$ table with rows labelled by rec. or not across $I_1$ (1 or 0) and
columns labelled by rec. or not across $I_2$ (1 or 0),
but we repeat an earlier assertion in a slightly different form: if we
labelled the rows and columns by rec. across different intervals (e.g. 1-3 and
3-2), the entries would simply be permuted. A reformulation of this
assertion is as follows: the joint distribution of recombinations across the
two intervals can be described by such a set of $p_{ij}$ relative to any order,
not necessarily the true one. However, for what we want next, the labelling
must refer to the true order, but the expressions involved do not refer to
observable quantities, so this is no impediment. We define probabilities
$q_{ij}$ by:

$q_{11}$ = pr( # exchanges in $I_1$ $> 0$ & # exchanges in $I_2$ $> 0$ )
$q_{10}$ = pr( # exchanges in $I_1$ $> 0$ & # exchanges in $I_2$ $= 0$ )

and similarly for $q_{01}$ and $q_{00}$. Clearly these qs refer to the 4-strand
bundle, and so to unobservable events. Nevertheless, they play an important
role in the theory under NCI.
An exercise for last week was to generalize Mather's formula to 2 intervals, and here is where we need it:
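Mather formula for 2 intervals: Under NCI,

$$p_{11} = \tfrac14 q_{11}, \qquad p_{10} = \tfrac12 q_{10} + \tfrac14 q_{11}, \qquad p_{01} = \tfrac12 q_{01} + \tfrac14 q_{11}, \qquad p_{00} = q_{00} + \tfrac12 q_{10} + \tfrac12 q_{01} + \tfrac14 q_{11}.$$

If we assume that all the $q_{ij}$ are non-zero, then we see that under NCI,
$p_{11}$ is always the smallest, and $p_{00}$ the largest of a set of such
probabilities, when labelled in this way corresponding to the true order.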
In general, however they are labelled, the smallest p corresponds to the
double recombinant category, and the largest to the non-recombinant category,
and this is sufficient to single out just one locus order (up to a reversal).
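As a quick numerical check of these inequalities, here is a minimal Python sketch. It assumes the two-interval Mather relations take the form $p_{11} = q_{11}/4$, $p_{10} = q_{10}/2 + q_{11}/4$, $p_{01} = q_{01}/2 + q_{11}/4$, $p_{00} = q_{00} + q_{10}/2 + q_{01}/2 + q_{11}/4$, with the first index referring to interval 1-2 and the second to interval 2-3. The function name and the q values are hypothetical, chosen only for illustration.

```python
from fractions import Fraction as F


def mather_two_intervals(q11, q10, q01, q00):
    """Map exchange probabilities q to recombination probabilities p under NCI.

    Assumed form of the two-interval Mather relations (see the lead-in).
    """
    p11 = q11 / 4
    p10 = q10 / 2 + q11 / 4
    p01 = q01 / 2 + q11 / 4
    p00 = q00 + q10 / 2 + q01 / 2 + q11 / 4
    return {"11": p11, "10": p10, "01": p01, "00": p00}


# Hypothetical exchange probabilities, all non-zero and summing to 1.
p = mather_two_intervals(q11=F(1, 10), q10=F(2, 10), q01=F(3, 10), q00=F(4, 10))

# The ps form a probability distribution ...
assert sum(p.values()) == 1
# ... with the double recombinant cell smallest and the non-recombinant cell largest.
assert p["11"] < min(p["10"], p["01"]) and max(p["10"], p["01"]) < p["00"]
print(p)
```

Exact fractions are used so that the checks are not affected by floating-point rounding.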
Exercise 5: Prove this last assertion, that is, that a set of ps satisfying the inequalities that follow from the Mather relations defines a unique order (up to inversion) for the loci.
If the ps are not necessarily given relative to the true order, then there
are three possible sets of inequalities they might satisfy (under NCI),
one for each possible order: 2-1-3, 1-2-3 and 1-3-2.
Exercise 6: Write out the three sets of inequalities, with the ps written w.r.t order 1-2-3.
Estimating the order is in essence deciding which of these sets of inequalities is most compatible with the data. The easiest case is when there are no ties, for then there is a unique such order, and this must be the ML order. The proof of our main result is left to you, now that all the preliminaries are available.
Exercise 7: Suppose that our data are four counts in a $2 \times 2$ table as above, relative
to labelling 1-2 of rows and 2-3 of columns. Assume that there is a unique
smallest and a unique largest count. Prove that under a multinomial model
for the data, there is an ML order and it is the one with the smallest
count as the count of double recombinants.
The final item of interest in this story is the equivalence of the ML order under NCI with that under the independence model. The proof (first given by Mary Sara McPeek in her Berkeley thesis) rests on a neat inequality. As before we suppose that our counts and the ps are written relative to the true order, leaving the modification to other orders as an exercise. Under the independence model, the ps have the form
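$p_{11} = \theta_1\theta_2$, $p_{10} = \theta_1(1-\theta_2)$, $p_{01} = (1-\theta_1)\theta_2$, $p_{00} = (1-\theta_1)(1-\theta_2)$,
where $\theta_1$ and $\theta_2$ are the recombination fractions across the intervals 1-2 and 2-3.
If the counts are a, b, c and d (for the cells 11, 10, 01 and 00 respectively) with a < b, c < d, then the likelihood
under a multinomial model and order 1-2-3 is, up to the multinomial coefficient,

$$L(\theta_1, \theta_2) = \theta_1^{a+b}(1-\theta_1)^{c+d}\,\theta_2^{a+c}(1-\theta_2)^{b+d}$$

and its maximum value over the parameter space $0 \le \theta_1, \theta_2 \le \tfrac12$ is

$$\left(\frac{a+b}{n}\right)^{a+b}\left(\frac{c+d}{n}\right)^{c+d}\left(\frac{a+c}{n}\right)^{a+c}\left(\frac{b+d}{n}\right)^{b+d}$$

where n = a+b+c+d.
Similarly we can calculate the maximized likelihoods under the orders 2-1-3 and 1-3-2. (Note that we have
assumed that $\theta_1, \theta_2 \le \tfrac12$ here, without assuming NCI. Can you
prove these bounds under the independence model?)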
Exercise 8: Using the strict monotonicity of $x^x(1-x)^{1-x}$ on $[0, \tfrac12]$, prove that the ML order here is 1-2-3.
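The comparison of the three orders under the independence model can be explored numerically (this is a check, not a proof) with the Python sketch below. It rests on assumptions that are mine rather than the text's: the counts a, b, c, d are taken to be the cells 11, 10, 01, 00 relative to intervals 1-2 and 2-3, each recombination fraction is restricted to [0, 1/2], and the multinomial coefficient is dropped since it is the same for every order. The function names and the counts are hypothetical.

```python
import math


def max_loglik_interval(k, n):
    """Max over 0 <= theta <= 1/2 of k*log(theta) + (n - k)*log(1 - theta)."""
    theta = min(k / n, 0.5)  # constrained ML estimate of the recombination fraction
    ll = 0.0
    if k > 0:
        ll += k * math.log(theta)
    if n - k > 0:
        ll += (n - k) * math.log(1 - theta)
    return ll


def max_loglik_by_order(a, b, c, d):
    """Maximized log-likelihood under independence for each candidate order."""
    n = a + b + c + d
    rec12 = a + b  # recombinants across 1-2
    rec23 = a + c  # recombinants across 2-3
    rec13 = b + c  # recombinants across 1-3 (exactly one of the two intervals)
    return {
        "1-2-3": max_loglik_interval(rec12, n) + max_loglik_interval(rec23, n),
        "2-1-3": max_loglik_interval(rec12, n) + max_loglik_interval(rec13, n),
        "1-3-2": max_loglik_interval(rec13, n) + max_loglik_interval(rec23, n),
    }


# Hypothetical counts with a < b, c < d, for illustration only.
scores = max_loglik_by_order(a=4, b=35, c=60, d=301)
best = max(scores, key=scores.get)
print(best)
```

Under independence the likelihood factorises over the two adjacent intervals of each order, which is why each score is a sum of two single-interval terms.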