The idea behind interval mapping is simple: one can gain power in testing
the null hypothesis of no QTL
against the alternative that there is a QTL at
a specified locus, by incorporating marker information from either side of
the locus. Specifically, if we are testing at s, we make use of the two
markers flanking s, the ones at loci L (for left) and R (for right) say.
The original interval mapping method assumed normality, and since it is a
nice exercise in the use of the EM algorithm with mixtures, I'll run through it. These
days a rank-based variant is more popular, particularly for QTs that are
obviously not normally distributed in either of the pure lines or the
$F_1$.
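Suppose that our DATA are $(y_i, g_i^L, g_i^R : i = 1, \ldots, n)$, where $y_i$ is the
trait value of individual $i$ and $g_i^L$, $g_i^R$ are its genotypes at L and R, and that our
statistical model is:
Given the entire genotype $g$, $y \sim N(\mu_{g^s}, \sigma^2)$, where $g^s$ (A or H) is the genotype at $s$.
Write
$\Delta = \mu_H - \mu_A$. Our null hypothesis is
$\Delta = 0$, and our alternative
$\Delta \neq 0$. The genotype
$g_i^s$ at
$s$ is not part of our DATA, so we enlarge DATA to the augmented data
ADATA defined to be $(y_i, g_i^L, g_i^s, g_i^R : i = 1, \ldots, n)$. Although it would not be hard to carry out this analysis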
using a realistic model for recombination, we adopt the standard
Assumption: Recombinations across disjoint intervals (same
meiosis) are mutually independent.
A little thought reveals that our model for DATA is a set of four 2-component
normal mixtures, one for each of the 4 combinations of genotype at L and R.
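However, the model for ADATA is straightforward, and the corresponding
likelihood $L_A(\theta)$, with $\theta = (\mu_A, \mu_H, \sigma^2)$, has logarithm, up to a
constant in $\theta$, given by
$$\log L_A(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \mu_{g_i^s}\bigr)^2 .$$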
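Now the EM algorithm makes use of the Baum et al. lemma given last week,
where here
$Q(\theta, \theta') = E_{\theta'}[\log L_A(\theta) \mid \mathrm{DATA}]$,
the expected ADATA log likelihood at $\theta$ given DATA, computed under the current parameter value $\theta'$.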
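In order to carry out the E-step and so compute $Q$, we need the following
table, where $r_L$, $r_R$ and $r$ are the recombination fractions
between L and s, s and R, and L and R respectively. Under the independence
assumption, $r = r_L(1 - r_R) + (1 - r_L)r_R$.

Conditional distribution of $g^s$ given $g^L$ and $g^R$:

$$
\begin{array}{cc|cc}
g^L & g^R & \Pr(g^s = A \mid g^L, g^R) & \Pr(g^s = H \mid g^L, g^R) \\ \hline
A & A & (1-r_L)(1-r_R)/(1-r) & r_L r_R/(1-r) \\
A & H & (1-r_L)\,r_R/r & r_L(1-r_R)/r \\
H & A & r_L(1-r_R)/r & (1-r_L)\,r_R/r \\
H & H & r_L r_R/(1-r) & (1-r_L)(1-r_R)/(1-r)
\end{array}
$$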
Having computed $Q$, we next need to maximize it in
$\theta$, which is easy,
because $Q$ is essentially the log likelihood associated with a normal linear
regression with a single regressor. Many iterations later, we have the MLE
under the alternative, and again under the null, and so can compute the
LOD at $s$,
$$\mathrm{LOD}(s) = \log_{10}\frac{L_1(\hat\theta_1)}{L_0(\hat\theta_0)},$$
where
$L_0$ (resp.
$L_1$) is the likelihood under the null (resp. alternative),
and the
$\hat\theta$s are the corresponding MLEs.
Exercise 4: Complete the details of the E part of the above EM.
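To make the iteration concrete, here is a minimal Python sketch of the EM for a backcross, under the independence assumption above. The function names, the 0/1 coding of genotypes (0 for A, 1 for H) and the starting values are my own choices for illustration, not part of the original method's description. It starts the EM at the null MLE, so the monotonicity of EM keeps the resulting LOD non-negative.

```python
import numpy as np

def prob_H_at_s(gL, gR, rL, rR):
    """P(g^s = H | g^L, g^R) for a backcross, assuming no interference.
    gL, gR are arrays coded 0 (A) or 1 (H); rL, rR are the recombination
    fractions between L and s, and between s and R."""
    gL = np.asarray(gL)
    gR = np.asarray(gR)
    # P(g^L | g^s) * P(g^R | g^s) for g^s = H and for g^s = A (prior 1/2 each)
    wH = np.where(gL == 1, 1 - rL, rL) * np.where(gR == 1, 1 - rR, rR)
    wA = np.where(gL == 0, 1 - rL, rL) * np.where(gR == 0, 1 - rR, rR)
    return wH / (wH + wA)

def normal_pdf(y, mu, sigma2):
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_loglik(y, p, mu_A, mu_H, sigma2):
    """Log likelihood of the trait values given the flanking genotypes."""
    dens = p * normal_pdf(y, mu_H, sigma2) + (1 - p) * normal_pdf(y, mu_A, sigma2)
    return np.sum(np.log(dens))

def interval_mapping_lod(y, gL, gR, rL, rR, max_iter=1000, tol=1e-10):
    """Return (LOD at s, (mu_A, mu_H, sigma2)) from an EM started at the null MLE."""
    y = np.asarray(y, dtype=float)
    p = prob_H_at_s(gL, gR, rL, rR)
    # Null MLE: common mean, variance with divisor n
    mu_A = mu_H = y.mean()
    sigma2 = y.var()
    ll0 = mixture_loglik(y, p, mu_A, mu_H, sigma2)
    ll = ll0
    for _ in range(max_iter):
        # E-step: posterior probability that g^s = H given y and flanking markers
        fH = p * normal_pdf(y, mu_H, sigma2)
        fA = (1 - p) * normal_pdf(y, mu_A, sigma2)
        w = fH / (fH + fA)
        # M-step: weighted means and pooled weighted variance
        mu_H = np.sum(w * y) / np.sum(w)
        mu_A = np.sum((1 - w) * y) / np.sum(1 - w)
        sigma2 = np.sum(w * (y - mu_H) ** 2 + (1 - w) * (y - mu_A) ** 2) / len(y)
        new_ll = mixture_loglik(y, p, mu_A, mu_H, sigma2)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    lod = (ll - ll0) / np.log(10.0)
    return lod, (mu_A, mu_H, sigma2)
```

Calling `interval_mapping_lod` on a grid of positions s between L and R (with `rL` and `rR` recomputed for each s from whatever map function one adopts) would give one segment of the LOD curve.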
What happens next is that the LOD score is plotted as a function of s. A
significant peak (see next section) is then declared to be a putative QTL. That
is almost what we see in the Nature Genetics paper, which used MAPMAKER/QTL.
There is a difference between those plots and our current
discussion. They were considering an
$F_2$ intercross, and so there
were 3 possible genotypes at each locus, arising in proportions 1:2:1, and
leaving 2 d.f. for differences in means. However, we can easily
define contrasts essentially equivalent to our 1 d.f. story.
A recent enhancement to MAPMAKER/QTL permits a more robust analysis,
suitable for QTs which are not normally distributed, given genotype.
Here is a quick description of what this is, following Kruglyak and Lander
(1995). Define the test statistic
$Y_W(s)$ at $s$ by
$$Y_W(s) = \sum_{i=1}^{n} (n + 1 - 2\,\mathrm{rank}_i)\, x_i(s),$$
where
$\mathrm{rank}_i$ is the rank of
$y_i$ within ($y_1$, ...,
$y_n$), and
$x_i(s)$ is -1 or +1 according as
$g_i^s$ is A or H. When all
data are available at the flanking markers L and R, the
conditional expectation of
$x_i(s)$ can be calculated using the table
above. In general one can go to the first pair of flanking markers on
which full data are available.
The statistic
$Y_W(s)$ can be normalized by the square root of its mean square
under the null hypothesis, as it has null mean zero. Kruglyak and Lander call
the resulting statistic
$Z_W(s)$, and either plot that or a LOD equivalent.
Exercise 5: Check that
$E[Y_W(s)] = 0$ under the null, and that, when the genotypes at $s$ are fully observed,
$E[Y_W(s)^2] = (n^3 - n)/3$.
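As a rough illustration, the sketch below computes the rank statistic at a fully typed locus, assuming it has the linear rank form $Y_W = \sum_i (n + 1 - 2\,\mathrm{rank}_i)\,x_i$ with $x_i$ coded -1 for A and +1 for H. The function name and the normalizing constant $(n^3 - n)/3$ are assumptions that go with that form and with complete genotype data and no ties. With conditional expectations in place of observed $x_i$ the null mean square would be smaller and this constant would not apply.

```python
import numpy as np

def rank_statistic(y, x):
    """Return (Y_W, Z_W) at a fully typed locus.
    y: trait values (assumed to have no ties).
    x: genotype codes, -1 for A and +1 for H."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1      # ranks 1..n
    weights = n + 1 - 2 * ranks                # these sum to zero
    Y = np.sum(weights * x)
    # Null mean square when each x_i is +/-1 with probability 1/2:
    # the sum of the squared weights, which equals (n^3 - n) / 3.
    Z = Y / np.sqrt((n ** 3 - n) / 3.0)
    return Y, Z
```

Under the null, `Z` is approximately standard normal for large n, so it can be plotted against position in the same way as a LOD curve.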
Haley and Knott (1992) have given a simple approximation to interval mapping,
which requires only one regression (rather than iteration) at each
location s. Their idea is to regress
$y_i$ on $\mathrm{pr}(g_i^s = H \mid g_i^L, g_i^R)$,
and to compute the LOD at s from the RSS of this regression
in the natural way. This appears to give a pretty good approximation
to the LOD curve.
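A minimal sketch of this regression approximation follows. I am reading "the natural way" as the normal-theory likelihood ratio, so that LOD = (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares about the overall mean. That reading, the function names, and the 0/1 genotype coding (0 for A, 1 for H) are assumptions for illustration.

```python
import numpy as np

def prob_H_at_s(gL, gR, rL, rR):
    """P(g^s = H | g^L, g^R) for a backcross, assuming no interference.
    gL, gR are arrays coded 0 (A) or 1 (H)."""
    gL = np.asarray(gL)
    gR = np.asarray(gR)
    wH = np.where(gL == 1, 1 - rL, rL) * np.where(gR == 1, 1 - rR, rR)
    wA = np.where(gL == 0, 1 - rL, rL) * np.where(gR == 0, 1 - rR, rR)
    return wH / (wH + wA)

def regression_lod(y, p_H):
    """Approximate LOD at s from one least-squares fit of y on P(g^s = H | markers)."""
    y = np.asarray(y, dtype=float)
    p_H = np.asarray(p_H, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), p_H])     # intercept plus single regressor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss1 = np.sum((y - X @ beta) ** 2)         # alternative: regression fit
    rss0 = np.sum((y - y.mean()) ** 2)         # null: common mean
    return (n / 2.0) * np.log10(rss0 / rss1)
```

For example, `regression_lod(y, prob_H_at_s(gL, gR, rL, rR))` gives the approximate LOD at the position with recombination fractions `rL` and `rR` from the flanking markers, with no iteration needed.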
Finally, we must turn to the ever important question of