We will consider problems such as that addressed in the Nature
Genetics handout: mapping QTL based upon data from planned crosses
involving inbred lines exhibiting extreme forms of the phenotype. In
the malaria example, there were three inbred strains of mice, one of
which was relatively resistant to malaria, and so exhibited low peak
parasitemia values, and two other strains which were both highly
susceptible to malaria, and so exhibited high peak parasitemia
values. In that case an $F_2$ intercross was made, but for simplicity
we will discuss backcross (BC) data.
As in our discussion last week, we suppose that we begin with two inbred lines
denoted by aa and bb, we form the $F_1$ of type ab, and then we
backcross these to aa. So each individual has one of every homologous
pair of chromosomes of type a, and the other a mosaic of a and b
chromosomes. Denote the complete genotype of an individual by g, with
g(s) being the genotype at locus s : aa or ab, denoted by A or H,
respectively. The QT value will be denoted by y, and the locations at
which markers are scored by $s_1, \ldots, s_m$. We will suppose that we
have relatively dense marker information, such as an average separation
between markers of about 10 cM. (This is roughly what people have
these days, and simulations suggest that for the sample sizes
typically used, say about 200-300 BC or $F_2$ animals, there is not
much to be gained by having denser markers.) Finally, suppose that
our QT is determined by the genotypes at the QTL $q_1, \ldots, q_p$.
In summary, our data take the form DATA = $(y, g(s_1), \ldots, g(s_m))$.
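To make the structure of such data concrete, here is a minimal Python sketch that simulates the genotypes g(s) of backcross individuals at markers spaced 10 cM apart. It assumes no crossover interference, so that recombination fractions follow Haldane's map function. That assumption, the function names `haldane_r` and `simulate_bc_chromosome`, and the particular marker positions and sample size are illustrative choices, not something taken from the handout.

```python
import math
import random


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM.

    Assumes Haldane's map function (no crossover interference).
    """
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_bc_chromosome(positions_cm, rng):
    """Genotypes ('A' or 'H') at ordered positions for one BC individual."""
    # At the first locus, A and H are equally likely in a backcross.
    geno = [rng.choice("AH")]
    for prev, cur in zip(positions_cm, positions_cm[1:]):
        r = haldane_r(cur - prev)
        last = geno[-1]
        # A recombination between adjacent loci switches A <-> H.
        if rng.random() < r:
            last = "H" if last == "A" else "A"
        geno.append(last)
    return geno


rng = random.Random(0)                 # fixed seed for reproducibility
markers = [0, 10, 20, 30, 40, 50]      # hypothetical marker positions (cM)
sample = [simulate_bc_chromosome(markers, rng) for _ in range(250)]
```

Each element of `sample` is one animal's marker genotype vector; a phenotype y would be attached to each to complete the data.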
A simple statistical model for the QT might have the following form:
$$y = f(g(q_1), \ldots, g(q_p)) + \epsilon.$$
This permits more or less arbitrary functions taking $2^p$
different values,
and at present is far more general than people consider. Linear models are
the usual starting points of statistical analyses, although interactions
often enter the picture at some point. In other words, the above general
model is usually restricted to
$$y = \mu + \sum_{j=1}^{p} \beta_j x_j + \epsilon,$$
where $x_j$ is 1 if $g(q_j)$ = H and 0 otherwise. This degree of
simplification might seem too great, but it should be apparent
intuitively that if there is only one QTL per chromosome, looking for
them one at a time may be a good way to start. Of course if there are
two or more QTL close to one another, it may be hard to separate them,
and they may even act in such a way as to make it difficult to find
any one of them. But it is natural to begin with a simple story. So we
begin with p=1 in the above, and write $q_1 = q$ and $\beta_1 = \beta$.
Let us compare the phenotype values of individuals according to their
genotypes at a marker located at s, and denote the recombination fraction
between this locus and the QTL by r. Initially considering just means
and variances, let us look at the difference between the means
$E(y \mid g(s) = H)$ and $E(y \mid g(s) = A)$, writing $\mu$ for the
baseline mean and $\beta$ for the QTL effect in the p=1 model. Even if an
individual is H at s, they may
be A at the locus q. This will occur with probability r, and so
$$E(y \mid g(s) = H) = \mu + (1 - r)\beta.$$
Similarly, individuals A at s will be H at q with probability r, and so
$$E(y \mid g(s) = A) = \mu + r\beta.$$
Now the sum and difference of these two estimable quantities are
$2\mu + \beta$ and $(1 - 2r)\beta$, respectively, so at least these
are identifiable. Apparently neither r nor $\beta$ is separately
estimable (at least linearly), and if $r = 1/2$, i.e. if the locus s is
unlinked with
the QTL q, then no mean difference will be visible between these two
groups. In general, the distribution of y for individuals who are H at
s will be a mixture of two distributions, say $(1 - r)F_1 + rF_0$, where
$F_i$ is the distribution of y for individuals with i copies of b at
q, i = 0 or 1. Similarly, the distribution of y for individuals who are A at s
will be the mixture $rF_1 + (1 - r)F_0$.
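The argument above can be checked by simulation. The sketch below assumes a single QTL acting additively: the phenotype is a baseline mean `mu`, plus an effect `beta` for individuals who are H at q, plus normal noise with standard deviation `sigma`. The normal noise and all parameter values are hypothetical choices made only for illustration. Under these assumptions the difference of the two marker-group means should be close to (1 - 2r) times `beta`, and their sum close to 2 `mu` + `beta`.

```python
import random
import statistics


def simulate_marker_qtl(n, r, mu, beta, sigma, rng):
    """Return (marker genotypes, phenotypes) for n backcross individuals."""
    markers, ys = [], []
    for _ in range(n):
        qtl_is_h = rng.random() < 0.5        # genotype at the QTL q
        recombined = rng.random() < r        # recombination between s and q
        marker_is_h = qtl_is_h != recombined  # marker differs iff recombined
        y = mu + (beta if qtl_is_h else 0.0) + rng.gauss(0.0, sigma)
        markers.append("H" if marker_is_h else "A")
        ys.append(y)
    return markers, ys


rng = random.Random(1)                       # fixed seed for reproducibility
r, mu, beta, sigma = 0.1, 10.0, 2.0, 1.0     # hypothetical parameter values
markers, ys = simulate_marker_qtl(20000, r, mu, beta, sigma, rng)

mean_h = statistics.fmean(y for m, y in zip(markers, ys) if m == "H")
mean_a = statistics.fmean(y for m, y in zip(markers, ys) if m == "A")
diff = mean_h - mean_a      # compare with (1 - 2r) * beta
total = mean_h + mean_a     # compare with 2 * mu + beta

print("difference:", diff, "theory:", (1 - 2 * r) * beta)
print("sum:", total, "theory:", 2 * mu + beta)
```

Only the product (1 - 2r) `beta` is recovered from the mean difference: a marker close to a QTL of small effect and a marker further from a QTL of large effect can give the same two group means.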
Exercise 3: Calculate the variances of y given that g(s) = A or
H. Are r or $\beta$ (the QTL effect)
estimable from a knowledge of the conditional means
and conditional variances, given marker data at s?
To carry out a more formal analysis, we could assume a form for the
mixture components $F_0$ and $F_1$ and
proceed to a likelihood analysis, to test the null hypothesis $r = 1/2$.
Alternatively, we could use least squares, i.e. carry out a regression of y
on the dummy variable indicating whether g(s) = H, again seeking to test
this hypothesis. Either
way, it is customary in the genetic literature to summarize the results of
such analyses by what are known as LOD scores. These are $\log_{10}$
of likelihood ratios, each comparing two likelihoods: one computed under
the null hypothesis $r = 1/2$,
and the other under the alternative hypothesis that the QTL is exactly
at the location s of the marker. Typically, there will also
be nuisance parameters in such likelihoods, as well as the parameter
of interest r, and the likelihood is usually maximized over these,
although there are now many Bayesian analyses where full posterior
distributions are calculated. In the case of regression, normality would usually be
assumed, perhaps after a transformation of the QT y. Alternatively,
some use rank tests, e.g. Wilcoxon's, to achieve a measure of robustness against non-normality.
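As an illustration of a LOD score at a single marker, the following sketch uses the regression approach with normal errors. It makes two simplifying assumptions that go beyond the text: the null model is taken to be a single normal distribution with one common mean, and the alternative gives each marker genotype group its own mean with a common variance. Maximizing both normal likelihoods over means and variance gives LOD = (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares under the null and the alternative. The function name `lod_score_at_marker` and the toy data are hypothetical.

```python
import math


def lod_score_at_marker(genotypes, ys):
    """LOD score for a QTL at a marker, assuming normal errors.

    genotypes: sequence of 'A' / 'H' marker genotypes
    ys:        sequence of phenotype values, same length
    Requires some within-group variation (RSS1 > 0).
    """
    n = len(ys)
    grand_mean = sum(ys) / n
    # Residual sum of squares under the null: one common mean.
    rss0 = sum((y - grand_mean) ** 2 for y in ys)
    # Residual sum of squares under the alternative: one mean per genotype.
    rss1 = 0.0
    for g in ("A", "H"):
        group = [y for m, y in zip(genotypes, ys) if m == g]
        if group:
            group_mean = sum(group) / len(group)
            rss1 += sum((y - group_mean) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss0 / rss1)


# Toy data, invented for illustration only.
genotypes = list("AAAAHHHH")
ys = [1.0, 1.2, 0.8, 1.1, 2.0, 2.3, 1.9, 2.2]
lod = lod_score_at_marker(genotypes, ys)
print("LOD at marker:", lod)
```

A rank-based or mixture-model analysis would replace the two residual sums of squares by the corresponding statistics; the sketch covers only the simplest normal-theory case.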
Rather than pursue mixture modelling with information from a single marker, I will go on to what is known as interval mapping, introduced in 1989 by Lander and Botstein. This was one of the analyses carried out on the mouse peak parasitemia data, and so you should know about it.