Conducting one of the analyses I have just described at many loci, say every
1 cM along a genome, is known as a genome(-wide) scan in search of
QTL. Clearly we are only looking for one QTL at a time, and ignoring
the possibility of closely linked QTL, of interactions, and other
complications, but this is a start. A question of great interest to
biologists is: what value of LOD or $Z$
should be regarded as high
enough to warrant rejecting the null hypothesis? These days there are
always two approaches to significance: one using some formula
based upon an exact or asymptotic distribution, and one using a
computer: simulation, randomization, or bootstrapping. In this
context, MAPMAKER/QTL makes use of a formula, but there is also a very
simple method of obtaining critical values computationally, first
outlined by Churchill and Doerge (1994). It is so simple I can say it
quickly: after carrying out a genome scan, using any approach at all,
randomly reassign the genotype data to the QT values using a random
permutation of $1, \ldots, n$, and redo the analysis using the same
method with this relabelled data. Repeating this random reassignment
another 999 (or 9,999) times, one obtains a threshold which the
LOD or Z score will exceed anywhere in the scan (under the null
hypothesis) in no more than a prescribed fraction of scans.
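
The permutation recipe just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, and `max_abs_z` is a crude stand-in scan (a two-group Z statistic at each marker of a backcross with 0/1 genotypes) rather than the interval-mapping analysis discussed above. Any scan function returning the largest statistic over the genome could be passed in its place.

```python
import math
import random
import statistics


def max_abs_z(genotypes, phenotypes):
    """Largest |Z| over markers (hypothetical stand-in for a real genome scan).

    genotypes[i][j] is the 0/1 genotype of individual i at marker j.
    """
    sd = statistics.pstdev(phenotypes)
    best = 0.0
    for j in range(len(genotypes[0])):
        g1 = [y for row, y in zip(genotypes, phenotypes) if row[j] == 1]
        g0 = [y for row, y in zip(genotypes, phenotypes) if row[j] == 0]
        if not g1 or not g0 or sd == 0:
            continue  # marker carries no information about the trait
        se = sd * math.sqrt(1 / len(g1) + 1 / len(g0))
        z = (statistics.fmean(g1) - statistics.fmean(g0)) / se
        best = max(best, abs(z))
    return best


def permutation_threshold(genotypes, phenotypes, scan=max_abs_z,
                          n_perm=999, alpha=0.05, seed=0):
    """Threshold exceeded by at most a fraction alpha of the permuted scans."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of QT values to genotypes
        null_maxima.append(scan(genotypes, shuffled))
    null_maxima.sort()
    k = max(math.ceil((1 - alpha) * n_perm) - 1, 0)
    return null_maxima[k]
```

The observed scan is then compared with the returned value. Note that the whole scan is redone for each permutation, so it is the maximum over loci that is being calibrated, not the statistic at any single locus.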
Exercise 6: What precisely does the procedure just described test? How would you adapt it to looking for multiple unlinked QTL?
We close this section by outlining the derivation of Lander and Botstein's
asymptotic formula giving a threshold for carrying out genome-wide scans using
a statistic which is marginally like a Z-score. Note that we only get a genuine
Z score at markers. In between markers the EM algorithm has been in action, and
this will mean that the test ``Z-statistic'' at a locus s in between markers is
really only asymptotically N(0,1), as the sample size gets large and as the
intermarker spacing gets uniformly small. It should also be apparent that Z
scores at nearby loci are highly correlated, because most marker data at two
nearby loci will be the same. It is thus not surprising that the correlation
function is essentially given by the relation between recombination fractions
and map distance. Specifically, under our independence of recombinations
assumption, the process of Z scores is, asymptotically as $n \to \infty$
and the inter-marker spacing $\to 0$
uniformly, and under the null, an Ornstein-Uhlenbeck (OU) process,
with mean zero and covariance function $\exp(-\beta |s|)$, where s
is measured in Morgans, and $\beta$ is a constant, here 2. (A different
model for the relation between map distance and recombination fraction
would lead to a different Gaussian process, but it would have the same
correlation function near s=0, and that is what matters here.)
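
To see where the constant 2 comes from, here is a short worked step. It assumes a backcross, with genotype indicators x(s) coded +1/-1 (my notation, not used elsewhere in these notes), and the Haldane map function, which is the one that goes with independent recombinations.

```latex
% Assumptions: backcross, indicators x(s) = +1/-1 with probability 1/2 each,
% Haldane map function r(s) for loci a distance s Morgans apart.
r(s) = \tfrac{1}{2}\left(1 - e^{-2|s|}\right)
\quad\Longrightarrow\quad
\operatorname{corr}\bigl(x(0), x(s)\bigr) = 1 - 2r(s) = e^{-2|s|},
\qquad
e^{-2|s|} = 1 - 2|s| + O(s^2) \quad \text{as } s \to 0 .
```

Any map function with r(s) close to |s| for small s gives the same leading behaviour 1 - 2|s|, which is the sense in which only the correlation function near s=0 matters.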
Our problem of determining the threshold for genome-wide scans has thus been reduced to calculating a boundary-crossing probability for an OU process, at least in the case of dense markers and large n. (Note that the permutation approach does not make any assumptions about sample size and marker density.)
The approximation Lander and Botstein give is as follows: the chance that the
Z process exceeds a threshold T, say, in a genome-wide scan of C
chromosomes having total map length G Morgans, is approximately
$(C + 2GT^2)$
times the corresponding marginal threshold probability, whether one- or
two-sided.
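
As a sketch of how such an approximation gets used, the following Python solves for the threshold T at which the approximate genome-wide exceedance probability equals a chosen level. It takes the multiplying factor to be C + 2GT^2, that is, the OU constant 2 from above. The function names and the values C = 20 and G = 16 in the usage lines are illustrative assumptions of mine, not taken from any particular organism or paper.

```python
import math


def normal_upper_tail(t):
    """P(Z > t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def genomewide_pvalue(t, n_chrom, genome_morgans, two_sided=True):
    """Approximate chance that the Z process exceeds t somewhere in the scan."""
    marginal = normal_upper_tail(t) * (2 if two_sided else 1)
    factor = n_chrom + 2 * genome_morgans * t * t
    return min(1.0, factor * marginal)


def genomewide_threshold(alpha, n_chrom, genome_morgans, two_sided=True):
    """Bisection for the t giving genome-wide level alpha.

    The search starts at 1.5 because the approximation is only meant for
    high thresholds, and the p-value is non-increasing in t beyond sqrt(2).
    """
    lo, hi = 1.5, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if genomewide_pvalue(mid, n_chrom, genome_morgans, two_sided) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Illustrative (assumed) genome: 20 chromosomes, 16 Morgans in total.
t_two_sided = genomewide_threshold(0.05, n_chrom=20, genome_morgans=16)
t_one_sided = genomewide_threshold(0.05, n_chrom=20, genome_morgans=16,
                                   two_sided=False)
```

The resulting threshold is much higher than the usual single-test value, which is the price of scanning the whole genome.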
This result can be pieced together from different approximations to
OU crossing probabilities, for which I refer you to books by
Leadbetter et al (1983, chap. 12), Siegmund (1985, chap. 4), or Aldous
(1989, chap. D). One first needs to establish the approximation for a
single chromosome of length $L$, say, obtaining a factor of
$1 + 2LT^2$. Then these need to be combined across C chromosomes,
using a Bonferroni argument. For some forms of the approximate
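crossing probabilities, you will need to recall that for large x,
$1 - \Phi(x)$ is approximately $x^{-1}\varphi(x)$, where $\Phi$ and
$\varphi$ are the standard normal distribution function and density.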
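
The large-x fact usually used at this step is the normal tail approximation 1 - Phi(x) ≈ phi(x)/x, with Phi and phi the standard normal distribution function and density. I am assuming this is the form intended. A quick numerical check of how good it is at thresholds of interest, with helper names of my own choosing:

```python
import math


def upper_tail(x):
    """1 - Phi(x) for the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def density(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def relative_error(x):
    """Relative error of phi(x)/x as an approximation to 1 - Phi(x)."""
    return (density(x) / x) / upper_tail(x) - 1


errors = {x: relative_error(x) for x in (2.0, 3.0, 4.0)}
```

The approximation overestimates the tail, and the relative error shrinks as x grows, so it is reasonable at the high thresholds relevant to genome-wide scans and poor at small x.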
Exercise 8: Complete the details in the derivation just outlined.