Next: Interval mapping for Up: Stat 260: Statistics Previous: The Multiple Factor

Mapping QTL: preliminaries

We will consider problems such as that addressed in the Nature Genetics handout: mapping QTL based upon data from planned crosses involving inbred lines exhibiting extreme forms of the phenotype. In the malaria example, there were three inbred strains of mice, one of which was relatively resistant to malaria, and so exhibited low peak parasitemia values, and two other strains which were both highly susceptible to malaria, and so exhibited high peak parasitemia values. In that case an intercross was made, but for simplicity we will discuss backcross (BC) data.

As in our discussion last week, we suppose that we begin with two inbred lines denoted by aa and bb, we form the $F_1$ of type ab, and then we backcross these to aa. So each individual has one chromosome of every homologous pair of type a, and the other a mosaic of a and b chromosomes. Denote the complete genotype of an individual by g, with g(s) being the genotype at locus s: aa or ab, denoted by A or H, respectively. The QT value will be denoted by y, and the locations at which markers are scored by $s_1, \ldots, s_m$. We will suppose that we have relatively dense marker information, such as an average separation between markers of about 10 cM. (This is roughly what people have these days, and simulations suggest that for the sample sizes typically used, say about 200-300 BC or $F_2$ animals, there is not much to be gained by having denser markers.) Finally, suppose that our QT is determined by the genotype at QTL $q_1, \ldots, q_p$.

In summary, our data take the form DATA = $(y, g(s_1), \ldots, g(s_m))$.
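To make the form of the marker data concrete, here is a minimal sketch that simulates the genotypes of one backcross individual at equally spaced markers on a single chromosome. It assumes Haldane's map function (no crossover interference) to turn the marker spacing into a recombination fraction; that choice, and the names `haldane_r` and `simulate_bc_genotypes`, are illustrative assumptions and not part of these notes.

```python
import math
import random


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM.

    Assumption: Haldane's map function, i.e. no crossover interference.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_bc_genotypes(n_markers, spacing_cm, rng):
    """Simulate g(s_1), ..., g(s_m) for one backcross individual.

    Each genotype is "A" (aa) or "H" (ab). The first marker is A or H
    with probability 1/2; each later marker switches state relative to
    its left neighbour with probability r.
    """
    r = haldane_r(spacing_cm)
    genotypes = ["H" if rng.random() < 0.5 else "A"]
    for _ in range(n_markers - 1):
        prev = genotypes[-1]
        if rng.random() < r:
            genotypes.append("A" if prev == "H" else "H")  # recombination
        else:
            genotypes.append(prev)                          # no recombination
    return genotypes


if __name__ == "__main__":
    rng = random.Random(1)
    # 11 markers at 10 cM spacing cover a 100 cM chromosome.
    print("".join(simulate_bc_genotypes(11, 10.0, rng)))
```

Under this assumption a 10 cM spacing corresponds to a recombination fraction of roughly 0.09 between adjacent markers, so long runs of A or H are typical.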

A simple statistical model for the QT might have the following form:

$$ y = f(g(q_1), \ldots, g(q_p)) + \epsilon. $$

This permits more or less arbitrary functions $f$ taking $2^p$ different values, and at present is far more general than the models people consider. Linear models are the usual starting points of statistical analyses, although interactions often enter the picture at some point. In other words, the above general model is usually restricted to

$$ y = \mu + \sum_{j=1}^{p} \beta_j x_j + \epsilon, $$

where $x_j$ is 1 if $g(q_j)$ = H and $x_j$ = 0 otherwise, and $\epsilon$ is a random error. This degree of simplification might seem too great, but it should be apparent intuitively that if there is only one QTL per chromosome, looking for them one at a time may be a good way to start. Of course if there are two or more QTL close to one another, it may be hard to separate them, and they may even act in such a way as to make it difficult to find any one of them. But it is natural to begin with a simple story. So we begin with p=1 in the above, and write $q_1 = q$ and $\beta_1 = \beta$.

Let us compare the phenotype values of individuals according to their genotypes at a marker located at s, and denote the recombination fraction between this locus and the QTL by r. Initially considering just means and variances, let us look at the difference between the means $E(y \mid g(s) = H) - E(y \mid g(s) = A)$. Even if an individual is H at s, they may be A at the locus q. This will occur with probability r, and so, with $\mu$ and $\mu + \beta$ the mean QT values of individuals A and H at q,

$$ E(y \mid g(s) = H) = \mu + \beta(1 - r). $$

Similarly, individuals A at s will be H at q with probability r, and so

$$ E(y \mid g(s) = A) = \mu + \beta r. $$

Now the sum and difference of these two estimable quantities are $2\mu + \beta$ and $\beta(1 - 2r)$, respectively, so at least these are identifiable. Apparently neither r nor $\beta$ is separately estimable (at least linearly), and if $r = 1/2$, i.e. if the locus s is unlinked with the QTL q, then no mean difference will be visible between these two groups. In general, the distribution of y for individuals who are H at s will be a mixture of two distributions, say $(1 - r) F_1 + r F_0$, where $F_i$ is the distribution of y for individuals with i copies of b at q, i = 0 or 1. Similarly, the distribution of y for individuals who are A at s will be the mixture $r F_1 + (1 - r) F_0$.
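The mean calculation above can be checked numerically. The sketch below computes the two conditional means exactly for a single QTL with effect beta, and compares them with a simulation in which the QTL genotype differs from the marker genotype with probability r. The normal error with standard deviation `sigma`, the parameter values, and the function names are assumptions made only for illustration.

```python
import random


def conditional_means(mu, beta, r):
    """Exact E(y | g(s) = H) and E(y | g(s) = A) for y = mu + beta * x + error."""
    mean_h = mu + beta * (1 - r)  # H at s: H at q with probability 1 - r
    mean_a = mu + beta * r        # A at s: H at q with probability r
    return mean_h, mean_a


def simulate_marker_means(mu, beta, r, n, sigma=1.0, seed=0):
    """Simulate n backcross individuals and return the sample mean of y
    among those H at the marker and among those A at the marker.

    Assumption: normal errors with standard deviation sigma.
    """
    rng = random.Random(seed)
    totals = {"H": 0.0, "A": 0.0}
    counts = {"H": 0, "A": 0}
    for _ in range(n):
        marker = "H" if rng.random() < 0.5 else "A"
        recombinant = rng.random() < r
        # The QTL genotype matches the marker unless a recombination occurred.
        qtl_is_h = (marker == "H") != recombinant
        y = mu + beta * (1 if qtl_is_h else 0) + rng.gauss(0.0, sigma)
        totals[marker] += y
        counts[marker] += 1
    return totals["H"] / counts["H"], totals["A"] / counts["A"]


if __name__ == "__main__":
    mu, beta, r = 10.0, 2.0, 0.1
    exact_h, exact_a = conditional_means(mu, beta, r)
    sim_h, sim_a = simulate_marker_means(mu, beta, r, n=200000)
    print("exact difference:    ", exact_h - exact_a)  # beta * (1 - 2r)
    print("simulated difference:", sim_h - sim_a)
```

Setting r = 0.5 in `conditional_means` gives two equal means, which is the unlinked case in which the marker carries no information about the QTL.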

Exercise 3: Calculate the variances of y given that g(s) = A or H. Are r or $\beta$ estimable from a knowledge of the conditional means and conditional variances, given marker data at s?

To carry out a more formal analysis, we could assume a form for the component distributions $F_i$ and proceed to a likelihood analysis, to test the null hypothesis $r = 1/2$. Alternatively, we could use least squares, i.e. carry out a regression of y on the dummy variable indicating whether g(s) = H, again seeking to test this hypothesis. Either way, it is customary in the genetic literature to summarize the results of such analyses by what are known as LOD scores. These are base-10 logarithms of likelihood ratios, with one likelihood computed under the null hypothesis $r = 1/2$, and the other under the alternative hypothesis that the QTL is exactly at the location s of the marker. Typically, there will also be nuisance parameters in such likelihoods, as well as the parameter of interest r, and the likelihood is usually maximized over these, although there are now many Bayesian analyses where full posterior distributions are calculated. In the case of regression, normality would usually be assumed, perhaps after a transformation of the QT y. Alternatively, some use rank tests, e.g. Wilcoxon's, to achieve a measure of robustness against non-normality.
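As a rough illustration of the regression version of this analysis, the sketch below computes a LOD score at a single marker. It assumes normal errors with a common variance in both marker classes, with the mean and variance maximized out as nuisance parameters; under that assumption the likelihood ratio reduces to a ratio of residual sums of squares, giving LOD = (n/2) log10(RSS_null / RSS_alt). The function name `marker_lod` and the toy data are hypothetical.

```python
import math


def marker_lod(y, is_h):
    """LOD score comparing 'QTL at the marker' with 'no mean difference'.

    y    : list of QT values
    is_h : list of booleans, True where g(s) = H, False where g(s) = A

    Assumptions: normal errors with a common variance, both marker
    classes non-empty, and some residual variation within the classes.
    """
    n = len(y)

    # Null model: a single common mean for all individuals.
    grand_mean = sum(y) / n
    rss_null = sum((v - grand_mean) ** 2 for v in y)

    # Alternative model: separate means for the H and A marker classes.
    group_h = [v for v, h in zip(y, is_h) if h]
    group_a = [v for v, h in zip(y, is_h) if not h]
    rss_alt = 0.0
    for group in (group_h, group_a):
        m = sum(group) / len(group)
        rss_alt += sum((v - m) ** 2 for v in group)

    return (n / 2) * math.log10(rss_null / rss_alt)


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]
    is_h = [False, False, True, True]
    print(marker_lod(y, is_h))
```

Because the alternative model contains the null model, the ratio of residual sums of squares is at least 1 and the resulting LOD score is never negative.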

Rather than pursue mixture modelling with information from a single marker, I will go on to what is known as interval mapping, introduced in 1989 by Lander and Botstein. This was one of the analyses carried out on the mouse peak parasitemia data, and so you should know about it.






Simon Cawley
Mon Apr 20 19:59:26 PDT 1998