next up previous
Next: About this document Up: Stat 260: Statistics Previous: ML estimation of

Multilocus mapping in experimental crosses

This section is in essence an introduction to the MAPMAKER programs. We begin with two motivational subsections. But first a blanket assumption:

In all that follows, we assume recombinations independent across disjoint intervals.

At some point I may comment on ways of dropping this assumption, but the short answer is that we pay a big price for doing so, and its not worth it. I also need to explain the ad hoc fix that MAPMAKER and others carry out at the end of an analysis under this assumption: the use of Kosambi's map function to get more realistic map distances.

5.1 Evaluation of probabilities of multilocus genotypes.

When we have lots of loci, a notational switch becomes necessary. At the risk of confusing you, I'm now going to label loci 1, 2, 3, ..., L, with alleles , , , , ..., at locus l. So here a, b, ... do not refer to loci A, B, ... Alleles that seem distinct notationally are to be regarded as distinguishable, and until I say so, no complete dominance will be present. So , etc are all codominant at locus l. Also, let the loci be in order 1-2-3-....., and denote the successive recombination fractions by , , ..., for 1-2, 2-3, ... Finally, for simplicity we'll suppose that male and female recombination fractions are equal. You will see how a little extra notation will permit dropping this assumption.

To get an idea of the nature of our task, recall that with L loci there are gametic types (or haplotypes) possible, and so zygotic combinations, collapsing into genotypes. For L=10 or L=20, we have a lot to consider. Of course in a cross with at most a few hundred offspring, most of these combinations will not occur, but we do not know in advance which will.

We are first going to consider progeny from the phase-known cross

Question: What is the frequency with which individuals having genotype

.....

(say) arise? This is easy to answer because we can see everything we need from the genotype: we need a recombination across the 2-3 interval in the gamete that supplies ..., and recombinations across all three intervals to get ..., plus some ... to cover unspecified recombinations in the ... part. The answer to our question is thus

It is an easy consequence of the independence assumption concerning recombinations that here genotypes form a Markov chain along the loci.

Exercise 9: Check this last statement.

The transition matrix of the chain is just the following, applying for all (omitted) subscripts . Row and column labels are for genotypes at consecutive loci.

Note that in an obvious abbreviated notation, our question (*) can be written: what is the probability of going in the above chain. One readily checks that it is .

Thus we have an easy way to calculate probabilities of multilocus genotypes in crosses of the kind we describe as . What about the more conventional (Mendelian) phase known intercross, describable in this notation as , i.e. the foregoing with a=c and b=d? Interestingly, we can still calculate multilocus genotype probabilities in the same way, for they are still Markovian along the loci.

Exercise 10: Prove this, at least in some cases, from the array of 3 locus genotypes.

The easiest way I know to see this fact is to use the theory which tells us when a function of a Markov chain is again Markov. In our case, we want to put a=c and b=d, reducing the state space to aa=A, ab=H and bb=B, to use MAPMAKER notation. Because the probabilities in T into and out of ad agree with those into and out of bc, we can amalgamate these states (and relabel) to get a Markov chain over {A, H, B} with transition matrix:

For a theorem that justifies this step, see....[to be inserted] What if we have dominance...will this collapsing Markov chain trick still work? Sadly, the answer is no: amalgamating the states aa and ab into one state D (MAPMAKER notation) and keeping B does not result in a Markov chain, so that in general another method will have to be used to calculate multilocus probabilities in a cross , where is dominant in relation to for some l.

Exercise 11: Check that the Markovian condition fails in the dominant case for certain triples of genotypes.

How can we calculate multilocus probabilities in general phase known crosses, when there may be dominance (or missing data) at some loci? There is a method known as the forward-backward method, which may have arisen earlier, but for which the earliest reference I know is Baum et al (Ann. Math. Stat. 41, 1970: 164---171). where it appears together with a very general form of what now gets called the expectation-maximization algorithm.

In order to show the quick and easy way to compute these multilocus probabilities, let us start with a Markov chain , , ..... with state sets , , ....having generic elements , , ... Given the preceding discussion we could use either the 4-state chain having a relatively simple transition matrix, or the 3-state chain with the slightly more complex transition matrix. In the first case, we would have , while the 3-state chain has .

Regardless of our choice of underlying Markovian structure, we need to define some function of the chain to be our observation. In effect, the sets on which is constant define an amalgamation of our chain. For example, for a locus with the allele dominant, sends and to the same value D, and to a different value B. (Clearly a similar could be defined on the 4-element state space of we began there.) Another example would be where the genotype at a locus was missing: here the sends everything to the same destination -. Denote by , , .... the observations made at locus 1, 2, ..., i.e. the values , , .

With these preliminaries, we can get down to business. Our task is to evaluate efficiently the chance of obtaining an offspring with genotypic data

........(**)

from a phase known multilocus cross of the form , where we may have a=c and b=d (phase known intercross), or even b=c=d (phase known backcross). Further, specified alleles may be dominant at certain loci, and there may be missing or amalgamated data, all of which is expressed by the functions .

The slow way to calculate the chance of (**) would be to evaluate

where we have written . The efficient way is as follows:

,

,

where ;

.

Clearly this is just the forward part of the forward-backward algorithm; we'll meet the backward part shortly.

This completes the first part of our motivation.

5.2 Counting methods in genetics.

In the mid-1950s, a number of geneticists including Ceppelini, Sinascalco and CAB Smith made extensive use of what they termed counting methods....precursors to the now familiar expectation-maximization algorithm in the context of amalgamated multinomials. Let us revisit the Fisher-Balmukand analysis of the phase known intercross with alleles at both loci exhibiting complete dominance. In that analysis we saw 16 potentially distinguishable categories amalgamated to 4, with the result that the two parameters and were no longer identifiable, but only the function .

We can do nothing about this, but we can look at the ML estimation another way...by gene counting. Denote the observed counts of the phenotypic categories AB, Ab, aB and ab by , etc. If we knew the recombination fractions and , we could estimate the counts in the 16 cells by suitably scaling the observed counts. For example, the count can be estimated from by the relation

where this expectation is calculated w.r.t. the multinomial distribution with 16 cells and known probabilities. This can be done for the other 14 cells (we already know ), e.g.

Exercise 12: Check that these expressions are indeed conditional expectations in the associated multinomial.

This is the first half of the expectation-maximization....the second half is to use the preceding formulae computed with initial estimates of and to get improved estimates of these quantitiies. For example, 8 of the cells correspond to recombination in the maternal meiosis, and so we improve by writing

,

where the s are computed by the above formular using and .

Similarly for . This process can be iterated and in fact it converges to the MLE which we could calculate directly.

Exercise 13: Carry out this iteration. Explain how the fact that only is identifiable arises in the context.

one final comment here. Although these expectation-maximazation algorithms arose in the context of genetics (and elsewhere), Baum and colleagues invented them independently in their HMM context, see below.

5.3 The Lander-Green algorithm for CEPH pedigrees.

In this subsection, we'll combine ideas in the last two into the Lander-Green algorithm for simultaneously estimating many recombination fractions and ordering the associated loci. It was published in 1987, coming just as the first large-scale genetic maps for humans and many newer experimental organisms were being developed using what are known as restriction fragment length polymorphisms (RFLPs). Many of these maps - certainly the human one - would not have been possible without the algorithm, so it has played a pretty important role in our subject. It still does in the newer context of mapping disease genes, as we'll see next time.

First we formalize the notion of Hidden Markov Chain (or Model), abbreviated HMC (HMM). This seems to have been formulated first by Baum and colleagues, although it is simply a more general form of the state-space models of the early 1960s that underly the famous Kalman filter. We postulate a bivariate Markov chain , with the special property that

(bivariate Markov property)

,

where this last factorization tells us that X is a Markov chain on its own, and that Y is a sequence of random variables conditionally independent given the chain X. In practice, Y is observed and X is not, and so the probabilistic structure of Y needs to be analysed by involving X (the ``hidden" chain). When the hidden chain has a modest number of states, or is Gaussian, there are efficient algorithms for carrying out the following computations:

# 1:

# 2: , or

# 3: .

Here denotes any parameters in the HMM, a notation I have taken from an excellent review paper on HMMs by Rabiner (Proc IEEE vol 77, 1989: 257--285). I'm not going to spell out the forward-backard algorithms here, as they are readily accessible to anyone who is interested. What I will do is explain how Lander and Green used this formalism to great effect in the multilocus mapping problem.

Before doing this a word on CEPH pedigrees. CEPH is an abbreviation for the French organization Centre D'Etude du Polymorphisme Humain (Centre for the Study of Human Polymorphism). At about the time when DNA from families was needed to generate genetic data for mapping RFLP markers, CEPH turned out to have the best resources: several dozen 3-generation families which all had 4 grandparents, 2 parents and a large number (10 or more) of children. So that's what a CEPH family looks like: lots of kids, their parents and their parents' parents. Now it will not escape your attention that a large experimental cross has a similar structure: initial parental lines, say aa and bb, (and we'll add cc and dd, although usually a=c and b=d), first generation hybrids ab (and cd), and F-2 offspring the result of the cross . The main differences are as follows. Firstly, we would not normally know phase (at least not all phases) in a CEPH pedigree, and secondly, 10 children is rather fewer than 100--200 progeny. A not so obvious consequence of phase being known in the experimental cross is that the offspring genotypes and phenotypes are independent, which is not generally the case in the CEPH children. We'll see how these differences play out shortly, but for the moment keep CEPH families in mind.

As in our first motivational segment, the underlying (hidden) chain steps through one of four possible states at each locus, but this time we have a component of the chain for every one of the offspring. This compound chain is simply the product of all the component chains, but in general the components are not (unconditionally) independent. Following Lander and Green denote the states by v (for inheritance vector): v has two components for each of the offspring: one maternal and one paternal, and each component has two possible states, conventionally denoted by 0 and 1. We define a component to be in state 0 if the individual's allele at that position is grandmaternal; otherwise it is in state 1. [It should be clear that we have one component for every segregation leading to one of our offspring. Elizabeth Thompson calls v the vector of segregation indicators, but we'll stick with the Lander Green terminology, as we are discussing their algorithm.]

Exercise 14: Check that this is consistent our discussion in 5.1. There the state ac corresponds (say) to our 00, etc.

So the state vector of our HMM is a binary 2n-tuple, if there are n offspring (progeny). Clearly gets large quickly, and for n=200 does not bear thinking about. Fortunately we do not need this for experimental crosses, because the progeny are independent, see shortly.

The transition matrix of our chain is simply the Kronecker (or tensor) product of matrices of the form T above. Indeed we now see that T itself is a product of two matrices: one maternal and one paternal, each having the form

Another way of describing the transition matrix of our product chain is as follows: the probability of moving from one binary 2n-tuple v to another w as we go from locus l to locus l+1 is

where is the Hamming distance between v and w, i.e. the number of discordances between them. Note that this expression easily factorises into a maternal and a paternal component if we want to distinguish maternal and paternal recombination fractions.

Exercise 15: Describe the transition matrix with sex-specific recombination fractions.

Having defined the Markov chain and its transition matrix, (the X in the HMM), it remains to define the Y --- the observations. As in our motivation, these are the genotypes, including missing or amalgamated data, and are given by a locus-specific function , say. The probability distribution of the observations on the offspring given the inheritance vector v is simple: it is 1 if the observation is compatible with v, and 0 otherwise. (If there are missing parents or grandparents, this story changes slightly. The inheritance vector gets enlarged, and population frequencies play a role. The same thing happens when, as in GENEHUNTER, a small pedigree version of MAPMAKER, we have other family structures.)

With this all set up, the HMM algorithms permit efficient computation of the following, where :

a)

b)

c) .

In our context, all three problems are of interest. The first is calculating the likelihood of , clearly necessary for the third. Our is the set of all recombination fractions. If these were known, but phase was not, then the solution to b) becomes of interest: determining the most probable haplotype.

The intermediate quantities needed to compute the above are (in Baum et al's notation)

As mentioned above, each of these can be efficiently calculated by simple recursions, but it is importantr to remember that these recursions involve summing over all inheritance vectors v....clearly not a feasible task when n=200.

With this background, I turn to the expectation-maximization part of the proceedings. All rests on Baum et al's simple but important 1970 lemma involving

where , and similarly for y.

Lemma 1. If , then .

Proof: Write as

and notice that the first term in the product sums to unity. We use Jensen's inequality applied to the log function, to deduce that

.

We use this in the following way: is increased by finding such that . Indeed we can explicitly maximize in ....this is true beauty of the algorithm. I return now to the situation where maternal and paternal recombination fractions are distinguished.

Lemma 2. In our problem, the * which maximizes in has maternal component at locus l given by

where is the maternal Hamming distance (the number of discordances between the 2n-tuples in maternal components). Obviously there is a similar expression for the paternal , or a pooled one of we do not distinguish the recombination fractions.

This quantity is readily calculated by the formula .

Exercise 16: Write out a proof of this lemma. (It is very straightforward as Q has the character of a binomial likelihood.)

5.4 The Lander-Green algorithm for experimental crosses.

For a long time I had wondered how they got over the problem of having inheritance vectors with 100 progeny, and last week I asked Mark Daly. He told me that the calculation for experimental crosses was done on each individual separately, and their likelihoods combined. Section 5.1 is what I made of this. In effect, MAPMAKER/EXP evaluates the likelihood for an experimental cross by running the Lander-Green algorithm for CEPH pedigrees n times, each with a sibship of 1, and adds the logs of the probabilities. This can efficiently cope with dominant alleles at loci, missing data etc.

How about estimating the recombination fractions? I don't know at this point exactly how it works, but it is almost certainly in one of two ways. One could calculate the expressions in Lemma 2 separately in each little pedigree with one child, and average them. Alternatively, one could go to a more conventional EM algorithm as described in section 5.2. Either way, one can update and ultimately get MLEs of every recombination fraction, with or without sex differences. Under our independence of recombination assumption, pairwise estimates of recombination fractions are in fact joint MLEs. Thus we can calculate the maximum likelihood of any locus order in these two steps.

Of course there are many important technical details I did not have time to discuss. How did we get an allocation of loci to chromosomes? (Short answer: by computing and comparing all pairwise recombination fractions.) How do we get a preliminary order for loci within a chromosome? (Short answer: from the pairwise recombination fractions, possibly supplemented by all three-point analyses.) How do we move from a prelinminary order to a maximum likelihood order? (Short answer: its hard, and there are no guarantees. We can mimic travelling salesman algorithms). What about errors in the data? (Indeed! there are useful supplements to the LG algorithm for these.) And so on...



next up previous
Next: About this document Up: Stat 260: Statistics Previous: ML estimation of



Simon Cawley
Mon Apr 20 19:52:17 PDT 1998