Next: Theory Up: Stat 260: Statistics Previous: Stat 260: Statistics

Introduction

An important resource for the genomic and genetic study of an organism is a complete "map" of its genome. There are two main categories of map: genetic maps and physical maps. A genetic map gives the order of functional units (genes) or genetic markers in the genome and the distances between them, expressed as recombination fractions or map distances. A physical map shows the relative positions of overlapping cloned DNA fragments in the genome and/or sets of ordered markers, together with the physical distances (in base pairs) between them. A physical map also provides a basis for further genomic sequencing, which is the ultimate goal of most genome projects. We have learned a lot about genetic mapping in the past weeks. Now we will switch gears to physical mapping.

A physical map is essential in a genome project because of the huge gap between the size of a genome, measured in megabases, and the few hundred base pairs that can be sequenced in one piece. For example, the human genome is roughly 3 billion bp, the genome of the fruit fly Drosophila melanogaster is 120 Mb, and E. coli has 4.7 Mb of sequence. Typically, several stages of physical mapping are required before shotgun sequencing can be done. Suppose we want to sequence human chromosome 5. To start, take copies of chromosome 5, break them into fragments by sonication or restriction digestion, and insert the DNA fragments into a vector. A vector carries the inserted DNA fragment and self-replicates along with it in an appropriate host. This provides a way to produce the amount of DNA required for biochemical reactions. Different vectors have different insert capacities. For example, bacteriophage P1 can carry inserts of about 80 kb, YACs (Yeast Artificial Chromosomes) carry hundreds of kb up to 1 Mb, and the insert size in a plasmid is about 3 kb.

Suppose now we have a P1 clone library for chromosome 5, in which each clone is a replicate of a random segment of chromosome 5 roughly 80 kb long. Unfortunately, the positional information of the DNA fragments is lost in the process of cloning, so strategies must be employed to recover the order of the clones along the chromosome. A common strategy is to infer pairwise clone overlaps from shared markers (e.g. STSs, sequence-tagged sites) or shared restriction fingerprints, and then reconstruct the order of the clones. A tiling path (a subset of overlapping clones) is then chosen, and the chromosome 5 sequencing project is reduced to many smaller sequencing projects, one for each P1 clone in the selected tiling path. To sequence a P1 clone, a similar scheme is applied again one or more times. We can either sequence many small random fragments and then recover the nucleotide sequence of the target P1 clone by pairwise sequence consensus, or create a subclone map for the P1 clone and then sequence a subclone tiling path.
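The choice of a tiling path can be illustrated with a small sketch. It assumes the clones of one island have already been ordered and placed, so each clone is a hypothetical (start, end) pair in kb, and it uses a standard greedy interval-cover rule (always take the overlapping clone that reaches furthest right). This is only one plausible way to pick a small tiling path, not necessarily the procedure used in any particular genome project. The function name `tiling_path` and the example coordinates are made up for illustration.

```python
def tiling_path(clones):
    """Greedy tiling path for a single island.

    clones: list of (start, end) positions of already-ordered clones
            (hypothetical coordinates, e.g. in kb).
    Returns a left-to-right subset of clones in which each clone overlaps
    (or at least abuts) the next and which spans the whole island.
    Raises ValueError if the clones leave a gap.
    """
    if not clones:
        return []
    ordered = sorted(clones)
    target_end = max(end for _, end in ordered)
    path = []
    covered = ordered[0][0]  # rightmost position reached by the path so far
    i = 0
    while covered < target_end or not path:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][0] <= covered:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError("clones do not form a single island")
        path.append(best)
        covered = best[1]
    return path


# Five overlapping 70-80 kb clones; three of them suffice to span the island.
example = [(0, 80), (10, 90), (60, 140), (85, 150), (130, 210)]
print(tiling_path(example))
```

For intervals on a line, this greedy rule gives a path with the fewest possible clones. In practice one might also weigh clone quality or the amount of overlap, which this sketch ignores.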

 
Figure 1: physical mapping scheme.

In order to plan a cost-efficient physical mapping/sequencing project, it is very important to understand the progress of a project in relation to the key parameters such as the number of clones, genome size, and clone lengths. For example, people are interested in the following questions:

An island is a set of overlapping clones, and a contig is an island with two or more clones. Because of the way a clone library is created, we can answer the above questions by modelling the target genome as a long line and the clones as random intervals along that line. This theory, developed by Lander and Waterman (1988, [8]), is essentially what is known in the probability and statistics literature as the theory of coverage processes (cf. P. Hall's book [7]).
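The "random intervals on a line" model is easy to explore by simulation. The sketch below makes several simplifying assumptions that are not part of the text: all clones have the same length, clone start points are uniform on the genome, and any overlap at all is detected. The genome length, the number of clones, and the function names are hypothetical, chosen only for illustration.

```python
import random


def islands_from_starts(starts, clone_len):
    """Group equal-length clones into islands.

    starts: left endpoints of the clones (any order).
    Returns a list with the number of clones in each island,
    ordered left to right along the genome.
    """
    sizes = []
    prev_end = None
    for s in sorted(starts):
        if prev_end is not None and s < prev_end:
            sizes[-1] += 1      # overlaps the previous clone: same island
        else:
            sizes.append(1)     # gap before this clone: a new island starts
        prev_end = s + clone_len
    return sizes


def simulate(genome_len, clone_len, n_clones, rng):
    """Place n_clones at uniform random positions and return island sizes."""
    starts = [rng.uniform(0, genome_len - clone_len) for _ in range(n_clones)]
    return islands_from_starts(starts, clone_len)


# Hypothetical project: 10 Mb target, 80 kb clones, 250 clones
# (about 2x coverage). Lengths are in kb.
rng = random.Random(1998)
sizes = simulate(10_000, 80, 250, rng)
n_islands = len(sizes)
n_contigs = sum(1 for k in sizes if k >= 2)
print(n_islands, "islands, of which", n_contigs, "are contigs")
```

Repeating the simulation for different numbers of clones gives an empirical picture of how islands first multiply and then merge as coverage increases, which is the behaviour the coverage-process theory describes analytically.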






Simon Cawley
Thu Apr 30 03:30:28 PDT 1998