An important resource for the genomic and genetic study of an organism is a complete "map" of the genome. There are two main categories of such maps: genetic maps and physical maps. A genetic map gives the order of, and the distances (in terms of recombination fractions or map distances) between, functional units (genes) or genetic markers of the genome, while a physical map shows the relative positions of overlapping cloned DNA fragments in the genome and/or sets of ordered markers, and the physical distances (in base pairs) between them. Such a map also provides a basis for further genomic sequencing, which is the ultimate goal of most genome projects. We have learned a lot about genetic mapping in the past weeks. Now we will switch gears to physical mapping.
A physical map is essential in a genome project because of
the huge gap between genome sizes, measured in megabase pairs, and the
manageable sequence length of a few hundred base pairs. For example, the
human genome is roughly 3 billion bp, that of the fruit fly Drosophila
melanogaster is 120 Mb, and that of E. coli is 4.7 Mb. Typically several stages
of physical mapping for the genome are required before shotgun
sequencing can be done. Suppose now we want to sequence human
chromosome 5. To start, take copies of chromosome 5,
shear them by sonication or restriction digests, and then insert the
DNA fragments into a vector.
A vector carries the inserted DNA fragment, and
self-replicates along with the inserted DNA in an appropriate host.
This provides a way to produce the amount of DNA
required for biochemical reactions. Different vectors have different capacities
for the size of the insert. For example, bacteriophage P1 is capable
of carrying inserts of roughly 80 kb, YACs (Yeast Artificial
Chromosomes) carry hundreds of kb up to 1 Mb, and the insert size in a
plasmid is 3 kb. Suppose now we have a P1 clone library
for chromosome 5, in which each clone is a replicate of a roughly 80
kb long random segment of
chromosome 5. Unfortunately the positional information of the DNA
fragments is lost in the process of cloning, so strategies must be
employed to recover the order of the clones along the chromosome. A
common strategy is to infer pairwise clone overlaps by shared
markers (e.g. STSs, sequence-tagged sites) or shared restriction
fingerprints, and then reconstruct the order of the clones. A tiling path
(a subset of overlapping clones) is then chosen, and the chromosome 5
sequencing project is reduced to many smaller sequencing projects of
those P1 clones in the selected tiling path. To sequence a P1 clone,
a similar scheme is applied again one or more times. We can either sequence
lots of small random fragments, and then recover the nucleotide
sequence of the target P1 clone by pairwise sequence consensus, or
create a subclone map for the P1, and further sequence a subclone
tiling path.
Figure 1: physical mapping scheme.
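The overlap-inference step of this scheme can be illustrated with a small Python sketch. It assumes each clone has already been scored for a set of STS markers, declares two clones overlapping when they share at least one marker, and then groups clones that are connected through such overlaps. The clone names, marker names, and function names are hypothetical, and the sketch ignores scoring errors and repeated markers, both of which matter in a real project.

```python
from itertools import combinations


def shared_marker_overlaps(clone_markers):
    """Return the set of clone pairs that share at least one marker."""
    overlaps = set()
    for a, b in combinations(sorted(clone_markers), 2):
        if clone_markers[a] & clone_markers[b]:
            overlaps.add((a, b))
    return overlaps


def group_overlapping_clones(clone_markers):
    """Group clones that are connected through inferred pairwise overlaps."""
    # Simple union-find structure: every clone starts in its own group.
    parent = {clone: clone for clone in clone_markers}

    def find(clone):
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path compression
            clone = parent[clone]
        return clone

    # Merge the groups of every pair of clones inferred to overlap.
    for a, b in shared_marker_overlaps(clone_markers):
        parent[find(a)] = find(b)

    groups = {}
    for clone in clone_markers:
        groups.setdefault(find(clone), []).append(clone)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy library: clone name -> set of STS markers found on it.
library = {
    "c1": {"sts1", "sts2"},
    "c2": {"sts2", "sts3"},
    "c3": {"sts3"},
    "c4": {"sts7", "sts8"},
    "c5": {"sts8"},
    "c6": {"sts11"},
}

print(sorted(shared_marker_overlaps(library)))
print(group_overlapping_clones(library))
```

Note that grouping is only the first step: recovering the order of the clones within each group, and then choosing a tiling path, requires additional information such as the order of the markers.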
In order to plan a cost-efficient physical mapping/sequencing project, it is very important to understand the progress of a project in relation to the key parameters such as the number of clones, genome size, and clone lengths. For example, people are interested in the following questions:
An island is a set of overlapping clones, and a contig is an island with two or more clones. Because of the way a clone library is created, we can answer the above questions by modelling the target genome as a long line and the clones as random intervals along that line. This theory, developed by Lander and Waterman (1988, [8]), is essentially what is known in the probability and statistics literature as the theory of coverage processes (cf. P. Hall's book [7]).
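To make the "random intervals on a line" picture concrete, the following Python sketch simulates a clone library and counts islands and contigs. It is a simplified illustration, not the Lander-Waterman analysis itself: it assumes clones of equal length placed uniformly at random, and it treats any overlap (even a single shared point) as detectable, whereas in practice a minimum overlap is needed. The genome length, number of clones, and function names are hypothetical; only the 80 kb clone length comes from the P1 example above.

```python
import random


def merge_into_islands(intervals):
    """Merge overlapping (start, end) intervals.

    Returns a list of [start, end, n_clones], one entry per island.
    """
    islands = []
    for start, end in sorted(intervals):
        if islands and start <= islands[-1][1]:
            # This clone overlaps the current island: extend it.
            islands[-1][1] = max(islands[-1][1], end)
            islands[-1][2] += 1
        else:
            # A gap precedes this clone: start a new island.
            islands.append([start, end, 1])
    return islands


def simulate_library(genome_len, clone_len, n_clones, rng):
    """Place n_clones intervals of length clone_len uniformly at random."""
    starts = [rng.uniform(0, genome_len - clone_len) for _ in range(n_clones)]
    return [(s, s + clone_len) for s in starts]


def summarize(intervals, genome_len):
    """Return (number of islands, number of contigs, fraction covered)."""
    islands = merge_into_islands(intervals)
    contigs = sum(1 for _, _, n in islands if n >= 2)
    covered = sum(end - start for start, end, _ in islands)
    return len(islands), contigs, covered / genome_len


# Hypothetical parameters: a 10 Mb target and 80 kb clones.
GENOME_LEN = 10_000_000
CLONE_LEN = 80_000
rng = random.Random(0)  # fixed seed so repeated runs agree

for n_clones in (50, 125, 250, 500):
    clones = simulate_library(GENOME_LEN, CLONE_LEN, n_clones, rng)
    n_islands, n_contigs, frac = summarize(clones, GENOME_LEN)
    redundancy = n_clones * CLONE_LEN / GENOME_LEN
    print(f"N={n_clones:4d} redundancy={redundancy:4.1f} "
          f"islands={n_islands:3d} contigs={n_contigs:3d} covered={frac:.2f}")
```

Running the loop with increasing numbers of clones shows how the island and contig counts and the covered fraction change as the library grows, which is the kind of relationship the theory describes analytically.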