Next: Simple IID model Up: Stat 260: Statistics Previous: Stat 260: Statistics

Base composition




Base composition is one of the most fundamental features of a DNA sequence. It is given by the percentages of the four nucleotides, all counted on one strand. We denote them by A, G, C, T.

It is an observational fact that G ≈ C and T ≈ A generally hold, so we usually speak of G+C (or A+T) content; see Fickett et al. (1992).
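As a small illustration of these definitions, the following Python sketch computes the four base frequencies and the G+C content of a single strand. The function names and the toy sequence are hypothetical, chosen only for this example, and the choice to ignore characters other than A, C, G, T (such as N) is an assumption rather than something stated in these notes.

```python
from collections import Counter

def base_composition(seq):
    """Fractions of A, C, G, T among the unambiguous bases of one strand."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")  # assumption: skip N and other symbols
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in "ACGT"}

def gc_content(seq):
    """G+C content as a fraction between 0 and 1."""
    comp = base_composition(seq)
    return comp["G"] + comp["C"]

toy = "ATGCGCGTATTAGC"  # hypothetical toy sequence, not real data
print(base_composition(toy))
print(gc_content(toy))  # 0.5 for this toy sequence
```

Comparing the A and T entries, and the G and C entries, of `base_composition` on a long sequence is one simple way to look at the approximate equalities mentioned above.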

Our progression from base frequencies goes roughly as follows: look at them empirically, within and between genomes; then use simple models (e.g. i.i.d. or just i.d.); and ultimately use more complex Markov or hidden Markov models.

Why do we use simple statistics? Simple statistics are often full of biological meaning (e.g. "sites"), or at least have biological value, e.g. in reconstructing or estimating phylogeny (evolutionary history) from base composition. Simple models are also useful as background models relative to which other sequence features are located.

Before we go on, it is useful to keep the structure of (eukaryotic) genes in mind. You need to be familiar with terms such as exon, intron, triplet and stop codon, and with what happens when transcription occurs. Refer to a molecular biology textbook for these matters. What we are about to do is identify features of this process.

By now, more than 10 bacterial (prokaryotic) genomes have been fully or partially sequenced, and there is much sequence from certain other organisms. Some base compositions were found to be as follows.

In studying single-sequence statistics, we can consider:

Specifics. A specific hexamer may be entirely absent from a small genome. If the genome consists of 5 Mbp and bases appeared independently with equal probabilities, we would expect about 5 x 10^6 / 4^6, or roughly 1200, occurrences of each hexamer on a given strand. So the fact that we see none is a good indication that bases are not appearing randomly.

Mean levels and trends. Here scale matters. Isochores in mammalian genomes are regions of approximately constant A+T content, hundreds of kb long, in which A+T can be as low as 35% and as high as 70% (within-genome heterogeneity).

Variation and covariation.
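The first two items can be sketched in Python as follows. This is only a minimal illustration: the function names, the default of equal base frequencies, and the non-overlapping-window choice in the usage are assumptions made for the example, and only one strand is counted.

```python
from itertools import product

def expected_kmer_count(kmer, n, freqs=None):
    """Expected occurrences of kmer in an i.i.d. sequence of length n."""
    if freqs is None:
        freqs = {b: 0.25 for b in "ACGT"}  # assumption: equally likely bases
    p = 1.0
    for b in kmer:
        p *= freqs[b]
    return (n - len(kmer) + 1) * p  # number of start positions times P(kmer)

def absent_kmers(seq, k):
    """All k-mers over A, C, G, T that never occur in seq (one strand only)."""
    seq = seq.upper()
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return ["".join(t) for t in product("ACGT", repeat=k) if "".join(t) not in seen]

def windowed_at(seq, window, step):
    """A+T fraction in successive windows, to look at mean levels and trends."""
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("A") + chunk.count("T")) / window)
    return fractions

print(expected_kmer_count("GAATTC", 5_000_000))  # about 1220.7
print(windowed_at("AATTGGCC", 4, 4))             # [1.0, 0.0]
```

Passing observed base frequencies as `freqs` instead of the uniform default gives a somewhat more realistic i.i.d. expectation against which an absent hexamer can be judged.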

Let's review some recent research to get more motivation for this study.

The bases A, G are called purines and C, T pyrimidines. One can assign +1 to each occurrence of a purine and -1 to each pyrimidine, and add them up starting from one point of a strand. The resulting sequence of partial sums, which is a kind of random walk, is called the purine excess. The shape of its path is very interesting and shows a regular pattern in many cases. One can likewise group G, T (keto bases) against A, C (amino bases) to get what is called the keto excess. Freeman et al. (1998) show how, in some bacteria, the origins and termini of replication are extremes of these random walks.
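The excess walks are easy to compute. The sketch below is a minimal Python version under stated assumptions: the names `excess_walk`, `PURINES` and `KETO` are hypothetical, the keto walk uses the standard keto (G, T) versus amino (A, C) grouping, and symbols other than A, C, G, T contribute a step of 0.

```python
from itertools import accumulate

PURINES = frozenset("AG")  # the remaining bases C, T are pyrimidines
KETO = frozenset("GT")     # the remaining bases A, C are amino bases

def excess_walk(seq, plus_bases):
    """Cumulative sum of +1 for bases in plus_bases and -1 for the other bases."""
    steps = []
    for b in seq.upper():
        if b in plus_bases:
            steps.append(1)
        elif b in "ACGT":
            steps.append(-1)
        else:
            steps.append(0)  # assumption: ambiguous symbols leave the walk flat
    return list(accumulate(steps))

def extremes(walk):
    """Positions (0-based) of the first minimum and first maximum of the walk."""
    return walk.index(min(walk)), walk.index(max(walk))

print(excess_walk("AGCT", PURINES))  # [1, 2, 1, 0]
print(excess_walk("AGCT", KETO))     # [-1, 0, -1, 0]
```

On a real bacterial genome one would plot the walk against position and inspect where `extremes` falls; whether those positions correspond to an origin or terminus of replication has to be checked against independent annotation.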

Another result concerns the regular pattern of base composition found at splice sites. How can one use statistics to detect splice sites? See Stephens & Schneider for "sequence logos" on this point.

There are papers showing long-range autocorrelation in the occurrence of some of the 16 dinucleotides AA, AT, ..., CC; see the review by Li (1997).

To summarize, you can learn a lot even from simple statistics.






Simon Cawley
Thu May 14 03:30:08 PDT 1998

