Next: A first pass
Up: Stat 260: Statistics
Previous: Gene Structure
Current approaches to gene finding in newly sequenced genomic DNA can be roughly classified into
the following categories (or some combination of them):
- pattern recognition (aka statistics!): we will spend the most time
discussing this approach.
- protein homology: here we translate the sequence in all 6 reading frames (3 forward,
3 reverse) and compare each translated sequence to the entries in a protein database.
"Significant hits" (see the notes for week 15) signal that we are in a coding region.
As databases become more comprehensive, this method becomes more effective.
- use of expressed sequence tags (ESTs): ESTs are short fragments of
cDNA (DNA derived by reverse transcription from mRNA present
in the cell).
Since this cDNA has been derived from mRNA (which has already been processed by
spliceosomes), dealing with exons/introns and transcription start/stop
sites is no longer a problem, though one has to bear in mind that for
certain genes there may be multiple ways of splicing together the exons.
There are now very large databases of ESTs, and raw sequence can be compared directly
with these, without having to consider translation.
- intuition / "domain knowledge": arguably the best approach,
since at present human experts can outperform machines in many cases.
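As a rough illustration of the translation step in the protein homology approach above, the following sketch produces the 6 reading-frame translations of a DNA string (3 forward, 3 on the reverse complement). It assumes the standard genetic code and an input over the letters A, C, G, T; the function names and the frame labels ("+1" to "-3") are made up for this example and are not from the notes. The database comparison itself is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq.

    Trailing bases that do not fill a codon are dropped; codons with
    unrecognised letters (for example N) are written as 'X'.
    """
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(dna):
    """Map hypothetical frame labels '+1'..'+3', '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six protein strings would then be compared against a protein database; a long stretch without '*' in a frame that also produces a significant hit is the kind of signal this approach looks for.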
From here on we focus on the pattern recognition approach to gene finding.
Gene structure varies from the reasonably simple (for example, β-globin, which
has 3 exons and spans about 1500 bp) to the complex (for example, human factor VIII, which
is composed of 26 exons spanning 186,000 bp). There is also great diversity in
approaches to finding genes; see [3] for a summary of about 40
different pattern-recognition techniques that have been developed. The current state
of the art is a program called GENSCAN by Burge and Karlin, 1997 [1].
Simon Cawley
Fri May 1 15:50:13 PDT 1998