Another important step in the physical mapping process is the assembly of the map. In this section, we will discuss methods of reconstructing clone and marker orders in the context of STS-content mapping.
In STS-content mapping, we have N clones and M STSs, and also the
incidences between the STSs and clones. Clones hit by a common STS
overlap due to the uniqueness of the STS in the genome. The goal is to
retrieve the order of the clones and the STSs along the
genome, and estimate the distances between them if plausible. Ideally
either ordering result can be
inferred by the other, provided we have enough information. We will not
worry about the condition for the implication of clone order from the
STS order in this discussion.
We can reformulate the problem as the following: given a
incidence matrix I =
where
if STS j hits clone
i;
otherwise, find a permutation of the rows and the
columns of I such that the rows and the columns both have
the consecutive ones property. A 0-1 matrix is said to have
the consecutive ones property in the columns if all the 1s in each
column are grouped. The incidence matrix in the correct order
has the consecutive ones property since STSs hit by a common clone
should be grouped, and the clones hit by a STS are overlapping
neighboring clones. Apart from the impractical exhaustive search, this
problem actually can be solved in linear time (
)by a nice PQ-tree
algorithm (Booth and Leuker, 1976 [3]), assuming no errors
in the incidences. See Figure 6 for illustration of an
example.
Figure 6: An example of PQ-tree algorithm: A P-node is denoted by a circle, whose children may be rearranged in any order; a Q-node is represented by a rectangle, whose children are in order, up to reversal. The process illustrates the use of the PQ-tree data structure to permute the columns (the STSs) of the above clone-STS incidence matrix such that the resulting matrix has the consecutive-ones property. To start, put every STS (column) as a child node of a root P-node. At each step i, group the STSs that hit the same clone i, i.e., those nodes with a 1 in row i should be neighbors. If three or more nodes can only be arranged in one way (up to reversal) accordingly, put them as the children of a Q-node. Iterate the above step until all rows (constraints) have been checked. The final PQ-tree actually gives all compatible orderings in a very compact way. In this example, only two orderings are possible (which is the correct order of the STSs): FAEBCDG, and its reversal GDCBEAF.
With the presence of errors, one can either screen out possible errors prior to assembly and employ this PQ-tree algorithm, or use methods that are more robust to errors. There have been a lot of algorithms for this type of reordering problem in the literature, most of them use combinatorial optimization approaches that optimize an objective function w.r.t. the order, and use different techniques to search the permutation space. Here is a survey of them:
Use the well-developed Travelling-Salesman-Problem (TSP) algorithms to optimize an likelihood-based objective function. An equivalent form of the model will be discussed later.
Use the simulated annealing algorithm to minimize the path length of the probe order, where the distance between a pair of probes a, b,
, is given by the maximum-likelihood estimate
under the following model.
is the number of clones which hit both probes a, b, and
is the number of clones that hit either a or b. Assume that the clones are independent, then given
,
.
Use TSP algorithms to minimize the Hamming distance of the clone-probe incidence matrix, which corresponds to the number of gaps (0 in the block of 1's) of the probe ordering. Alternatively, minimize a weighted sum of the number of chimeric clones, the number of false positives, and the number of negatives.
Use of a number of different algorithms to optimize a gap-based cost function: the simulated annealing algorithm, greedy search, directed search, migration search, meta searches. And the cost function might be chosen from the number of gaps, sum of the sizes of gaps, sum of the square root of the sizes of gaps, etc.
They formulate the problem as a weighted betweenness problem, assuming the probes are from both ends of all clones. So the problem is cast as an integer linear programming problem with linear constraints derived from the ``betweenness'' of end probes, and an objective function as in Alizadeh [1] is optimized using the well-studied linear programming techniques.