Alternating Chip for SBH
In the paper [26], Pevzner and Lipshutz proposed three different non-classical chips for SBH. Here, we will discuss one of them, the alternating chip. Such a microarray uses an unspecified nucleotide, denoted as x, along with normal nucleotides represented by the letters from $\Sigma_{\mathrm{DNA}} = \{A, C, G, T\}$. Chip capacity tells how many probes the chip contains. The total capacity of the alternating chip is $\|C_{\mathrm{alt}}(k)\| = 2 \times 4^k$. The chip is composed of all probes of two types, described by the following patterns:
$$N_1 x N_2 x \ldots x N_k \quad \text{and} \quad N_1 x N_2 x \ldots x N_{k-1} N_k.$$
The number of x symbols is equal to $k - 1$ for the first type of probes and $k - 2$ for the second type. For both types, the number of known nucleotides, denoted above as N, is equal to $k$. These two types of probes form the two sets from which the hybridization spectrum is drawn; they are denoted $A_1$ and $A_2$, respectively. The length of oligonucleotides in $A_1$ is $l_1 = 2k - 1$, while in $A_2$ it is $l_2 = 2k - 2$. While in the classical approach each probe consists of multiple copies of exactly the same oligonucleotide, in the proposed non-classical microarray every probe is described by some pattern. This pattern, different for each probe, defines a set of natural oligonucleotides, i.e., the ones which can be described as strings over the alphabet $\Sigma_{\mathrm{DNA}}$.
For example, the probe denoted CxGxG comprises 16 different types of oligonucleotides: CAGAG, CAGCG, CAGGG, ..., CTGTG. The number of oligonucleotide types in each probe is equal to $4^{k-1}$ or $4^{k-2}$, depending on the type of the probe, as explained in the previous paragraph.
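To make the pattern notation concrete, the following minimal Python sketch (the function name is our own, not from [26]) enumerates the natural oligonucleotides covered by a probe pattern:

```python
from itertools import product

DNA = "ACGT"

def probe_oligos(pattern):
    """Enumerate all natural oligonucleotides matching a probe pattern,
    where 'x' marks an unspecified nucleotide."""
    xs = [i for i, c in enumerate(pattern) if c == "x"]
    for combo in product(DNA, repeat=len(xs)):
        s = list(pattern)
        for i, base in zip(xs, combo):
            s[i] = base
        yield "".join(s)

oligos = list(probe_oligos("CxGxG"))
print(len(oligos))              # 16, i.e., 4^2 for the two x positions
print(oligos[0], oligos[-1])    # CAGAG ... CTGTG
```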
After the hybridization experiment, the probes to which the analyzed DNA attached are used to form two subsets of $A_1$ and $A_2$, denoted $S_1$ and $S_2$.
Hybridization Errors
There are two types of errors that can occur during the hybridization phase: negative and positive ones. The first type is connected with a loss of information in the spectrum. There are two sources of such errors. The first one is connected with the detection technology: there can be probes that did not hybridize when they should have, or whose signal is so weak that it is not detected at all. The second source of negative errors lies in the analyzed DNA itself, which can be built from two or more identical subsequences, i.e., repetitive fragments. Such fragments hybridize with the same probe. In our algorithm, we assume no knowledge about the number of times different fragments hybridized with the same probe. There are articles dealing with such a problem, for example [16–18], but in our approach we treat negative errors resulting from repetitions as missing data.
The errors of the second type are called positive errors. They occur when for some reason probes hybridize (or are detected as such) when they should not, resulting in additional, false elements within the spectrum. When there are no hybridization errors at all, one obtains an ideal spectrum.
Depending on the type and source of the errors, our proposed algorithm behaves differently. We will now discuss the different scenarios connected with the types of hybridization errors explained above.
The simplest case occurs when an ideal spectrum is obtained. The cardinality of set $S_1$ is then equal to $n - l_1 - 1$, while the cardinality of set $S_2$ equals $n - l_2 - 1$. These spectrum sets are the subsets of $A_1$ and $A_2$ containing the elements representing probes that hybridized with the target DNA. The length $n$ of the target DNA sequence is known. An ideal spectrum means that there are no positive or negative errors of any type within it. If that ideal case could be achieved, the task would be to find a sequence that contains every element from both spectrum sets. Such strict conditions on the number of spectrum elements would allow quite fast and exact DNA reconstruction. The downside of such a scenario lies in its small likelihood. In practice, hybridization errors usually occur, both due to technical imperfections of the hybridization experiment and because of repetitions of subsequences of the target DNA, which cause negative errors.
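For reference, the ideal-case set sizes can be computed directly from $n$ and $k$. A small sketch (the helper name is ours), following the cardinality formulas stated above:

```python
def ideal_spectrum_sizes(n, k):
    """Ideal (error-free) cardinalities of the spectrum sets S1 and S2
    for a target of length n on an alternating chip with parameter k,
    following the formulas stated in the text."""
    l1 = 2 * k - 1                 # length of S1 elements: N x N x ... x N
    l2 = 2 * k - 2                 # length of S2 elements: N x ... x N N
    return n - l1 - 1, n - l2 - 1

print(ideal_spectrum_sizes(200, 4))   # (192, 193) for n = 200, k = 4
```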
The next considered scenario assumes only negative errors within the spectrum. They can appear in both $S_1$ and $S_2$, which can easily be detected when the cardinalities of the sets are smaller than the theoretical values given in the previous paragraph. In this case, all the elements from both sets have to be used to reconstruct the DNA. This scenario is more difficult, because the algorithm must compensate for the missing spectrum elements with the ones that are present, which can lower the efficiency of finding the exact DNA sequence. There is a higher probability of reconstructions which are similar, but not identical, to the analyzed DNA fragment.
The third case, when only positive errors are present in the spectrum, is not as difficult as the previous one. Knowing the size of the analyzed DNA and the length of the oligonucleotides in the microarray, the algorithm computes the sizes of the ideal spectrum sets. Then, it only allows solutions that contain exactly that number of elements from both sets.
The most complex case, when all types of errors are present, is unfortunately the most realistic one. The algorithm has no precise information about how many elements from the spectrum have to be used. Both positive and negative errors are present, and their number can only be estimated. In such a scenario, many sequences can be returned as the result of the computational phase, and without additional hybridization experiments it is impossible to decide which one of them is the target sequence. This is a major problem in the classical SBH approach. The algorithm we propose can handle such a realistic case, reconstructing the sequence precisely and fast.
The Algorithm
The algorithm reconstructs the analyzed DNA sequence using a spectrum obtained in a hybridization experiment with the probes of a non-classical alternating chip. The input data for the algorithm are as follows:
1. the spectrum obtained using the alternating chip, consisting of sets $S_1$ and $S_2$ (they are the subsets of $A_1$ and $A_2$ defined in the alternating chip description);
2. the length $n$ of the DNA fragment;
3. the parameter $k$ denoting the length of oligonucleotides ($l_1$ and $l_2$ can be computed as described in Sect. 2);
4. the sequence of the first $l_1 + 1$ nucleotides of the analyzed DNA fragment;
5. estimated, arbitrarily taken percentage values of negative and positive errors, which must exceed the real ones.
On the basis of the values of parameters $n$ and $k$, the algorithm computes the theoretical number of elements that would be needed to reconstruct the target DNA in an ideal case with no errors. Using only the elements from set $S_1$, a graph $G_{\mathrm{alt}}$ is constructed, in which two separate paths will be searched for in alternating order: the first one for the odd nucleotides of the target DNA sequence and the second one for its even nucleotides. Elements of $S_1$ are built over the alphabet $\Sigma_{\mathrm{alt}} = \{A, C, G, T, x\}$. Every element from $S_1$ is a vertex in the graph $G_{\mathrm{alt}}$. Arcs are created on the basis of the overlapping of letters from the alphabet $\Sigma_{\mathrm{DNA}} = \{A, C, G, T\}$ only. Each $S_1$ element consists of $l_\sigma = (l_1 + 1)/2 = k$ letters from $\Sigma_{\mathrm{DNA}}$. Therefore, the possible overlapping ranges from $l_\sigma - 1$ letters down to 1. The maximum overlapping of $l_\sigma - 1$ letters corresponds to an arc weight equal to 1, while the minimal possible overlapping of one letter corresponds to the maximum weight, equal to $l_\sigma - 1$. For example, vertices AxGxG and GxGxC overlap on the labels GxG and G. This corresponds to two arcs going from AxGxG to GxGxC, having weights 1 and 2, respectively.
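A minimal sketch of this construction (assuming $S_1$ elements are given as pattern strings such as AxGxG; the function name is ours):

```python
from collections import defaultdict

def build_g_alt(s1_elements, k):
    """Sketch of G_alt: vertices are S1 elements; an arc u -> v with weight
    k - j exists when the last j known nucleotides of u equal the first j
    known nucleotides of v, i.e., the trailing 2*j - 1 characters of u
    (x separators included) equal the leading 2*j - 1 characters of v."""
    graph = defaultdict(list)              # vertex -> list of (successor, weight)
    for u in s1_elements:
        for v in s1_elements:
            for j in range(k - 1, 0, -1):  # j = number of overlapping letters
                span = 2 * j - 1
                if u[-span:] == v[:span]:
                    graph[u].append((v, k - j))
    return graph

# Example with k = 3 (l1 = 5): the two arcs from the text
g = build_g_alt(["AxGxG", "GxGxC"], k=3)
print(g["AxGxG"])   # [('GxGxC', 1), ('GxGxC', 2)] - overlaps GxG and G
```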
From the input data, one has knowledge about the very beginning of the target sequence. Assuming that the letter x in spectrum elements overlaps freely with any letter from $\Sigma_{\mathrm{DNA}}$, one can easily obtain two $S_1$-like elements corresponding to the two starting vertices in graph $G_{\mathrm{alt}}$. For example, if the starting element of the target DNA is ACGCGAAT, then the two starting vertices (for the odd and even nucleotide paths, respectively) are AxGxGxA and CxCxAxT. From now on, we will refer to the odd and even nucleotide paths as $P_o$ and $P_e$, respectively.
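The two starting vertices can be read off the known prefix directly; a small sketch:

```python
def start_vertices(prefix):
    """Split the known first l1 + 1 nucleotides into the two starting
    vertices of G_alt: odd positions seed path Po, even positions seed Pe."""
    odd = "x".join(prefix[0::2])     # nucleotides at positions 1, 3, 5, ...
    even = "x".join(prefix[1::2])    # nucleotides at positions 2, 4, 6, ...
    return odd, even

print(start_vertices("ACGCGAAT"))    # ('AxGxGxA', 'CxCxAxT')
```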
The x symbol overlaps with itself and with every letter from $\Sigma_{\mathrm{DNA}}$. This feature will be used later by the algorithm. For this reason, in the phase of graph creation, overlapping only on letters from $\Sigma_{\mathrm{DNA}}$ is the only logical choice. In addition, to allow the overlapping of such vertex labels, the normal nucleotides must be evenly spaced; this is true for the elements from the $S_1$ subset, but not for $S_2$. Each element of the latter is one base shorter than those from $S_1$, and its last two letters (from $\Sigma_{\mathrm{DNA}}$) are placed directly one after the other, without the x separator. The last nucleotide in each element from $S_2$ is at an even position, like all its x letters. This means that if one would like to create a graph using elements from this subset, the maximal possible overlapping would be one position shorter compared to the situation where elements from $S_1$ are used. This would result in a more densely connected graph and, as a result, in an increased difficulty in finding the correct paths in it. For this reason, only elements from $S_1$ create the search graph, while elements from $S_2$ are used for the verification of connections on the searched paths, as will be explained later in detail. The pseudo-code for the main loop of the algorithm is given in Fig. 1.
As a first step, the algorithm chooses one new vertex for the odd nucleotides path $P_o$. Then, it chooses a new one for $P_e$, and the procedure continues until both paths have the desired length, which allows target DNA reconstruction. In each step, elements from subset $S_2$ are used to verify the validity of a newly chosen vertex. The algorithm adds one or more nucleotides of a given type (odd or even) to the reconstructed sequence, depending on the overlap value of the new vertex used, i.e., the weight of the arc taken. Weight 1 means adding a single new nucleotide, weight 2 adding two nucleotides, etc. The list of already visited vertices, with values corresponding to their overlapping, is denoted as solution. The vertices from both paths are alternately put into this list. There are two important integer values, steps and maxsteps. The latter represents the total of vertex weights that must be reached for the odd and even paths to be ready to reconstruct the target sequence. The steps value represents the current number of vertex weights accumulated. Every new vertex adds to steps a value equal to the weight of the arc connecting it with the previously chosen vertex in a given graph.
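Fig. 1 is not reproduced here; the following Python sketch captures the alternating, backtracking extension described above (the names and the verify() helper, sketched after the verification discussion below, are our own abstractions, not the paper's pseudo-code):

```python
def reconstruct(graph, s2, start_odd, start_even, maxsteps):
    """Alternately extend paths Po and Pe by depth-first search with
    backtracking until each has accumulated maxsteps of arc weights.
    Candidate vertices must be unused and pass verification against S2."""
    paths = {"odd": [start_odd], "even": [start_even]}
    steps = {"odd": 0, "even": 0}
    used = {start_odd, start_even}

    def extend(turn):
        other = "even" if turn == "odd" else "odd"
        if steps["odd"] >= maxsteps and steps["even"] >= maxsteps:
            return True                       # both paths are long enough
        if steps[turn] >= maxsteps:
            return extend(other)              # this path is done; switch
        for succ, weight in graph.get(paths[turn][-1], []):
            if succ in used or not verify(succ, weight, paths[other][-1], s2):
                continue
            paths[turn].append(succ); used.add(succ); steps[turn] += weight
            if extend(other):                 # alternate between the paths
                return True
            paths[turn].pop(); used.discard(succ); steps[turn] -= weight
        return False                          # dead end: backtrack

    return paths if extend("odd") else None
```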
The verification of a new vertex is a complex process which depends on the types of errors in the spectrum. The algorithm can work in four modes: no errors, only positive errors, only negative errors, or both types of errors present. The description of the verification process will begin with the case when there are no negative errors in the $S_2$ subset. An example of the verification process is given in Fig. 2.
In the given example, the algorithm tries to extend the odd path $P_o$. There are three possible new vertices connected by arcs of weight 1 with the vertex named AxTxGxGxTxA (Fig. 2a). They will add nucleotide C, G, or T (underlined black), depending on which one is chosen. We assume that the example contains only vertices that have not already been visited in either of the two paths. Part (b) shows the name of the last vertex already extending the even path $P_e$. Using its postfix (underlined dark grey) and the last letters of the vertices that can potentially extend path $P_o$, the verification process builds potential $S_2$-like elements (Fig. 2c). If any of them is found in spectrum set $S_2$, the corresponding vertex from part (a) of Fig. 2 is marked as verified. In the example, only the vertex TxGxGxTxAxC is verified properly. It is not possible to choose any vertex other than the ones which have been verified; doing so would result in an incorrect reconstruction of the target DNA sequence.
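A sketch of this check, under the same string representation as before (the element construction follows the description of Fig. 2; the helper name, its signature, and the example values are ours):

```python
def verify(candidate, weight, other_last, s2):
    """Verify a candidate vertex against S2: build the S2-like element from
    the postfix of the last vertex of the *other* path plus the first new
    letter contributed by the candidate, and test its presence in S2."""
    l2 = len(other_last) - 1                    # S2 elements are one char shorter
    new_letter = candidate[-(2 * weight - 1)]   # first newly added nucleotide
    s2_like = other_last[-(l2 - 1):] + new_letter
    return s2_like in s2

# Fig. 2-style check with illustrative values: candidate TxGxGxTxAxC
# extending Po passes if the built element is present in S2.
print(verify("TxGxGxTxAxC", 1, "CxGxTxAxGxT", {"GxTxAxGxTC"}))  # True
```

In the negative-errors mode described below, this check would simply be skipped for arcs of weight greater than 1, since the required $S_2$ element may be missing from the spectrum.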
As it has been stated before, vertices can also be connected with arcs having weights greater than 1. It means that they overlap on a shorter label. Such vertices, if chosen, will add two or more new nucleotides (odd or even) to the reconstructed DNA fragment. In such a case, only the first nucleotide extending a given path will (and can) be verified. An example is given in Fig. 3.
In the example, the algorithm has two possible new vertices for the even path, i.e., vertices GxAxTxC and AxTxCxT. The first one will extend the solution by only one even nucleotide, C, and the second one will extend it by two even nucleotides, C and T. They are denoted as even#3_ovlap1 and even#3_ovlap2. Verification of the first one can be performed as described in the previous example. For the second one, only the first extending nucleotide (C) can be verified; the procedure is analogous to that for even#3_ovlap1. Nucleotide T cannot be verified at this moment, because the algorithm is unable to create an $S_2$-like element for it. This is because vertex odd#3 extending the odd path $P_o$ is placed there with too long an overlapping (maximal, in fact) to help create a proper $S_2$ element for even#3_ovlap2. If the odd#3 vertex had a shorter overlap, and therefore extended the odd path by more than one nucleotide from $\Sigma_{\mathrm{DNA}}$, this would be possible.

As one can see, set $S_2$ is crucial in 'connecting' the odd and even nucleotide paths. The verification makes the search space much smaller by reducing the number of potential vertices that can extend each path. Different types of errors have a different influence on the verification. Negative errors following from repetitions have no impact on the process: even if an $S_2$-like fragment hybridizes with different parts of the DNA, it will be present in this set at least once, and the algorithm does not count how many times this element verified elements from $S_1$. Positive errors increase the number of false verifications. The test results presented in this paper prove that their impact on the overall effectiveness of the algorithm is minimal if they are the only type of errors. The most serious situation takes place when there are negative errors in the spectrum resulting not only from repetitions but also from losing data about some probes that in fact did hybridize. Elements from $S_1$ used to create the search graph can compensate for this using shorter overlapping. Unfortunately, spectrum set $S_2$ used for verification is especially susceptible to such errors. As one can see in Fig. 3, to verify vertex even#3_ovlap1, the element TxAxGC, created from the postfix of odd#3 and the last letter of even#3_ovlap1, must be present in $S_2$. Its loss due to negative errors makes such a verification impossible. Therefore, if the algorithm knows about the presence of this type of negative errors, the verification process is adjusted: only vertex labels that overlap in the $P_o$ or $P_e$ path on the maximal possible length (i.e., $l_1 - 2$) are verified, and shorter overlapping is accepted without verification. In our example, vertex even#3_ovlap1 will not be verified due to the absence of the verification element TxAxGC in $S_2$. However, the vertex even#3_ovlap2 will be accepted, contributing both new letters for this path: C and T.
There is another mechanism that participates in minimizing the number of potential solutions to the sequencing problem. The paths $P_o$ and $P_e$ are searched in such a way that they cannot contain the same vertex from graph $G_{\mathrm{alt}}$. This follows directly from the assumption that $S_1$ is not a multiset (and neither is $S_2$). Therefore, it is possible that a vertex misplaced in some path will be missing later in the reconstruction and will not be replaced by a vertex with longer overlapping. If the algorithm is not able to reconstruct a sequence of the desired length, it has to go back and try different paths. This feature makes the search process longer, but it also reduces the number of ambiguous solutions. This is obviously a trade-off: if enough elements corresponding to neighboring locations in the original DNA are missing from $S_1$, there is a risk that the algorithm will not be able to reconstruct the correct sequence.
There are a few situations in which the algorithm reverses the steps already taken, i.e., the vertices taken last are discarded and new ones are chosen. The reasons for this can be divided into the two categories given below.
1. There are no arcs leading from the current vertex to new vertices that can be chosen. In most cases, this means that there are in fact arcs, but they lead to vertices already taken or to vertices that cannot be verified properly.
2. The algorithm has just created a new solution, and going back is necessary to search the rest of the search space.
When the algorithm reaches the desired length for both paths, the target DNA sequence is reconstructed. This step is simple: path $P_o$ provides all the odd nucleotides, and path $P_e$ all the even ones. If there is still time left, the algorithm reverses its last steps back to the last unvisited but verified vertex and tries to reconstruct more sequences from this point. Ambiguous reconstructions are possible but, as the results prove, such a situation is very rare. Much more likely is the situation when, in a given short time, the algorithm presents an unambiguous reconstruction, identical to the target DNA sequence.
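Once both paths are complete, producing the sequence is a simple interleave; a sketch, assuming the nucleotides recovered along $P_o$ and $P_e$ have been collected into two strings:

```python
def merge_paths(odd_letters, even_letters):
    """Interleave the nucleotides recovered by Po and Pe: odd positions of
    the target come from Po, even positions from Pe."""
    merged = []
    for o, e in zip(odd_letters, even_letters):
        merged += [o, e]
    merged += odd_letters[len(even_letters):]   # Po may be one letter longer
    return "".join(merged)

print(merge_paths("AGGA", "CCAT"))   # 'ACGCGAAT' - the earlier prefix example
```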
Now that the double-path reconstruction and the verification procedures have been explained, it should be clear why this two-phase approach (i.e., choosing candidates for the extension of a path, then verifying them) has been implemented. It is impossible to use $S_2$ elements for arc verification while the graph is being constructed: the correctness or incorrectness of a connection between a pair of vertices in a given path depends on the sequence of reconstructed nucleotides in the other path. Therefore, it is not possible to decide this before the actual sequence reconstruction begins.
As a last remark, a different approach to the algorithm construction can be considered, in which the elements from $S_2$ create the search graph, while set $S_1$ is used for the verification. Theoretically, this is possible, and one of the many differences of such an idea would be that the verification by the $S_1$ elements would not be required to successfully connect both paths. It would of course be used to help reduce ambiguous reconstructions, but the elements from $S_2$ alone would suffice to connect the paths, because their last two natural nucleotides are placed directly one after the other, without an x in between. The explanation of why this scenario has been rejected was in fact given in the paragraph where the construction of the search graph is explained: elements from $S_2$ are always shorter than those from $S_1$; therefore, the maximal possible overlapping would be shorter as well. This would create a more densely connected graph and definitely increase the search space, making the reconstruction more difficult.