1 Introduction

Arithmetic circuits are important components of processor designs as well as of special-purpose hardware for computationally intensive applications like signal processing and cryptography. At the latest since the famous Pentium bug [12] in 1994, where a subtle design error in the divider had escaped Intel’s design validation (leading to erroneous Pentium chips being brought to the market), it has been widely recognized that incomplete simulation-based approaches are not sufficient and that formal methods should be used to verify the correctness of arithmetic circuits. Nowadays the design of circuits containing arithmetic is not confined to the major processor vendors; it is also done by many suppliers of special-purpose embedded hardware who cannot afford to employ large teams of specialized verification engineers capable of providing human-assisted theorem proofs. Therefore the interest in fully automatic formal verification of arithmetic circuits is growing steadily.

In particular, the verification of multiplier and divider circuits posed a major problem for a long time. Both BDD-based methods [4, 8] and SAT-based methods [16, 41] for multiplier and divider verification do not scale to large bit widths. Nevertheless, there has been great progress during the last few years in the automatic formal verification of gate-level multipliers. Methods based on Symbolic Computer Algebra (SCA) were able to verify large, structurally complex, and highly optimized multipliers. In this context, finite field multipliers [24], integer multipliers [9, 15, 19, 20, 25,26,27,28, 33, 34, 38, 46, 49, 50], and modular multipliers [29] have been considered. Here the verification task is reduced to an ideal membership test for the specification polynomial based on so-called backward rewriting, proceeding from the outputs of the circuit in the direction of the inputs. For integer multipliers, SCA-based methods are closely related to verification methods based on word-level decision diagrams like *BMDs [6, 7, 18], since polynomials can be seen as “flattened” *BMDs [39]. Moreover, rewriting-based approaches [43, 44] have recently been shown to be able to verify complex multipliers as well as arithmetic modules with embedded multipliers at the register transfer level.

Research approaches for divider verification lagged behind for a long time. Attempts to use Decision Diagrams for proving the correctness of an SRT divider [35] were confined to a single stage of the divider (at the gate level) [5]. Methods based on word-level model checking [10] looked into SRT division as well, but considered only a special abstract and clean sequential (i.e., non-combinational) divider without gate-level optimizations. Other approaches like [36], [11], or [31] looked into fixed division algorithms and used semi-automatic theorem proving with ACL2, Analytica, or Forte to prove their correctness. Nevertheless, all those efforts did not lead to a fully automated verification method suitable for gate-level dividers.

A side remark in [18] (where actually multiplier verification with *BMDs was considered) seemed to provide an idea for a fully automated method to verify integer dividers as well. Hamaguchi et al. start with a *BMD representing \(Q \times D + R\) (where Q is the quotient, D the divisor, and R the remainder of the division) and use a backward construction to replace the bits of Q and R step by step by *BMDs representing the gates of the divider. The goal is to finally obtain a *BMD representation for the dividend \(R^{(0)}\) which proves the correctness of the divider circuit. Unfortunately, the approach has not been successful in practice: Experimental results showed exponential blow-ups of *BMDs during the backward construction.

Recently, there have been several approaches to fully automatic divider verification aiming to catch up with the successful approaches to multiplier verification: Among them, [47] is mainly confined to division by constants and cannot handle general dividers due to a memory explosion problem. [48] works at the gate level, but assumes that hierarchy information in a restoring divider is present. Using this hierarchy information, it decomposes the proof obligation \(R^{(0)} = Q \times D + R\) into separate proof obligations for each level of the restoring divider. Nevertheless, the approach scales only to medium-sized bit widths (up to 21, as shown in the experimental results of [48]).

Our approaches from [39, 40] work on the gate level as well, but they do not need any hierarchy information, which may have been lost during logic optimization. We prove the correctness of non-restoring dividers by “backward rewriting”, starting with the “specification polynomial” \(Q \times D + R - R^{(0)}\) (similar to [18], with polynomials instead of *BMDs as the internal data structure). Backward rewriting substitutes gate output variables with the gates’ specification polynomials in reverse topological order. We try to prove dividers correct by finally obtaining the 0-polynomial. The main insight of [39, 40] is the following: Backward rewriting definitely needs “forward information propagation” to be successful, otherwise it fails due to exponential sizes of intermediate polynomials. Forward information propagation relies on the fact that the divider needs to work only within a range of allowed divider inputs (leading to input constraints like \(0 \le R^{(0)} < D \cdot 2^{n-1}\)). Scholl and Konrad [39] use SAT Based Information Forwarding (SBIF) of the input constraint in order to derive information on equivalent and antivalent signals, whereas [40] uses BDDs to compute satisfiability don’t cares which result from the structure of the divider circuit as well as from the input constraint. (Satisfiability don’t cares [37] at the inputs of a subcircuit describe value combinations which cannot be produced at those inputs by allowed assignments to the primary inputs.) The don’t cares are used to minimize the sizes of polynomials. In that way, exponential blow-ups in polynomial sizes, which would occur without don’t care optimization, could be effectively avoided. Since the polynomials are only changed for input values which cannot occur in the circuit as long as only inputs from the allowed range are applied, verification with don’t care optimization remains correct.
In [40] the computation of optimized polynomials is reduced to suitable Integer Linear Programming (ILP) problems.

This paper is an extended version of the conference publication [22]. In Konrad et al. [22] we make two contributions to improve [39, 40]: First, we modify the computation of don’t cares, leading to increased degrees of flexibility for the optimization of polynomials. Instead of computing don’t cares at the inputs of “atomic blocks” like full adders, half adders etc., which were detected in the gate-level netlist, we combine atomic blocks and surrounding gates into larger fanout-free cones, leading to so-called Extended Atomic Blocks (EABs), prior to the don’t care computation. Second, we replace local don’t care optimization by Delayed Don’t Care Optimization (DDCO). Whereas local don’t care optimization immediately optimizes polynomials wrt. a don’t care cube as soon as the polynomial contains the input variables of the cube, DDCO only adds don’t care terms to the polynomial, but delays the optimization until a later time. This method has two advantages: First, by looking at the polynomial later on, we can decide whether exploitation of certain don’t cares is needed at all, and second, the later (delayed) optimization takes the effect of subsequent substitutions into account and thus uses a more global view for optimization. Using those novel methods we are able to extend the applicability of SCA-based methods from [39, 40] to further optimized non-restoring dividers and restoring dividers which could not be handled by previous approaches.

This paper considerably extends [22] by additional proofs and analyses. We provide a thorough analysis of usual optimizations to divider designs that lead to exponential representation sizes of the polynomials when our novel optimizations are not used, and we even provide detailed proofs of this exponential growth in memory consumption. Details on don’t care computation for Extended Atomic Blocks are added, and an exponential gap between representation sizes for local don’t care optimization and delayed don’t care optimization is proved. In addition to the SCA-based proof that \(R^{(0)} = Q \times D + R\) (with dividend \(R^{(0)}\), divisor D, quotient Q, and remainder R), we also consider discharging the second verification condition \(0 \le R < D\), which was omitted in Konrad et al. [22] due to lack of space. Fortunately, this second verification condition can easily be discharged by slightly extending the BDD-based computation of satisfiability don’t cares which we have to perform anyway in the context of polynomial optimization.

The paper is structured as follows: In Sect. 2 we provide background on SCA and divider circuits. In Sect. 3 we analyze the polynomials which we would expect during backward rewriting from a high-level point of view. Sect. 4 looks into additional challenges for backward rewriting of dividers which result from usual design optimizations. In particular, we provide a formal proof that backward rewriting without making use of forward information propagation leads to intermediate polynomials of exponential size. We motivate the need for novel optimizations by analyzing the existing approaches in Sect. 5. In Sections 6 and 7 we present the novel approach. The approach is evaluated in Sect. 8 and we conclude with final remarks in Sect. 9.

2 Preliminaries

2.1 SCA for verification

For the presentation of SCA we basically follow [39]. SCA-based approaches work with polynomials and reduce the verification task to an ideal membership test using a Gröbner basis representation of the ideal. The ideal membership test is performed using polynomial division. While Gröbner basis theory is very general and, e.g., can be applied to finite field multipliers [24] and truncated multipliers [19] as well, for integer arithmetic it boils down to substitutions of variables for gate outputs by polynomials over the gate inputs (in reverse topological order), if we choose an appropriate “term order” (see [38] or [34], e.g.). Here we restrict ourselves to exactly this view. Experimental results for multipliers in Ritirc et al. [34] showed that those substitutions were much more efficient than polynomial divisions in computer algebra systems, even though the computed results were the same.

For integer arithmetic we consider polynomials over binary variables (from a set \(X = \{x_1, \ldots , x_n\}\)) with integer coefficients, i. e., a polynomial is a sum of terms, a term is a product of a monomial with an integer, and a monomial is a product of variables from X. Polynomials represent pseudo-Boolean functions \(f : \{0, 1\}^n \mapsto \mathbb {Z}\). We denote the number of terms in a polynomial p by the size |p| of p.

As a simple example, consider the full adder from Fig. 1. The full adder defines a pseudo-Boolean function \(f_{FA} : \{0, 1\}^3 \mapsto \mathbb {Z}\) with \(f_{FA}(a_0, b_0, c) = a_0 + b_0 + c\). We can compute a polynomial representation for \(f_{FA}\) by starting with a weighted sum \(2c_0 + s_0\) (called the “output signature” in [49]) of the output variables. Step by step, we replace the variables in the polynomial by the so-called “gate polynomials”. This replacement is performed in reverse topological order of the circuit, see Fig. 1. We start by replacing \(c_0\) in \(2c_0 + s_0\) by its gate polynomial \(h_2 + h_3 - h_2 h_3\) (which is derived from the Boolean function \(c_0 = h_2 \vee h_3\)). Finally, we arrive at the polynomial \(a_0 + b_0 + c\) (called the “input signature” in Yu et al. [49]) representing the pseudo-Boolean function defined by the circuit. During this procedure (which is called backward rewriting) the polynomials are simplified by reducing powers \(v^k\) of variables v with \(k > 1\) to v (since the variables are binary), by combining terms with identical monomials into one term, and by omitting terms with leading factor 0. We can also consider \(a_0 + b_0 + c = 2c_0 + s_0\) as the “specification” of the full adder. The circuit implements a full adder iff backward substitution, now starting with \(2c_0 + s_0 - a_0 - b_0 - c\) instead of \(2c_0 + s_0\), reduces the “specification polynomial” to 0 in the end. (This is the notion usually preferred in SCA-based verification.)
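The substitution process just described can be sketched in a few lines of Python. The polynomial data structure (a map from monomials, i.e., sets of binary variables, to integer coefficients) follows the definitions above; the internal gate decomposition \(h_1 = a_0 \oplus b_0\), \(h_2 = h_1 \wedge c\), \(h_3 = a_0 \wedge b_0\) is an illustrative assumption, chosen to match the gate polynomial for \(c_0 = h_2 \vee h_3\) mentioned above.

```python
from itertools import product

# A polynomial is a dict mapping a monomial (frozenset of binary
# variables) to its integer coefficient; frozenset() is the constant monomial.
def multiply(p, q):
    r = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        m = m1 | m2                      # v^k -> v: variables are binary
        r[m] = r.get(m, 0) + c1 * c2
        if r[m] == 0:                    # omit terms with factor 0
            del r[m]
    return r

def substitute(p, var, q):
    """One backward rewriting step: replace `var` in p by the gate polynomial q."""
    r = {}
    for mono, coef in p.items():
        part = multiply({mono - {var}: coef}, q) if var in mono else {mono: coef}
        for m, c in part.items():
            r[m] = r.get(m, 0) + c
            if r[m] == 0:
                del r[m]
    return r

# Gate polynomials for two-input AND, XOR, OR.
AND = lambda x, y: {frozenset([x, y]): 1}
XOR = lambda x, y: {frozenset([x]): 1, frozenset([y]): 1, frozenset([x, y]): -2}
OR  = lambda x, y: {frozenset([x]): 1, frozenset([y]): 1, frozenset([x, y]): -1}

# Output signature 2*c0 + s0 of the full adder.
p = {frozenset(['c0']): 2, frozenset(['s0']): 1}
# Substitute gate polynomials in reverse topological order
# (assumed decomposition: h1 = a0^b0, h2 = h1&c, h3 = a0&b0, c0 = h2|h3, s0 = h1^c).
for out, gate in [('c0', OR('h2', 'h3')), ('s0', XOR('h1', 'c')),
                  ('h2', AND('h1', 'c')), ('h3', AND('a0', 'b0')),
                  ('h1', XOR('a0', 'b0'))]:
    p = substitute(p, out, gate)

# Input signature: a0 + b0 + c.
assert p == {frozenset(['a0']): 1, frozenset(['b0']): 1, frozenset(['c']): 1}
```

The simplifications mentioned above happen implicitly: taking the union of monomials realizes \(v^k \rightarrow v\), and zero coefficients are dropped immediately.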

Fig. 1
figure 1

Circuit with gate polynomials (upper right part) and series of substitutions (lower right part)

The correctness of the sketched method can be derived from results of SCA or simply from the fact that polynomials are canonical representations of pseudo-Boolean functions as stated in Lemma 1 (see also [39]):

Lemma 1

Polynomials (with the previously mentioned simplifications resp. normalizations) are canonical representations of pseudo-Boolean functions (up to reordering of the terms).

Proof

Let \(p_1\) and \(p_2\) be different polynomials. Then one of the two polynomials contains a term that is not included in the other. Let \(t = zx_{i_1} \ldots x_{i_k}\) with \(z \ne 0\) be the shortest term with this property. W.l.o.g. it belongs to \(p_1\). Consider the valuation \(x_{i_1} = \ldots = x_{i_k} = 1\) and \(x_j = 0 \; \text {for all} \; x_j \in X \setminus \{x_{i_1}, \ldots , x_{i_k}\}\). This valuation evaluates \(p_1 - p_2\) to \(z - z'\), if \(p_2\) contains \(t' = z'x_{i_1} \ldots x_{i_k}\) with \(z' \ne z\) and to z otherwise. This shows that \(p_1\) and \(p_2\) represent different pseudo-Boolean functions. \(\square\)

Algorithm 1
figure a

Restoring division.

2.2 Divider circuits

In the following we briefly review textbook knowledge on dividers, see [23], e.g., for more details. We use \(\langle a_n, \ldots , a_0\rangle := \sum _{i=0}^n a_i 2^i\) and \([a_n, \ldots , a_0]_2 := (\sum _{i=0}^{n-1} a_i 2^i) - a_n 2^n\) for interpretations of bit vectors \((a_n, \ldots , a_0) \in \{0, 1\}^{n+1}\) as unsigned binary numbers and two’s complement numbers, respectively. The leading bit \(a_n\) is called the sign bit. An unsigned integer divider is a circuit with the following property:
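The two interpretations can be written down directly (a small illustrative Python helper; bit vectors are given MSB-first as in the notation above):

```python
def unsigned(bits):
    """<a_n, ..., a_0>: unsigned binary value of an MSB-first bit tuple."""
    return sum(b << i for i, b in enumerate(reversed(bits)))

def signed(bits):
    """[a_n, ..., a_0]_2: two's complement value; the sign bit a_n has weight -2^n."""
    n = len(bits) - 1
    return unsigned(bits[1:]) - (bits[0] << n)

# Example: (1, 0, 1) has unsigned value 5 and two's complement value -4 + 1 = -3.
assert unsigned((1, 0, 1)) == 5
assert signed((1, 0, 1)) == -3
```

For bit vectors with sign bit 0 both interpretations coincide, as used in Definition 1 below.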

Definition 1

Let \((r_{2n-2}^{(0)} \ldots r_0^{(0)})\) be the dividend with sign bit \(r_{2n-2}^{(0)} = 0\) and value \(R^{(0)} := \langle r_{2n-2}^{(0)} \ldots r_0^{(0)}\rangle = [r_{2n-2}^{(0)} \ldots r_0^{(0)}]_2\), \((d_{n-1} \ldots d_0)\) be the divisor with sign bit \(d_{n-1} = 0\) and value \(D := \langle d_{n-1} \ldots d_0\rangle = [d_{n-1} \ldots d_0]_2\), and let \(0 \le R^{(0)} < D \cdot 2^{n-1}\). Then \((q_{n-1} \ldots q_0)\) with value \(Q = \langle q_{n-1} \ldots q_0 \rangle\) is the quotient of the division and \((r_{n-1} \ldots r_0)\) with value \(R = [r_{n-1} \ldots r_0]_2\) is the remainder of the division, if \(R^{(0)} = Q \cdot D + R\) (verification condition 1 = “vc1”) and \(0 \le R < D\) (verification condition 2 = “vc2”).

Note that we consider here the case that the dividend has twice as many bits as the divisor (without counting sign bits). This is similar to multipliers where the number of product bits is two times the number of bits of one factor. If both the dividend and the divisor are supposed to have the same length, we just set \(r_{2n-2}^{(0)} = \ldots = r_{n-1}^{(0)} = 0\) and require \(D > 0\). Then \(D > 0\) immediately implies \(0 \le R^{(0)} < D \cdot 2^{n-1}\).

The simplest algorithm to compute quotient and remainder is restoring division which is the “school method” to compute quotient bits and “partial remainders” \(R^{(j)}\). Restoring division is shown in Algorithm 1. In each step it subtracts a shifted version of D. If the result is less than 0, the corresponding quotient bit is 0 and the shifted version of D is “added back”, i. e., “restored”. Otherwise the quotient bit is 1 and the algorithm proceeds with the next smaller shifted version of D. Figure 2 shows an implementation of a 4-bit restoring divider.
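The steps of Algorithm 1 can be sketched as follows (an illustrative Python model of the algorithm itself, not of the gate-level circuit of Fig. 2):

```python
def restoring_division(R0, D, n):
    """Restoring division (Algorithm 1): returns quotient Q and remainder R.
    Requires the input constraint 0 <= R0 < D * 2^(n-1)."""
    assert 0 <= R0 < (D << (n - 1))
    q = [0] * n
    R = R0
    for j in range(1, n + 1):            # compute q_{n-1}, ..., q_0
        R -= D << (n - j)                # subtract shifted divisor
        if R < 0:
            q[n - j] = 0
            R += D << (n - j)            # restore the partial remainder
        else:
            q[n - j] = 1
    Q = sum(bit << i for i, bit in enumerate(q))
    return Q, R
```

For instance, `restoring_division(13, 3, 4)` yields quotient 4 and remainder 1, i.e., \(13 = 4 \cdot 3 + 1\), satisfying both verification conditions.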

Fig. 2
figure 2

Restoring divider, \(n=4\)

Non-restoring division optimizes restoring division by combining two steps of restoring division in case of a negative partial remainder: adding the shifted D back and (tentatively) subtracting the next D shifted by one position less. These two steps are replaced by just adding D shifted by one position less (which obviously leads to the same result). More precisely, non-restoring division works according to Algorithm 2.

Algorithm 2
figure b

Non-restoring division.
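In the same illustrative Python style as before, Algorithm 2 can be modeled as follows (quotient bits are derived from the signs of the partial remainders; the final stage adds D back when \(R^{(n)} < 0\)):

```python
def non_restoring_division(R0, D, n):
    """Non-restoring division (Algorithm 2): returns quotient Q and remainder R.
    Requires the input constraint 0 <= R0 < D * 2^(n-1)."""
    assert 0 <= R0 < (D << (n - 1))
    q = [0] * n
    R = R0 - (D << (n - 1))              # stage 1: R^(1), unconditional subtract
    for j in range(2, n + 1):            # stages 2, ..., n
        q[n - j + 1] = 1 if R >= 0 else 0
        # add if the partial remainder is negative, subtract otherwise
        R += (1 - 2 * q[n - j + 1]) * (D << (n - j))
    q[0] = 1 if R >= 0 else 0
    R += (1 - q[0]) * D                  # stage n+1: final correction
    Q = sum(bit << i for i, bit in enumerate(q))
    return Q, R
```

Note how the restore-and-subtract pair of restoring division collapses into a single addition of the divisor shifted by one position less.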

Fig. 3 shows an implementation of a 4-bit non-restoring divider.

Fig. 3
figure 3

(Optimized) non-restoring divider, \(n=4\)

SRT dividers are most closely related to non-restoring dividers, with the main differences being that quotient bits are computed by look-up tables (based on a constant number of partial remainder bits) and that redundant number representations are used, which allow the use of constant-time adders. Other divider architectures like Newton and Goldschmidt dividers rely on iterative approximation. In this paper we restrict our attention to restoring and non-restoring dividers.

3 High-level view of backward rewriting for dividers

For dividers it is natural to start backward rewriting not with polynomials for the binary representations of the output words (as is basically done for multiplier verification), but with a polynomial for \(Q \cdot D + R\). For a correct divider one would expect to obtain a polynomial for \(R^{(0)}\) after backward rewriting. Alternatively, one could also start with \(Q \cdot D + R - R^{(0)}\), and one would expect that for a correct divider the result after backward rewriting is 0. This would prove verification condition (vc1). (It then remains to show that \(0 \le R < D\) (vc2), which we postpone until later.)

This idea was already proposed by Hamaguchi et al. in 1995 [18] in the context of verification using *BMDs [7]. Although [18] used *BMDs to represent pseudo-Boolean functions, the approach is closely related to SCA-based methods, since polynomials can be seen as “flattened” *BMDs. (Whereas *BMDs can be more compact than polynomials due to factorization and sharing, the rather fine-granular and local optimization capabilities of polynomials make them more flexible to use and easier to optimize.) Unfortunately, Hamaguchi et al. observed exponential blow-ups of *BMDs in the backward construction and thus the approach did not provide an effective way for verifying large integer dividers.

However, this basic approach seems to be promising at first sight. As an example, Fig. 4 shows a high level view of a circuit for non-restoring division. Stage 1 implements a subtractor, stages j with \(j \in \{2, ..., n\}\) implement conditional adders / subtractors depending on the value of \(q_{n-j+1}\), and stage \(n+1\) implements an adder. If we start backward rewriting with the polynomial \(Q \cdot D + R - R^{(0)}\) (which is quadratic in n) and if backward rewriting processes the gates in the circuit in a way that the stages shown in Fig. 4 are processed one after the other, then we would expect the following polynomials on the corresponding cuts (see also Fig. 4):

Fig. 4
figure 4

High-level view of the non-restoring divider with subtractor SUB, conditional adder/subtractor CAS, and final adder ADD

At cut n which is obtained after processing stage \(n+1\), we would expect the polynomial \((\sum _{i=1}^{n-1} q_i 2^i + 2^0) \cdot D + R^{(n)} - R^{(0)}\). This is because stage \(n+1\) enforces \(R = R^{(n)} + (1 - q_0) \cdot D\). For \(j = n\) to 2 we would expect at cut \(j-1\) the polynomial \((\sum _{i=n-j+2}^{n-1} q_i 2^i + 2^{n-j+1}) \cdot D + R^{(j-1)} - R^{(0)}\). This can be shown by induction, since we arrive at cut \(j-1\) after processing stage j, and stage j enforces \(R^{(j)} = R^{(j-1)} - q_{n-j+1} (D \cdot 2^{n-j}) + (1-q_{n-j+1}) (D \cdot 2^{n-j}) = R^{(j-1)} + (1 - 2 q_{n-j+1}) (D \cdot 2^{n-j}).\) Finally, the polynomial at cut 0 after processing stage 1 would reduce to 0, since we are using the equation \(R^{(1)} = R^{(0)} - D \cdot 2^{n-1}\).
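The expected cut polynomials can be sanity-checked numerically on traces of the non-restoring algorithm: for a correct divider, each cut polynomial evaluated on the signal values of an actual division run must equal the evaluated specification polynomial, i.e., 0. The following illustrative Python snippet models the algorithm (not the circuit) and checks this for the cuts derived above:

```python
def non_restoring_trace(R0, D, n):
    """Quotient bits q and partial remainders [R^(1), ..., R^(n)]
    of non-restoring division (requires 0 <= R0 < D * 2^(n-1))."""
    q = [0] * n
    R = [R0 - (D << (n - 1))]                  # stage 1 yields R^(1)
    for j in range(2, n + 1):                  # stages 2, ..., n
        q[n - j + 1] = 1 if R[-1] >= 0 else 0
        R.append(R[-1] + (1 - 2 * q[n - j + 1]) * (D << (n - j)))
    q[0] = 1 if R[-1] >= 0 else 0
    return q, R

n, R0, D = 4, 13, 3
q, R = non_restoring_trace(R0, D, n)
# Expected polynomial at cut j-1, evaluated on the actual signal values;
# it must equal the (evaluated) specification polynomial, i.e., 0.
for j in range(n, 1, -1):
    head = sum(q[i] << i for i in range(n - j + 2, n)) + (1 << (n - j + 1))
    assert head * D + R[j - 2] - R0 == 0       # R[j - 2] is R^(j-1)
# Cut 0: stage 1 enforces R^(1) = R^(0) - D * 2^(n-1), so the polynomial reduces to 0.
assert R[0] == R0 - (D << (n - 1))
```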

Nevertheless, there are two obvious reasons why backward rewriting might still fail in practice:

  1. It could be the case that backward rewriting does not exactly hit the boundaries between the stages of the divider.

  2. There may be significant peaks in polynomial sizes in between the mentioned cuts.

4 Challenges for backward rewriting of dividers

Previous analyses from Scholl and Konrad [39] and Scholl et al. [40] have shown that, unfortunately, additional obstacles beyond the potential problems mentioned above are encountered in backward rewriting of dividers: In fact, with the usual optimizations in implementations of non-restoring (and restoring) dividers, the polynomials represented at the cuts between stages differ from those of this high-level derivation. The reason lies in the fact that the stages do not really implement signed addition/subtraction. The stages rather represent shortened versions of the corresponding adders/subtractors where the bit widths of signed numbers are reduced. Whereas the results of signed adders/subtractors which are correct for all possible input operands must have bit widths which are one bit longer than the bit widths of the operands, the bit widths of the results of signed adder/subtractor stages in non-restoring (and restoring) dividers do not need to be extended by one bit; they can even be shortened by one bit. The correctness of using the shortened versions follows from the input constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\) of the divider and from the way the results of the previous stages are computed by the division algorithm. However, this information is not present during backward rewriting. The required information has to be obtained by “forward propagation” of the constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\) through the divider circuit (in the direction from inputs to outputs).

4.1 Signed addition

To analyze the situation in more detail we first take a look at well-known textbook results regarding the addition of signed numbers and then derive consequences for the polynomials of signed adders.

We start with Lemma 2, which basically shows that signed addition can be reduced to unsigned addition if there is no overflow, i. e., if the result of the addition of two \((n+1)\)-bit signed binary numbers can be represented as an \((n+1)\)-bit signed binary number as well.

Lemma 2

Consider \(a = (a_{n}, \ldots , a_0), b = (b_{n}, \ldots , b_0) \in \{0, 1\}^{n+1}\), \(c \in \{0, 1\}\), \(s = (s_{n}, \ldots , s_0) \in \{0, 1\}^{n+1}\) with \(\langle c_n, s \rangle = \langle a \rangle + \langle b \rangle + c\). (For \(0 \le i \le n\), \(c_i\) is the carry bit from the unsigned addition of \((a_i, \ldots , a_0)\) and \((b_i, \ldots , b_0)\) with incoming carry c.) It holds that \([a]_2 + [b]_2 + c \in \{-2^n, \ldots , 2^n -1\}\) iff \(c_n = c_{n-1}\). If this is the case, then we have \([a]_2 + [b]_2 + c = [s]_2\).

Proof

By case distinction for the cases \(a_n = b_n = 0\), \(a_n = b_n = 1\) and \(a_n \ne b_n\). \(\square\)
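Instead of spelling out the case distinction, Lemma 2 can also be checked exhaustively for small bit widths (an illustrative Python check; integers encode the bit vectors, and the conversions follow the definitions of \(\langle \cdot \rangle\) and \([\cdot ]_2\) above):

```python
def check_lemma2(n):
    """Exhaustive check of Lemma 2 for (n+1)-bit operands."""
    for A in range(1 << (n + 1)):
        for B in range(1 << (n + 1)):
            for c in (0, 1):
                total = A + B + c                       # <a> + <b> + c
                s = total & ((1 << (n + 1)) - 1)        # sum bits s_n, ..., s_0
                cn = (total >> (n + 1)) & 1             # carry c_n
                cn1 = (((A & ((1 << n) - 1)) +          # carry c_{n-1} from the
                        (B & ((1 << n) - 1)) + c) >> n) & 1  # lower n bits
                sa = A - (((A >> n) & 1) << (n + 1))    # [a]_2 = <a> - a_n * 2^(n+1)
                sb = B - (((B >> n) & 1) << (n + 1))    # [b]_2
                ss = s - (((s >> n) & 1) << (n + 1))    # [s]_2
                in_range = -(1 << n) <= sa + sb + c < (1 << n)
                assert in_range == (cn == cn1)          # overflow test of Lemma 2
                if in_range:
                    assert sa + sb + c == ss            # signed = unsigned addition

check_lemma2(2)
```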

As for unsigned addition, overflows can be avoided by extending the result by an additional bit in an appropriate manner:

Lemma 3

Consider \(a = (a_{n}, \ldots , a_0), b = (b_{n}, \ldots , b_0) \in \{0, 1\}^{n+1}\), \(c \in \{0, 1\}\), \(s = (s_{n+1}, \ldots , s_0) \in \{0, 1\}^{n+2}\) with \({\langle }c_n, s_n, \ldots , s_0{\rangle } = {\langle }a{\rangle } + {\langle }b{\rangle } + c\) and

$$\begin{aligned} s_{n+1} = {\left\{ \begin{array}{ll} s_n, &{} \text { if } c_n = c_{n-1} \\ c_n, &{} \text { if } c_n \ne c_{n-1}. \\ \end{array}\right. } \end{aligned}$$

(For \(0 \le i \le n\), \(c_i\) is the carry bit from the unsigned addition of \((a_i, \ldots , a_0)\) and \((b_i, \ldots , b_0)\) with incoming carry c.) We have \([a]_2 + [b]_2 + c = [s_{n+1}, \ldots , s_0]_2\).

Proof

By case distinction for the cases \(a_n = b_n = 0\), \(a_n = b_n = 1\) and \(a_n \ne b_n\). \(\square\)

If we consider a signed adder with input operands \(a = (a_{n}, \ldots , a_0)\), \(b = (b_{n}, \ldots , b_0)\), input carry c as well as output \(s = (s_{n+1}, \ldots , s_0)\) and perform backward rewriting starting with the polynomial \(\sum _{i=0}^n s_i 2^i - s_{n+1} 2^{n+1}\), then by Lemma 3 we will finally obtain the polynomial \(\sum _{i=0}^{n-1} a_i 2^i - a_{n} 2^{n} + \sum _{i=0}^{n-1} b_i 2^i - b_{n} 2^{n} + c\).

Motivated by the fact that adder/subtractor stages in usual implementations of non-restoring and restoring dividers omit leading overflow bits (as we will further analyze in the following section), we now consider the result of backward rewriting starting with \(\sum _{i=0}^{n-1} s_i 2^i - s_{n} 2^{n}\) instead of \(\sum _{i=0}^n s_i 2^i - s_{n+1} 2^{n+1}\). I.e., we consider a circuit which just omits the overflow bit \(s_{n+1}\) of signed addition (as in Lemma 2 and in contrast to Lemma 3).

Lemma 4

The polynomial for the binary adder according to Lemma 2 (omitting the overflow bit) is

$$\begin{aligned} A_n = \left(\sum _{i=0}^{n-1} a_i 2^i - a_n 2^n \right) + \left(\sum _{i=0}^{n-1} b_i 2^i - b_n 2^n \right) + c - 2^{n+1} P_n \end{aligned}$$

where \(P_n = C_{n-1} \cdot (1 - a_n - b_n + 2 a_n b_n) - a_n b_n\) and \(C_{n-1}\) represents the polynomial for the carry bit \(c_{n-1}\) (expressed with input bits \(a_{n-1}, \ldots , a_0\), \(b_{n-1}, \ldots , b_0\), c). \(C_{n-1}\) contains \(\frac{1}{2} (3^{n+1} -1)\) terms and thus \(P_n\) contains \(2 \cdot 3^{n+1} - 1\) terms.

Proof

Starting from a correct signed addition with overflow handling according to Lemma 3 we have

$$\begin{aligned} \sum _{i=0}^n s_i 2^i - s_{n+1} 2^{n+1} = \left(\sum _{i=0}^{n-1} a_i 2^i - a_n 2^n \right)+ \left(\sum _{i=0}^{n-1} b_i 2^i - b_n 2^n \right) + c \end{aligned}$$

and thus

$$\begin{aligned} \left(\sum _{i=0}^{n-1} s_i 2^i \right) - s_n 2^n = \left(\sum _{i=0}^{n-1} a_i 2^i - a_n 2^n \right)+ \left(\sum _{i=0}^{n-1} b_i 2^i - b_n 2^n \right) + c - 2^{n+1} (s_{n} - s_{n+1}). \end{aligned}$$

\(P_n\) is the polynomial for \(s_{n} - s_{n+1}\). To compute \(P_n\), we first express \(s_n\) and \(s_{n+1}\) in terms of \(a_n\), \(b_n\), and \(c_{n-1}\). We obtain the following function table:

$$\begin{aligned} \begin{array}{ccc|cc} a_n &{} b_n &{} c_{n-1} &{} s_n &{} s_{n+1} \\ \hline 0 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 1 &{} 0 \\ 0 &{} 1 &{} 0 &{} 1 &{} 1 \\ 0 &{} 1 &{} 1 &{} 0 &{} 0 \\ 1 &{} 0 &{} 0 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 0 &{} 0 \\ 1 &{} 1 &{} 0 &{} 0 &{} 1 \\ 1 &{} 1 &{} 1 &{} 1 &{} 1 \\ \end{array} \end{aligned}$$

This leads to the following polynomials for \(s_n\) and \(s_{n+1}\):

$$\begin{aligned} s_n &= (1-a_n) (1- b_n) c_{n-1} + (1-a_n) b_n (1-c_{n-1}) + a_n (1- b_n) (1-c_{n-1}) + a_n b_n c_{n-1} \\ &= (c_{n-1} - a_n c_{n-1} - b_n c_{n-1} + a_n b_n c_{n-1}) + (b_n - a_n b_n - b_n c_{n-1} + a_n b_n c_{n-1}) \\ &\quad + (a_n - a_n b_n - a_n c_{n-1} + a_n b_n c_{n-1}) + a_n b_n c_{n-1} \\ &= a_n + b_n + c_{n-1} - 2 a_n b_n - 2 a_n c_{n-1} - 2 b_n c_{n-1} + 4 a_n b_n c_{n-1} \\ s_{n+1} &= (1-a_n) b_n (1-c_{n-1}) + a_n (1-b_n) (1-c_{n-1}) + a_n b_n (1-c_{n-1}) + a_n b_n c_{n-1} \\ &= (b_n - a_n b_n - b_n c_{n-1} + a_n b_n c_{n-1}) + (a_n - a_n b_n - a_n c_{n-1} + a_n b_n c_{n-1}) \\ &\quad + (a_n b_n - a_n b_n c_{n-1}) + a_n b_n c_{n-1} \\ &= a_n + b_n - a_n b_n - a_n c_{n-1} - b_n c_{n-1} + 2 a_n b_n c_{n-1} \end{aligned}$$

Thus, the polynomial \(P_n\) for \(s_{n} - s_{n+1}\) is

$$\begin{aligned} s_{n} - s_{n+1} &= c_{n-1} - a_n b_n - a_n c_{n-1} - b_n c_{n-1} + 2 a_n b_n c_{n-1} \\ &= c_{n-1} (1 - a_n - b_n + 2 a_n b_n) - a_n b_n. \end{aligned}$$

We can derive the polynomial \(C_{n-1}\) for \(c_{n-1}\), expressed with input bits \(a_{n-1}, \ldots , a_0\), \(b_{n-1}, \ldots , b_0\), c, as well as its size \(\sum _{i=0}^{n} 3^i\) by induction.

For \(n=1\) we have \(c_0 = a_0 b_0 + (a_0 \oplus b_0) c\) and thus for the polynomial \(C_0\) of \(c_0\) we have \(C_0 = a_0 b_0 + (a_0 + b_0 - 2 a_0 b_0) \cdot c = a_0 b_0 + a_0 c + b_0 c - 2 a_0 b_0 c\) with \(|C_0| = 4 = 3^0 + 3^1\).

With \(n > 1\) we have \(c_{n-1} = a_{n-1} b_{n-1} + (a_{n-1} \oplus b_{n-1}) c_{n-2}\) and thus for the polynomial \(C_{n-1}\) of \(c_{n-1}\) we have \(C_{n-1} = a_{n-1} b_{n-1} + (a_{n-1} + b_{n-1} - 2 a_{n-1} b_{n-1}) \cdot C_{n-2}\). I.e. for the size of \(C_{n-1}\) we have

$$\begin{aligned} |C_{n-1}| = 1 + 3 \cdot |C_{n-2}| = 1 + 3 \cdot \left( \sum _{i=0}^{n-1} 3^i \right) = \sum _{i=0}^{n} 3^i. \end{aligned}$$

With \(\sum _{i=0}^{n} 3^i = \frac{1}{2} (3^{n+1} -1)\) we obtain the result for \(C_{n-1}\) and with

$$\begin{aligned} P_n = C_{n-1} \cdot (1 - a_n - b_n + 2 a_n b_n) - a_n b_n \end{aligned}$$

we obtain \(|P_n| = 1 + 4 \cdot |C_{n-1}| = 2 \cdot 3^{n+1} - 1\). \(\square\)
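The size formulas for \(C_{n-1}\) and \(P_n\) can be reproduced by symbolic expansion for small n (an illustrative Python check; monomials are frozensets of variable names, so the reduction \(v^2 \rightarrow v\) is a set union):

```python
from itertools import product

def add(p, q):
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
        if r[m] == 0:
            del r[m]
    return r

def mul(p, q):
    r = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        m = m1 | m2                          # binary variables: v^2 = v
        r[m] = r.get(m, 0) + c1 * c2
        if r[m] == 0:
            del r[m]
    return r

def v(name):
    return {frozenset([name]): 1}

def xor(p, q):                               # p + q - 2pq
    return add(add(p, q), {m: -2 * c for m, c in mul(p, q).items()})

def carry_poly(n):
    """C_{n-1}: carry of the n-bit ripple addition with carry-in c."""
    C = add(mul(v('a0'), v('b0')), mul(xor(v('a0'), v('b0')), v('c')))
    for k in range(1, n):                    # c_k = a_k b_k + (a_k ^ b_k) c_{k-1}
        C = add(mul(v(f'a{k}'), v(f'b{k}')), mul(xor(v(f'a{k}'), v(f'b{k}')), C))
    return C

for n in range(1, 6):
    C = carry_poly(n)
    assert len(C) == (3 ** (n + 1) - 1) // 2          # |C_{n-1}| of Lemma 4
    an, bn = v(f'a{n}'), v(f'b{n}')
    # P_n = C_{n-1} * (1 - a_n - b_n + 2 a_n b_n) - a_n b_n
    factor = add({frozenset(): 1}, {m: -c for m, c in add(an, bn).items()})
    factor = add(factor, {m: 2 * c for m, c in mul(an, bn).items()})
    P = add(mul(C, factor), {m: -c for m, c in mul(an, bn).items()})
    assert len(P) == 2 * 3 ** (n + 1) - 1             # |P_n| of Lemma 4
```

The exponential growth is visible directly: no cancellation occurs, so each ripple stage triples the number of carry terms.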

4.2 Optimizing divider circuits wrt. bit widths of adder stages

In this section we will show how the bit widths of adder stages in dividers can be optimized and we will demonstrate the effect of such optimizations on the sizes of polynomials during divider verification by backward rewriting. In particular, we will prove that the polynomial at cut n of Fig. 4 has an exponential size.

4.2.1 Omitting overflow bits of adder stages

A first simple observation shows that the individual divider stages in Fig. 4 can in fact be optimized by omitting overflow bits as they are calculated in Lemma 3, since overflows cannot occur. This can simply be concluded from the fact that every stage computes the signed addition of a positive and a negative number according to Algorithm 2 (together with the constraint \(0 \le R^{(0)}\)). Signed addition of two numbers with different signs can never produce an overflow (see also Lemma 2).

4.2.2 Omitting further leading bits

Apart from omitting overflow bits of the adder/subtractor stages, implementations of dividers can even be further optimized. We show for non-restoring dividers according to Fig. 4 that it is even possible to omit the computation of one further most significant bit. This results from range restrictions on the partial remainders derived from the input constraint:

Lemma 5

For all partial remainders in non-restoring division with input constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\) we obtain \(-D \cdot 2^{n-j} \le R^{(j)} < D \cdot 2^{n-j}\).

Proof

By induction.

\(j=0\): \(-D \cdot 2^{n} \le R^{(0)} < D \cdot 2^{n}\) follows from \(0 \le R^{(0)}< D \cdot 2^{n-1} < D \cdot 2^{n}\).

\(j=1\): \(R^{(1)}\) is defined by \(R^{(1)} = R^{(0)} - D \cdot 2^{n-1}\). From \(0 \le R^{(0)} < D \cdot 2^{n-1}\) it follows \(- D \cdot 2^{n-1} \le R^{(0)} - D \cdot 2^{n-1} < 0\) and thus \(- D \cdot 2^{n-1} \le R^{(1)} < D \cdot 2^{n-1}\).

\((j-1) \rightarrow j\), \(2 \le j \le n\):

  • Case 1: \(R^{(j-1)} \ge 0\). By induction assumption and case condition we have \(0 \le R^{(j-1)} < D \cdot 2^{n-j+1}\) and thus \(-D \cdot 2^{n-j} \le R^{(j-1)} -D \cdot 2^{n-j} < D \cdot 2^{n-j}\). The claim follows from \(R^{(j)} = R^{(j-1)} -D \cdot 2^{n-j}\).

  • Case 2: \(R^{(j-1)} < 0\). By induction assumption and case condition we have \(-D \cdot 2^{n-j+1} \le R^{(j-1)} < 0\) and thus \(-D \cdot 2^{n-j} \le R^{(j-1)} + D \cdot 2^{n-j} < D \cdot 2^{n-j}\). The claim follows from \(R^{(j)} = R^{(j-1)} + D \cdot 2^{n-j}\).

\(\square\)
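The bounds of Lemma 5 can also be checked exhaustively for small n by running the recurrence of Algorithm 2 directly (an illustrative Python check, modeling the algorithm rather than the circuit):

```python
def check_lemma5(n):
    """Exhaustive check of Lemma 5 for small n: every partial remainder
    R^(j) satisfies -D * 2^(n-j) <= R^(j) < D * 2^(n-j)."""
    for D in range(1, 1 << (n - 1)):           # divisor with sign bit 0
        for R0 in range(D << (n - 1)):         # input constraint on the dividend
            R = R0 - (D << (n - 1))            # R^(1)
            assert -(D << (n - 1)) <= R < (D << (n - 1))
            for j in range(2, n + 1):
                # add if negative, subtract otherwise (non-restoring step)
                R += (D << (n - j)) if R < 0 else -(D << (n - j))
                assert -(D << (n - j)) <= R < (D << (n - j))

check_lemma5(4)
```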

Corollary 1

For \(1 \le j \le n\), \(R^{(j)}\) can be represented in two’s complement representation with \(2n-j\) bits.

Proof

With \(-D \cdot 2^{n-j} \le R^{(j)} < D \cdot 2^{n-j}\) and \(D \le 2^{n-1} - 1 \le 2^{n-1}\) we obtain \(- 2^{2n-j-1} \le R^{(j)} < 2^{2n-j-1}\). \(\square\)

From Corollary 1 we can conclude that \(R^{(j)}\) can be represented by \((r^{(j)}_{2n-j-1} \ldots r^{(j)}_0)\), so a circuit for non-restoring division does not need to compute any bits with indices higher than \(2n-j-1\). As an example, consider the optimized non-restoring divider implementation shown in Fig. 3 for \(n=4\). (For simplicity, we present the circuit before propagation of constants, which is, however, performed in the actually implemented circuit.) It can be seen from Fig. 3 that in the adder/subtractor stage computing \(R^{(j+1)}\), the leading bit \(r^{(j)}_{2n-j-1}\) of \(R^{(j)}\) does not need to be used as an input, since we know that the result \(R^{(j+1)}\) can be represented with one bit less, i.e., the bit \(r^{(j+1)}_{2n-j-1}\) resulting from addition at bit position \(2n-j-1\) would just be a sign extension of \((r^{(j+1)}_{2n-j-2} \ldots r^{(j+1)}_0)\).

Note, however, that we would actually still need the sign bit \(r^{(j)}_{2n-j-1}\), since its negation is the quotient bit \(q_{n-j}\). In Fig. 3 the quotient bits \(q_{n-j}\) are not computed as negations of the sign bits \(r^{(j)}_{2n-j-1}\); instead, they are taken as the outgoing carry bits of the corresponding adders whose most significant sum bits are \(r^{(j)}_{2n-j-1}\) (and the sum outputs \(r^{(j)}_{2n-j-1}\) are indeed omitted). A rather tedious but simple case distinction shows that this is a correct alternative to obtaining \(q_{n-j}\) by negating \(r^{(j)}_{2n-j-1}\). For completeness we show the correctness in the following Lemma 6:

Lemma 6

In an implementation of adder/subtractor stages as in Fig. 3 where \(q_{n-j}\) and \(r^{(j)}_{2n-j-1}\) are computed as the carry and sum outputs of the leftmost full adder in stage j, respectively, we have \(q_{n-j} = {\lnot r^{(j)}_{2n-j-1}}\) for all \(j \in \{1, ..., n\}\).

Proof

We prove the lemma by induction.

For \(j=1\) we use \(d_{n-1} = 0\) and \(r^{(0)}_{2n-2} = 0\). Thus the leftmost FA in the first stage (see Fig. 3) has 1 (the inverted \(d_{n-1}\), since the first stage subtracts) and 0 as sum inputs and some signal \(c^{(1)}_{2n-3}\) as carry input. This leads to sum output \(r^{(1)}_{2n-2} = {\lnot c^{(1)}_{2n-3}}\) and carry output \(q_{n-1} = c^{(1)}_{2n-3}\), i.e., \(q_{n-1} = {\lnot r^{(1)}_{2n-2}}\) holds.

Now consider some stage \(j \in \{2, ..., n\}\). By induction hypothesis we can assume that \(q_{n-j+1} = {\lnot r^{(j-1)}_{2n-j}}\). Since \(d_{n-1} = 0\), the sum inputs of the leftmost FA in stage j are \({\lnot r^{(j-1)}_{2n-j}}\) and \(r^{(j-1)}_{2n-j-1}\) and the carry input is some \(c^{(j)}_{2n-j-2}\).

In case \(r^{(j-1)}_{2n-j} = r^{(j-1)}_{2n-j-1}\) this leads to sum output \(r^{(j)}_{2n-j-1} = {\lnot c^{(j)}_{2n-j-2}}\) and carry output \(q_{n-j} = c^{(j)}_{2n-j-2}\), i.e., \(q_{n-j} = {\lnot r^{(j)}_{2n-j-1}}\).

In case \(r^{(j-1)}_{2n-j} = 1\) and \(r^{(j-1)}_{2n-j-1} = 0\), we conclude \(c^{(j)}_{2n-j-2} = 1\): if \(c^{(j)}_{2n-j-2}\) were 0, then \(R^{(j)} = R^{(j-1)} + D \cdot 2^{n-j}\) would be smaller than \(-2^{2n-j} + 2^{2n-j-1} = -2^{2n-j-1}\), i.e., it could not be represented with \(2n-j\) bits, contradicting Corollary 1. Thus, the sum inputs of the leftmost FA in stage j are in this case both 0 and the carry input is 1. Therefore the sum output \(r^{(j)}_{2n-j-1}\) is 1 and the carry output \(q_{n-j}\) is 0, i.e., \(q_{n-j} = {\lnot r^{(j)}_{2n-j-1}}\).

Similarly, in case \(r^{(j-1)}_{2n-j} = 0\) and \(r^{(j-1)}_{2n-j-1} = 1\), we conclude \(c^{(j)}_{2n-j-2} = 0\): if \(c^{(j)}_{2n-j-2}\) were 1, then \(R^{(j)} = R^{(j-1)} - D \cdot 2^{n-j}\) would be at least \(2^{2n-j-1}\), i.e., it could not be represented as a signed number with \(2n-j\) bits, again contradicting Corollary 1. Thus, the sum inputs of the leftmost FA in stage j are in this case both 1 and the carry input is 0. Therefore the sum output \(r^{(j)}_{2n-j-1}\) is 0 and the carry output \(q_{n-j}\) is 1, i.e., \(q_{n-j} = {\lnot r^{(j)}_{2n-j-1}}\). \(\square\)

It can also be seen from Fig. 3 that with this optimization the adder/subtractor stages have a fixed bit width of n, since due to the shifting of the divisor D some least significant bits of \(R^{(j)}\) do not have to be computed by addition. (If we omitted the optimization presented in this section, then the bit width of the adder/subtractor stages would increase from n to \(2n - 1\). If we added overflow bits in each stage, then the bit width would even increase from n to \(3n - 1\).)
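The combined effect of Corollary 1 and Lemma 6 can be replayed with a small bit-level simulation of the optimized adder/subtractor stages. The following Python sketch is our own reconstruction of the stage structure of Fig. 3 (a plain ripple-carry chain, with subtraction realized by inverting the divisor bits and injecting a carry-in); it checks both the width argument and \(q_{n-j} = \lnot r^{(j)}_{2n-j-1}\):

```python
def to_bits(x, w):
    """Two's complement bits of x, LSB first, width w."""
    return [(x >> i) & 1 for i in range(w)]

def from_bits(bits):
    """Interpret an LSB-first bit list as a two's complement number."""
    return sum(b << i for i, b in enumerate(bits[:-1])) - (bits[-1] << (len(bits) - 1))

def stage(R_prev, D, n, j):
    """Bit-level adder/subtractor stage j (1 <= j <= n) of width n.
    Returns (R^(j), carry-out q_{n-j}, leading sum bit r^(j)_{2n-j-1})."""
    rb = to_bits(R_prev, 2 * n - j + 1)   # R^(j-1) fits in 2n-j+1 bits
    db = to_bits(D, n)                    # d_{n-1} = 0 by assumption
    sub = 1 - rb[-1]                      # subtract iff R^(j-1) >= 0
    out = rb[:n - j]                      # low bits pass through unchanged
    c = sub                               # carry-in realizes the +1 of two's complement
    for i in range(n):                    # bit positions n-j .. 2n-j-1
        a, b = rb[n - j + i], db[i] ^ sub
        out.append(a ^ b ^ c)
        c = (a & b) | (a & c) | (b & c)
    return from_bits(out), c, out[-1]

def check(n):
    """Exhaustively check Lemma 6 (q = NOT sign) and the stage arithmetic."""
    for D in range(1, 2 ** (n - 1)):
        for R0 in range(D * 2 ** (n - 1)):
            R = R0
            for j in range(1, n + 1):
                R_new, q, sign = stage(R, D, n, j)
                assert q == 1 - sign      # Lemma 6
                assert R_new == (R - D * 2 ** (n - j) if R >= 0
                                 else R + D * 2 ** (n - j))
                R = R_new
    return True

assert check(4)
```

Note that the case \(j=1\) needs no special treatment here: \(R^{(0)} \ge 0\) forces the sign bit to 0, so the stage automatically subtracts, as required.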

4.3 Backward rewriting of the optimized non-restoring implementation

In this section we will show that the optimizations of bit widths of the adder / subtractor stages have a severe negative impact on the sizes of polynomials during backward rewriting, if we do not enhance backward rewriting by additional methods making use of “forward information propagation”. We will show by the following theorem that the size of the polynomial at cut n of Fig. 4 will then already be exponential. The proof essentially follows from the fact that polynomials for signed adders with omitted overflow handling have exponential sizes (see Lemma 4).

Theorem 1

Assume that backward rewriting is started with the specification polynomial \(Q \cdot D + R - R^{(0)}\) and applied to an optimized non-restoring divider implemented according to Fig. 3. Then the polynomial resulting at cut n (see Fig. 4) has size \(5^{n-1} + n^2 + 2n - 3\).

Proof

We start with the specification polynomial

$$\begin{aligned} (\sum _{i=0}^{n-1}q_i2^i) \cdot D + \left(\sum _{i=0}^{n-2}r_i 2^i - r_{n-1} 2^{n-1} \right) - R^{(0)} \end{aligned}$$

and perform backward rewriting for the last stage. For the moment we denote the outputs of the AND gates with inputs \(d_i\) by \(d_i'\). The adder in the final stage then sums up two signed numbers \((d'_{n-1}, \ldots , d'_0)\) and \((r^{(n)}_{n-1}, \ldots , r^{(n)}_0)\) into a signed number \((r_{n-1}, \ldots , r_0)\) and it omits the overflow bit. Therefore we can apply Lemma 4 and obtain at the cut before the AND gates the polynomial

$$\begin{aligned} \left(\sum _{i=0}^{n-1}q_i2^i \right) \cdot D + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) + \left(\sum _{i=0}^{n-2}d'_i2^i - d'_{n-1}2^{n-1} \right) - 2^{n}P'_{n-1} - R^{(0)} \end{aligned}$$

where \(P'_{n-1} = C'_{n-2}(1 - r^{(n)}_{n-1} - d'_{n-1} + 2 r^{(n)}_{n-1} d'_{n-1}) - r^{(n)}_{n-1} d'_{n-1}\). \(C'_{n-2}\) is the polynomial for the carry bit \(c_{n-2}\) of the adder in the last stage. According to the proof of Lemma 4 we have \(C'_{n-2} = r^{(n)}_{n-2} d'_{n-2} + (r^{(n)}_{n-2} + d'_{n-2} - 2 r^{(n)}_{n-2} d'_{n-2}) C'_{n-3}\) and \(C'_0 = r^{(n)}_{0}d'_0\) (since the carry-in of the last stage is 0). As in the proof of Lemma 4 we have \(|C'_0| = 1\) and \(|C'_{n-2}| = 1 + 3 \cdot |C'_{n-3}|\) for \(n > 2\). Thus \(|C'_{n-2}| = \sum _{i=0}^{n-2} 3^i = \frac{1}{2} (3^{n-1}-1)\). Now we process the AND gates and replace \(d'_i\) for \(i \in \{0, \ldots , n-2\}\) by \((1-q_0)d_i\). Since \(d_{n-1} = 0\) according to Def. 1, \(d'_{n-1}\) has to be replaced by 0. Altogether the replacement results in

$$\begin{aligned} \left(\sum _{i=1}^{n-1}q_i2^i + 1 \right) \cdot \left(\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) - 2^{n}P''_{n-1} - R^{(0)} \end{aligned}$$
(1)

with \(P''_{n-1} = C''_{n-2}(1-r^{(n)}_{n-1})\), \(C''_{n-2} = r^{(n)}_{n-2}(1-q_0)d_{n-2} + (r^{(n)}_{n-2} + (1-q_0)d_{n-2} - 2r^{(n)}_{n-2}(1-q_0)d_{n-2})C''_{n-3}\), and with \(C''_0 = r^{(n)}_{0}(1-q_0)d_0\).

Now we have \(|C''_0| = 2\) and \(|C''_{n-2}| = 2 + 5 \cdot |C''_{n-3}|\) for \(n > 2\). This results by induction in \(|C''_{n-2}| = 2 \cdot \sum _{i=0}^{n-2} 5^i = \frac{1}{2}(5^{n-1} - 1)\), leading to \(|P''_{n-1}| = 5^{n-1} - 1\).

Altogether, we obtain with Eq. (1) the size of the polynomial at cut n as \(5^{n-1} + n^2 + 2n - 3\). \(\square\)
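The size recurrences used in this proof can be cross-checked numerically. The following sketch only replays the counting argument (it does not construct the polynomials themselves):

```python
def size_Cp(n):
    """|C'_{n-2}| from |C'_0| = 1 and |C'_k| = 1 + 3 * |C'_{k-1}|."""
    s = 1
    for _ in range(n - 2):
        s = 1 + 3 * s
    return s

def size_Cpp(n):
    """|C''_{n-2}| from |C''_0| = 2 and |C''_k| = 2 + 5 * |C''_{k-1}|."""
    s = 2
    for _ in range(n - 2):
        s = 2 + 5 * s
    return s

# closed forms from the proof of Theorem 1
for n in range(2, 20):
    assert size_Cp(n) == (3 ** (n - 1) - 1) // 2
    assert size_Cpp(n) == (5 ** (n - 1) - 1) // 2
```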

Remark 1

Note that the exponential sized polynomial at cut n would collapse again, if we would use the fact that \(q_0 = {\lnot r^{(n)}_{n-1}}\). This also means that the polynomial at cut n would have a polynomial size in an (easier to verify) implementation that computes \(q_0\) as the negation of \({r^{(n)}_{n-1}}\) and includes this negation into the circuit part below cut n. This observation holds for cut n, but not for cut \(n-1\) where the polynomial has provably exponential size even if we make use of equations \(q_0 = {\lnot r^{(n)}_{n-1}}\) and \(q_1 = {\lnot r^{(n-1)}_{n}}\).

In the following we discuss this insight in more detail. Lemma 7 shows the (positive) result that computing the quotient bits as negations of the leading partial remainder bits avoids the exponentially sized polynomial at cut n. In contrast, Theorem 2 shows the (negative) result that this helps only for cut n, but not for cut \(n-1\).

Lemma 7

By replacing \(q_0\) with \({\lnot r^{(n)}_{n-1}}\) the exponential sized polynomial at cut n (see Fig. 4) of the divider according to Fig. 3 collapses to a polynomial with polynomial size of form \((\sum _{i=1}^{n-1}q_i2^i + 1) \cdot \left(\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) - R^{(0)}\) (which exactly corresponds to the form derived by the high level analysis in Sect. 3).

Proof

According to the proof of Theorem 1 the polynomial at cut n has the form

$$\begin{aligned} \left (\sum _{i=1}^{n-1}q_i2^i + 1 \right) \cdot \left (\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) - 2^{n}P''_{n-1} - R^{(0)} \end{aligned}$$

with \(P''_{n-1} = C''_{n-2}(1-r^{(n)}_{n-1})\), \(C''_{n-2} = r^{(n)}_{n-2}(1-q_0)d_{n-2} + (r^{(n)}_{n-2} + (1-q_0)d_{n-2} - 2r^{(n)}_{n-2}(1-q_0)d_{n-2})C''_{n-3}\), and with \(C''_0 = r^{(n)}_{0}(1-q_0)d_0\), i.e.,

$$\begin{aligned} P''_{n-1} = (1-r^{(n)}_{n-1}) \cdot [r^{(n)}_{n-2}(1-q_0)d_{n-2} + (r^{(n)}_{n-2} + (1-q_0)d_{n-2} - 2r^{(n)}_{n-2}(1-q_0)d_{n-2})C''_{n-3}]. \end{aligned}$$

Using \(q_0 = {\lnot r^{(n)}_{n-1}}\) means replacing \(q_0\) by \(1 - r^{(n)}_{n-1}\) in \(P''_{n-1}\). With \(r^{(n)}_{n-1} \cdot (1 - r^{(n)}_{n-1}) = 0\) this leads to the simplified polynomial

$$\begin{aligned} P'''_{n-1} = (1-r^{(n)}_{n-1}) \cdot r^{(n)}_{n-2} C''_{n-3}. \end{aligned}$$

Continuing the replacements until \(C''_0\) we obtain

$$\begin{aligned} P'''_{n-1} = (1-r^{(n)}_{n-1}) \cdot r^{(n)}_{n-2} \cdot \ldots \cdot r^{(n)}_{1} C''_{0} \end{aligned}$$

and with \(C''_0 = r^{(n)}_{0}(1-q_0)d_0\) we finally obtain

$$\begin{aligned} P'''_{n-1} = (1-r^{(n)}_{n-1}) \cdot r^{(n)}_{n-2} \cdot \ldots \cdot r^{(n)}_{0}r^{(n)}_{n-1} d_0 = 0. \end{aligned}$$

Altogether, by replacing \(q_0\) with \({\lnot r^{(n)}_{n-1}}\) we obtain at cut n the polynomial

$$\begin{aligned} \left(\sum _{i=1}^{n-1}q_i2^i + 1 \right) \cdot \left(\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) - R^{(0)} \end{aligned}$$

which apparently has polynomial size and which exactly corresponds to the form derived by the high level analysis in Sect. 3, see also Fig. 4. \(\square\)
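The cancellation behind Lemma 7 can be replayed with a tiny multilinear-polynomial calculator: monomials are variable sets, so \(x^2 = x\) for 0/1-valued variables is enforced automatically by the set union, and with it \(r^{(n)}_{n-1}(1 - r^{(n)}_{n-1}) = 0\). This is our own illustrative sketch, not the rewriting engine from [40]:

```python
def padd(p, q):
    """Add two polynomials given as {frozenset_of_vars: coeff}."""
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
        if r[m] == 0:
            del r[m]
    return r

def pmul(p, q):
    """Multiply two polynomials; x*x = x is realized by the set union."""
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = m1 | m2
            r[m] = r.get(m, 0) + c1 * c2
            if r[m] == 0:
                del r[m]
    return r

def var(x):
    return {frozenset([x]): 1}

def const(c):
    return {frozenset(): c} if c else {}

def Ppp(n, one_minus_q0):
    """P''_{n-1} = C''_{n-2} * (1 - r_{n-1}), with (1 - q0) passed in
    either symbolically or already substituted (variables r0.., d0..)."""
    C = pmul(pmul(var("r0"), one_minus_q0), var("d0"))           # C''_0
    for k in range(1, n - 1):
        t = pmul(pmul(var(f"r{k}"), one_minus_q0), var(f"d{k}"))
        f = padd(padd(var(f"r{k}"), pmul(one_minus_q0, var(f"d{k}"))),
                 pmul(const(-2), t))
        C = padd(t, pmul(f, C))                                  # C''_k
    return pmul(C, padd(const(1), pmul(const(-1), var(f"r{n-1}"))))

n = 5
generic = Ppp(n, padd(const(1), pmul(const(-1), var("q0"))))     # q0 left free
collapsed = Ppp(n, var(f"r{n-1}"))                               # q0 := 1 - r_{n-1}
assert collapsed == {}    # P'''_{n-1} = 0, as shown in Lemma 7
assert generic != {}
```

After the substitution every monomial of \(C''_{n-2}\) contains \(r_{n-1}\), so multiplying by \(1 - r_{n-1}\) cancels everything, exactly as in the proof.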

With the results of Lemma 7 we immediately see that the polynomial at cut n in a circuit for non-restoring division according to Fig. 5 (where the quotient bits \(q_{n-j}\) are computed by negation of \(r^{(j)}_{2n-j-1}\) for all \(j \in \{1, ..., n\}\) and the negation is included in cut n) has polynomial size and is identical to the polynomial derived by the high level analysis in Sect. 3.

Fig. 5
figure 5

Non-restoring divider with modified computation of quotient bits, \(n=4\)

Now one could conjecture that the problems with polynomials of exponential sizes are just due to the computation of quotient bits as in Fig. 3. Considering cut \(n-1\) in Fig. 5 shows that this is not the case.

Theorem 2

Assume that backward rewriting is started with the specification polynomial \(Q \cdot D + R - R^{(0)}\) and applied to an optimized non-restoring divider implemented according to Fig. 5. Then the polynomial resulting at cut \(n-1\) has a size in \(O(8^n)\).

Proof

We already know from Lemma 7 that the polynomial at cut n of a divider according to Fig. 5 has the form

$$\begin{aligned} \left(\sum _{i=1}^{n-1}q_i2^i + 1 \right) \cdot \left(\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-2}r^{(n)}_i2^i - r^{(n)}_{n-1}2^{n-1} \right) - R^{(0)}. \end{aligned}$$

Now consider the row in the divider array (between cut n and cut \(n-1\)) that computes \(R^{(n)}\) from \(R^{(n-1)}\). For the moment we denote the outputs of the EXOR gates with inputs \(d_i\) by \(d_i'\) and the incoming carry of the adder in the second last stage by \(c^{(n)}_{-1}\). The adder in the second last stage then sums up two signed numbers \((d'_{n-1}, \ldots , d'_0)\) and \((r^{(n-1)}_{n-1}, \ldots , r^{(n-1)}_0)\) with incoming carry \(c^{(n)}_{-1}\) into a signed number \((r^{(n)}_{n-1}, \ldots , r^{(n)}_0)\) and it omits the overflow bit. Therefore we can apply Lemma 4 and obtain at the cut before the EXOR gates the polynomial

$$\begin{aligned} \left(\sum _{i=1}^{n-1}q_i2^i + 1 \right) \cdot \left (\sum _{i=0}^{n-2}d_i2^i \right) + \left (\sum _{i=0}^{n-2} r^{(n-1)}_i 2^i - r^{(n-1)}_{n-1} 2^{n-1} \right) + \left (\sum _{i=0}^{n-2} d'_i 2^i - d'_{n-1} 2^{n-1} \right) + c^{(n)}_{-1} \\ - 2^{n} P_{n-1} - R^{(0)} \end{aligned}$$

where \(P_{n-1} = C_{n-2}(1 - r^{(n-1)}_{n-1} - d'_{n-1} + 2 r^{(n-1)}_{n-1} d'_{n-1}) - r^{(n-1)}_{n-1} d'_{n-1}\). \(C_{n-2}\) is the polynomial for the carry bit \(c^{(n)}_{n-2}\) of the adder in the second last stage. According to the proof of Lemma 4 we have \(C_{n-2} = r^{(n-1)}_{n-2} d'_{n-2} + (r^{(n-1)}_{n-2} + d'_{n-2} - 2 r^{(n-1)}_{n-2} d'_{n-2}) C_{n-3}\) and \(C_0 = r^{(n-1)}_{0} d'_0 + r^{(n-1)}_{0} c^{(n)}_{-1} + d'_0 c^{(n)}_{-1} - 2 r^{(n-1)}_{0} d'_0 c^{(n)}_{-1}\).

Now we process the EXOR gates as well as the inverter computing \(q_1\): we replace \(d'_i\) by \(q_1 \oplus d_i = {\lnot r^{(n-1)}_{n}} \oplus d_i\) for \(i \in \{0, \ldots , n-2\}\), we replace \(d'_{n-1}\) by \({\lnot r^{(n-1)}_{n}}\) because of \(d_{n-1} = 0\), and we replace \(c^{(n)}_{-1}\) by \(q_1 = {\lnot r^{(n-1)}_{n}}\).

By this \((\sum _{i=0}^{n-2} d'_i 2^i - d'_{n-1} 2^{n-1}) + c^{(n)}_{-1}\) is replaced by

  • \(\sum _{i=0}^{n-2} d_i 2^i\), if \(r^{(n-1)}_{n} = 1\), and by

  • \(- \sum _{i=0}^{n-2} d_i 2^i\), if \(r^{(n-1)}_{n} = 0\).

That is, \((\sum _{i=0}^{n-2} d'_i 2^i - d'_{n-1} 2^{n-1}) + c^{(n)}_{-1}\) is replaced by \((2 r^{(n-1)}_{n} - 1) \sum _{i=0}^{n-2} d_i 2^i\).

Moreover, this replacement transforms \(P_{n-1}\) into \(\tilde{P}_{n-1}\) with

$$\begin{aligned} \tilde{P}_{n-1} = \tilde{C}_{n-2}(r^{(n-1)}_{n-1} + r^{(n-1)}_{n} - 2 r^{(n-1)}_{n-1} r^{(n-1)}_{n}) - r^{(n-1)}_{n-1} + r^{(n-1)}_{n-1} r^{(n-1)}_{n} \end{aligned}$$

such that the overall polynomial at cut \(n-1\) of Fig. 5 is

$$\begin{aligned}& \left(\sum _{i=1}^{n-1}q_i2^i + 2 r^{(n-1)}_{n} \right) \cdot \left (\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-1} r^{(n-1)}_i 2^i \right) \\ {}&- 2^{n} r^{(n-1)}_{n-1} r^{(n-1)}_{n} - 2^{n} \tilde{C}_{n-2}(r^{(n-1)}_{n-1} + r^{(n-1)}_{n} - 2 r^{(n-1)}_{n-1} r^{(n-1)}_{n}) - R^{(0)}. \end{aligned}$$

Considering the replacement of \(q_1\) with \((1 - r^{(n-1)}_{n})\) this simplifies to

$$\begin{aligned}& \left(\sum _{i=2}^{n-1}q_i2^i + 2 \right) \cdot \left(\sum _{i=0}^{n-2}d_i2^i \right) + \left(\sum _{i=0}^{n-1} r^{(n-1)}_i 2^i \right) \nonumber \\ {}&- 2^{n} r^{(n-1)}_{n-1} r^{(n-1)}_{n} - 2^{n} \tilde{C}_{n-2}(r^{(n-1)}_{n-1} + r^{(n-1)}_{n} - 2 r^{(n-1)}_{n-1} r^{(n-1)}_{n}) - R^{(0)}. \end{aligned}$$
(2)

Here \(\tilde{C}_{n-2}\) is the polynomial resulting from \(C_{n-2}\) by the replacement above, i.e.,

$$\begin{aligned} \tilde{C}_{n-2} =&(r^{(n-1)}_{n-2} - d_{n-2} r^{(n-1)}_{n-2} - r^{(n-1)}_{n-2} r^{(n-1)}_{n} + 2 d_{n-2} r^{(n-1)}_{n-2} r^{(n-1)}_{n}) \\&+ \tilde{C}_{n-3}(1 - d_{n-2} - r^{(n-1)}_{n-2} - r^{(n-1)}_{n} + 2 d_{n-2} r^{(n-1)}_{n-2} + 2 d_{n-2} r^{(n-1)}_{n} \\&+ 2 r^{(n-1)}_{n-2} r^{(n-1)}_{n} - 4 d_{n-2} r^{(n-1)}_{n-2} r^{(n-1)}_{n}) \end{aligned}$$

and

$$\begin{aligned} \tilde{C}_{0} = 1 - d_0 - r^{(n-1)}_{n} + d_0 r^{(n-1)}_0 + d_0 r^{(n-1)}_{n}. \end{aligned}$$

The size of \(\tilde{C}_{n-2}\) is again derived by induction: We have \(|\tilde{C}_{0}| = 5\) and \(|\tilde{C}_{n-2}| = 4 + 8 |\tilde{C}_{n-3}|\), i.e., \(|\tilde{C}_{n-2}| = 4 \sum _{i=0}^{n-3} 8^i + 5 \cdot 8^{n-2} = 5\frac{4}{7} \cdot 8^{n-2} - \frac{4}{7}\).

The polynomial at cut \(n-1\) of Fig. 5 results by replacing \(\tilde{C}_{n-2}\) in Eq. (2) by its polynomial representation of size \(5\frac{4}{7} \cdot 8^{n-2} - \frac{4}{7}\) and by combining terms with shared monomials. It is clear that the resulting polynomial has a size in \(O(8^n)\). \(\square\)
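The recurrence for \(|\tilde{C}_{n-2}|\) and its closed form \(5\frac{4}{7} \cdot 8^{n-2} - \frac{4}{7} = \frac{39 \cdot 8^{n-2} - 4}{7}\) can again be cross-checked numerically (multiplying by 7 to stay in integer arithmetic):

```python
def size_tilde_C(n):
    """|~C_{n-2}| from |~C_0| = 5 and |~C_k| = 4 + 8 * |~C_{k-1}|."""
    s = 5
    for _ in range(n - 2):
        s = 4 + 8 * s
    return s

# closed form from the proof of Theorem 2, scaled by 7
for n in range(2, 20):
    assert 7 * size_tilde_C(n) == 39 * 8 ** (n - 2) - 4
```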

In summary, it is important to note that

  1. the stages in Fig. 3 cannot be seen as real adder/subtractor stages as shown in the high-level view from Fig. 4,

  2. backward rewriting leads to polynomials at the cuts which are different from the ones shown in Fig. 4, and

  3. unfortunately, those polynomials have provably exponential sizes.

The conclusion drawn in [40] was that verification of (large) dividers using backward rewriting is infeasible, if there is no means to make use of “forward information” obtained by propagating the input constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\) in forward direction through the circuit.

5 Analysis of existing approach

In this section we motivate our approach by analyzing weaknesses of our methods from [39, 40]. Before giving a detailed analysis of those methods we provide a brief review of [39, 40].

5.1 Brief review of [39, 40]

Our algorithms from [39, 40] start with a gate level netlist and first perform atomic block detection [26]. Based on the computation of k-input cuts [32] in a gate level netlist, atomic block detection searches for occurrences of pre-defined atomic blocks in the gate level netlist. In our case we search for occurrences of full adder and half adder implementations. This finally results in a circuit with non-trivial atomic blocks (i.e., full adders and half adders) and trivial atomic blocks (original gates not included in non-trivial atomic blocks). The method computes a topological order \(\prec _{top}\) on the atomic blocks with heuristics from [25, 26] and computes satisfiability don’t cares [37] at the input signals of the atomic blocks. Satisfiability don’t cares at the inputs of an atomic block describe value combinations which cannot be produced at those inputs by allowed assignments to primary inputs. Remember that for the dividers we are only allowed to use assignments to the primary inputs which satisfy the input constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\). Computing satisfiability don’t cares can be seen as a form of “forward information propagation” from the inputs towards the outputs (i.e. in the opposite direction compared to backward rewriting starting from the outputs of the circuit). We compute satisfiability don’t cares for atomic blocks based on a series of BDD-based image computations [13]. (We omit a review of this computation here, since we will describe a similar computation for so-called extended atomic blocks (EABs) in Sect. 6.1.) In the literature satisfiability don’t cares are typically used for logic optimization of the netlist. In contrast, we are using satisfiability don’t cares for keeping the sizes of polynomials small during backward rewriting. Our approach from [40] performs backward rewriting starting with the specification polynomial \(Q \cdot D + R - R^{(0)}\) by replacing atomic blocks in reverse topological order. 
During backward rewriting two optimization methods are used, if they are needed to keep polynomial sizes small:

The first method uses information on equivalent and antivalent signals which is derived by SAT Based Information Forwarding (SBIF) using the input constraint and the don’t cares at the inputs of atomic blocks. SBIF uses simulation and restricted window-based SAT-solving [39, 40] to (approximately) compute classes of equivalent/antivalent signals in the circuit. (In order to improve the approximation quality the SAT problems are enhanced by the information that signal combinations corresponding to satisfiability don’t cares cannot occur.) During backward rewriting the signals in each class are then replaced by the unique signal in the class that is minimal w.r.t. the topological order \(\prec _{top}\) (or by its negation, depending on whether there is an equivalence or an antivalence).

The second optimization method optimizes polynomials during backward rewriting modulo don’t cares by reducing the problem to Integer Linear Programming (ILP). If certain valuations for the inputs of a polynomial \(P(x_1, \ldots , x_k)\) representing a pseudo-boolean function \(f(x_1, \ldots , x_k) : \{0, 1\}^k \rightarrow \mathbb {Z}\) cannot occur, since they are satisfiability don’t cares, then we can modify the pseudo-boolean function for those valuations. Our goal is to find an assignment to the don’t cares that minimizes the size of \(P(x_1, \ldots , x_k)\). For a polynomial \(P(x_1, \ldots , x_k)\) with don’t care cubes \(dc_1, \ldots , dc_m\) the method consists of the following steps:

  • Introduce a new integer variable \(v_i\) for each don’t care cube \(dc_i\).

  • Add “\(v_i \cdot dc_i\)” to P for all \(1 \le i \le m\).

  • Multiply out, combine terms with the same monomial, etc.

  • Use Integer Linear Programming to minimize the size of P.

Example 1

([40]) As an example we choose \(P(x_1, x_2, x_3) = 1 - x_1 - x_2 - x_3 + 2x_1x_2 + 2 x_1x_3 + 2x_2x_3 - 4x_1x_2x_3\) with don’t care cubes \(\lnot x_1 x_2 x_3\) and \(x_1 \lnot x_2 \lnot x_3\). For \(\lnot x_1 x_2 x_3\) we choose the integer variable \(v_1\), for \(x_1 \lnot x_2 \lnot x_3\) we choose \(v_2\). Since \(v_1 (1 - x_1) x_2 x_3\) and \(v_2 x_1 (1 - x_2) (1 - x_3)\) reduce to 0 for all care vectors, we can add them to \(P(x_1, x_2, x_3)\) without changing the polynomial within the care set. By multiplying out and combining terms with the same monomial we arrive at the polynomial \(1 + (v_2 - 1) x_1 - x_2 - x_3 + (2-v_2) x_1x_2 + (2-v_2) x_1x_3 + (2+v_1) x_2x_3 + (v_2 - v_1 -4) x_1x_2x_3.\) Minimizing the number of terms in the polynomial means choosing integer values for \(v_1\) and \(v_2\) such that the coefficient of a maximum number of terms is 0 (thus eliminating a maximum number of terms). Thus, in the equation system in Fig. 6 we try to satisfy a maximum number of equations. It is easy to see that the optimal solution is \(v_1 = -2\), \(v_2 = 2\), leading to the polynomial \(1 + x_1 - x_2 - x_3\).

Fig. 6
figure 6

Equation system for minimization. The goal is to satisfy a maximum number of equations

Satisfying a maximum number of linear integer equations can be reduced to integer linear programming by standard methods (replacing each equation \(\ell _i(x_1, \ldots , x_n) = 0\) by \(\ell _i(x_1, \ldots , x_n) \le M d_i\) and \(-\ell _i(x_1, \ldots , x_n) \le M d_i\) with a sufficiently large constant M and a binary deactivation variable \(d_i\) for each equation, then minimizing the number of deactivation variables assigned to 1).
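For the small system of Fig. 6, the ILP step can be mimicked by a brute-force search over a small integer range. The following sketch (our own illustration; the method of [40] uses a real ILP solver) recovers the optimal assignment of Example 1:

```python
from itertools import product

def term_coeffs(v1, v2):
    """Coefficients of P + v1*(1-x1)*x2*x3 + v2*x1*(1-x2)*(1-x3) after
    collecting terms (monomials 1, x1, x2, x3, x1x2, x1x3, x2x3, x1x2x3)."""
    return [1, v2 - 1, -1, -1, 2 - v2, 2 - v2, 2 + v1, v2 - v1 - 4]

def best_assignment(bound=5):
    """Brute-force stand-in for the ILP solver: minimize the number of
    nonzero coefficients over a small search window for (v1, v2)."""
    return min(product(range(-bound, bound + 1), repeat=2),
               key=lambda vv: sum(c != 0 for c in term_coeffs(*vv)))

v1, v2 = best_assignment()
assert (v1, v2) == (-2, 2)
assert sum(c != 0 for c in term_coeffs(v1, v2)) == 4   # 1 + x1 - x2 - x3
```

The nonzero count is \(3 + (5 - \#\text{satisfied equations})\), since the constant and the \(x_2\), \(x_3\) coefficients never vanish; the minimum 4 is attained exactly at \((v_1, v_2) = (-2, 2)\).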

5.2 Insufficient don’t care conditions

Let us start by considering stage \(n+1\) of the non-restoring divider (see Figs. 4 and 3). Analyzing the method from [40] applied to optimized n-bit non-restoring dividers, we can observe that it does not make use of don’t cares at the inputs of atomic blocks corresponding to stage \(n+1\) (although there exist some don’t cares), but it makes use of the (only existing) antivalence of \(q_0\) and \(r_{n-1}^{(n)}\), which is shown by SAT-solving taking already proven satisfiability don’t cares into account (as described above). If we only consider the circuit of stage \(n+1\) (i.e., the circuit below the dashed line in Fig. 3), replace \(q_0\) by \({\lnot r_{n-1}^{(n)}}\) (i.e., if we make use of the mentioned antivalence), and start backward rewriting with \((\sum_{i=0}^{n-1} q_i 2^i) \cdot (\sum_{i=0}^{n-1} d_i 2^i) + (\sum_{i=0}^{n-2} r_i 2^i - r_{n-1} 2^{n-1}) - (\sum_{i=0}^{2n-2} r^{(0)}_i 2^i)\), then we indeed obtain exactly the polynomial \((\sum_{i=1}^{n-1} q_i 2^i + 2^0) \cdot D + R^{(n)} - R^{(0)}\) as shown in Fig. 4, cut n (see also the theoretical analysis from Lemma 7). Figure 7 shows the size of the final polynomial after backward rewriting of stage \(n+1\) with increasing bit width n, with and without using the antivalence \(q_0 = {\lnot r_{n-1}^{(n)}}\). Figure 7 clearly shows that it is essential to make use of this antivalence.

Fig. 7
figure 7

Polynomial sizes, stage \(n+1\), optimized non-restoring divider

Now we consider another version of the non-restoring divider which is slightly further optimized. It is clear that in a correct divider the final remainder is non-negative, i.e., \(r_{n-1} = 0\). Therefore there is actually no need to compute \(r_{n-1}\), which is known in advance, and the full adder shown in gray in Fig. 3 can be omitted. Thus in the optimized divider the remainder is not represented as a signed number anymore, but as an unsigned number, and the verification condition (vc1) is then replaced by \(R^{(0)} = Q \cdot D + \sum_{i=0}^{n-2} r_i 2^i\). Whereas in the original circuit making use of antivalences was essential for keeping the polynomial sizes small, in stage \(n+1\) of the further optimized version there are neither equivalent nor antivalent signals anymore. The only don’t cares in the last stage (after constant propagation) are two value combinations at the inputs of the now leading full adder. However, making use of those don’t cares does not help in avoiding an exponential blow-up, as Fig. 8 shows. Intuitively it is not really surprising that removing the full adder shown in gray potentially makes the verification problem harder, since the partial remainders \(R, R^{(n)}, \ldots , R^{(1)}\) in the high-level analysis of polynomials at cuts (see Fig. 4) represent signed numbers, but now R does not have a sign bit anymore.

Algorithm 3
figure c

Backward rewriting with backtracking.

Fig. 8
figure 8

Polynomial sizes, stage \(n+1\), further optimized non-restoring divider

Nevertheless, this raises the question whether the derivation of don’t care conditions may be improved in a way that don’t care optimization can avoid exponential blow ups like the one shown in Fig. 8.

5.3 Don’t care optimization with backtracking

The method from [40] does not make use of don’t care optimizations immediately, but stores a backtrack point after backward rewriting was applied to an atomic block which has don’t cares at its inputs or has input signals with equivalent / antivalent signals. Whenever the polynomial grows too much, the method backtracks to a previously stored backtrack point and performs an optimization. Algorithm 3 shows a simplified overview of the approach. For ease of exposition we omitted handling of equivalences/antivalences here.

At the end of the algorithm we perform an additional check as shown in Algorithm 4. In contrast to the usual assumption in SCA-based verification, where the implementation is considered correct iff backward rewriting reduces the specification polynomial to 0, in optimized non-restoring dividers as in Fig. 3 a resulting polynomial different from 0 does not necessarily indicate an error. The divider only has to be correct for input combinations satisfying the input constraint \(IC = 0 \le R^{(0)} < D \cdot 2^{n-1}\), i.e., the divider is correct iff the final polynomial evaluates to 0 for all inputs satisfying \(IC\). Of course, the number of inputs satisfying the input constraint is exponential, and thus it is not advisable to evaluate the polynomial for all admissible input combinations. Fortunately, in the special case of a divider the input constraint \(IC\) has a form that allows a decomposition into a linear number of cases to be evaluated, following the idea of bitwise comparing \(R^{(0)}\) and \(D \cdot 2^{n-1}\) starting with the most significant bit, see Algorithm 4. (Here the notation \(SP_0|_{d_i=1, r^{(0)}_{i+n-1}=0}\) just means that we replace in the polynomial \(SP_0\) the variables \(d_i\) and \(r^{(0)}_{i+n-1}\) by constants 1 and 0, respectively, and simplify the polynomial afterwards by the usual normalization rules.)

Algorithm 4
figure d

Evaluate (\(SP_0\)).
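The decomposition idea can be illustrated as follows: writing \(R^{(0)} = H \cdot 2^{n-1} + L\) with \(0 \le L < 2^{n-1}\), where \(H\) consists of the upper bits \(r^{(0)}_{2n-2} \ldots r^{(0)}_{n-1}\), the constraint \(R^{(0)} < D \cdot 2^{n-1}\) is equivalent to \(H < D\), which splits into the n disjoint cases “d and H agree above position i, \(d_i = 1\), \(r^{(0)}_{i+n-1} = 0\)”. The following sketch (our own encoding of this case split, not the literal Algorithm 4) checks the equivalence exhaustively:

```python
def in_constraint(R0, D, n):
    return 0 <= R0 < D * 2 ** (n - 1)

def case_split(R0, D, n):
    """True iff one of the n cases 'd and H agree above position i,
    d_i = 1, h_i = 0' applies, with H = upper n bits of R^(0)."""
    H = R0 >> (n - 1)
    return any((D >> (i + 1)) == (H >> (i + 1))   # equal above position i
               and (D >> i) & 1 == 1              # d_i = 1
               and (H >> i) & 1 == 0              # r^(0)_{i+n-1} = 0
               for i in range(n))

def split_matches_constraint(n):
    return all(case_split(R0, D, n) == in_constraint(R0, D, n)
               for D in range(2 ** (n - 1))        # d_{n-1} = 0
               for R0 in range(2 ** (2 * n - 1)))  # all (2n-1)-bit values

assert split_matches_constraint(4)
```

Each case fixes \(d_i\) and \(r^{(0)}_{i+n-1}\) to constants, which is exactly the substitution denoted by \(SP_0|_{d_i=1, r^{(0)}_{i+n-1}=0}\) in Algorithm 4.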

As shown in [40], the approach works surprisingly well, at least for non-restoring dividers without the small optimization introduced in Sect. 5.2. It tries to restrict don’t care optimizations (which were illustrated in Sect. 5.1) to situations where they are really needed. Backtracking and don’t care optimization only come into play if the size threshold in line 5 of Algorithm 3 is exceeded. A further analysis shows that the success of the approach in [40] is partly due to the following reasons:

  1. In the non-restoring dividers used as benchmarks, the number of atomic blocks that have any satisfiability don’t cares grows only linearly with the bit width.

  2. Only a linear number of backtracking steps is needed.

  3. On the other hand, if backtracking steps have to be used, don’t care assignments have an essential effect in keeping the polynomials small (the size of the polynomials is quadratic in n, just like that of the specification polynomial we start with).

Let us now consider a very simple example which does not have the mentioned characteristics.

Example 2

Consider a circuit which contains (among others) \(2n + 1\) atomic blocks \(a_0, \ldots , a_{2n}\). The subcircuit containing those \(2n + 1\) atomic blocks is shown in Fig. 9. Those blocks are the last atomic blocks in the topological order and \(a_{2n} \prec _{top} \ldots \prec _{top} a_0\). The initial polynomial is \(SP^{ init } = 8a + 4b + 2c + i_0\). The atomic block \(a_0\) has inputs \(x_1, i_1\), output \(i_0\), defines the function \(i_0 = x_1 \vee i_1 = x_1 + i_1 - x_1 i_1\), and we assume that it has the satisfiability don’t care \((x_1, i_1) = (0, 0)\). Correspondingly, for \(j=1, \ldots , n\), \(a_j\) defines \(i_j = x_{j+1} i_{j+1}\) with assumed satisfiability don’t care \((x_{j+1}, i_{j+1}) = (0, 0)\), and for \(j=n+1, \ldots , 2n\), \(a_j\) defines \(i_j = x_{j+1} \vee i_{j+1} = x_{j+1} + i_{j+1} - x_{j+1} i_{j+1}\). We compute \(\text{ size }(p)\) as the number of terms in the polynomial p and assume \(threshold = 1.5\) in line 5 of Algorithm 3. Then Algorithm 3 computes the following series of polynomials

$$\begin{aligned} SP_{m}&= 8a + 4b + 2c + i_0 \\ SP_{m-1}&= 8a + 4b + 2c + x_1 + i_1 - x_1 i_1 \\ SP_{m-2}&= 8a + 4b + 2c + x_1 + x_2 i_2 - x_1 x_2 i_2 \\&\ \ \vdots \\ SP_{m-n-1}&= 8a + 4b + 2c + x_1 + x_2 \ldots x_{n+1} i_{n+1} - x_1 x_2 \ldots x_{n+1} i_{n+1} \\ SP_{m-n-2}&= 8a + 4b + 2c + x_1 + x_2 \ldots x_{n+2} + x_2 \ldots x_{n+1} i_{n+2} - x_2 \ldots x_{n+2} i_{n+2} \\&\quad - x_1 \ldots x_{n+2} - x_1 \ldots x_{n+1} i_{n+2} + x_1 \ldots x_{n+2} i_{n+2} \end{aligned}$$

with sizes 4, 6, ..., 6, 10. \(SP_{m-n-2}\) is the first polynomial exceeding the size limit. For each of the \(n+1\) preceding atomic blocks there was a satisfiability don’t care at the inputs, the size limit was not exceeded, and the corresponding polynomial has been pushed onto the backtracking stack ST. Now backtracking to \(SP_{m-n-1}\) takes place. (Note that it is easy to see that without backtracking and don’t care optimization the following \(n-1\) backward rewriting steps would quickly lead to a blow-up in the polynomial sizes, finally resulting in a polynomial of size \(2^{n+2} + 2\).) \(SP_{m-n-1}\) is optimized with the don’t care \((x_{n+1}, i_{n+1}) = (0, 0)\). As already explained in Sect. 5.1, don’t care optimization adds \(v \cdot (1 - x_{n+1}) \cdot (1 - i_{n+1})\) for the don’t care \((x_{n+1}, i_{n+1}) = (0, 0)\) to \(SP_{m-n-1}\) with a fresh integer variable v. For all valuations \((x_{n+1}, i_{n+1}) \ne (0, 0)\), \(v \cdot (1 - x_{n+1}) \cdot (1 - i_{n+1})\) evaluates to 0, thus we may choose an arbitrary integer value for v without changing the polynomial “inside the care space”. The choice of v is made such that the size of \(SP_{m-n-1}\) is minimized. So the task is to choose v such that the size of \(8a + 4b + 2c + x_1 + x_2 \ldots x_{n+1} i_{n+1} - x_1 x_2 \ldots x_{n+1} i_{n+1} + v - v i_{n+1} - v x_{n+1} + v x_{n+1} i_{n+1}\) is minimal. We achieve this by using an ILP solver to get a solution for v which maximizes the number of terms with coefficient 0 and therefore minimizes the polynomial. It is easy to see that the best choice is \(v=0\) in this case. This means that we arrive at an unchanged polynomial \(SP_{m-n-1}\) and the don’t care did not help. Then we do the replacement of \(a_{n+1}\) again, detect an exceeded size limit again, backtrack to \(SP_{m-n}\), and so on. Exactly as for \(SP_{m-n-1}\), don’t care assignment does not help for \(SP_{m-n}, \ldots , SP_{m-2}\).
The first really interesting case occurs when backtracking arrives at \(SP_{m-1}\). Adding \(v \cdot (1 - x_1) \cdot (1 - i_1)\) with a fresh variable v to \(SP_{m-1}\) results in \(8a + 4b + 2c + v + (1 - v) x_1 + (1 - v) i_1 + (v-1) x_1 i_1\), and choosing \(v = 1\) leads to the minimal polynomial \(8a + 4b + 2c + 1\), which is even independent of \(i_1\). Now replacing \(a_1, \ldots , a_{2n}\) no longer changes the polynomial and we finally arrive at \(SP_{m - 2n - 1} = 8a + 4b + 2c + 1\) (without further don’t care assignments).
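The choice of v at \(SP_{m-1}\) can be reproduced mechanically. The sketch below is an illustration only: the paper uses an ILP solver, which we replace here by a brute-force search over a few small integer values; polynomials are represented as dictionaries mapping monomials (sets of Boolean variables) to integer coefficients.

```python
# Polynomials over Boolean variables: {frozenset(vars): coefficient}.
# Since variables are Boolean, x^2 = x, so a monomial is a set of variables.

def padd(p, q):
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
        if r[m] == 0:
            del r[m]
    return r

def pmul(p, q):
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = m1 | m2                       # x^2 = x for Boolean variables
            r[m] = r.get(m, 0) + c1 * c2
    return {m: c for m, c in r.items() if c != 0}

def pscale(p, k):
    return {m: k * c for m, c in p.items()} if k else {}

ONE = {frozenset(): 1}
def var(x):
    return {frozenset([x]): 1}

# SP_{m-1} = 8a + 4b + 2c + x1 + i1 - x1*i1
sp = {frozenset({'a'}): 8, frozenset({'b'}): 4, frozenset({'c'}): 2,
     frozenset({'x1'}): 1, frozenset({'i1'}): 1, frozenset({'x1', 'i1'}): -1}

# Don't care term for (x1, i1) = (0, 0): v * (1 - x1) * (1 - i1).
dc = pmul(padd(ONE, pscale(var('x1'), -1)), padd(ONE, pscale(var('i1'), -1)))

# Stand-in for the ILP call: try small integer values of v and pick the
# one minimizing the number of terms.
best_v = min(range(-2, 3), key=lambda v: len(padd(sp, pscale(dc, v))))
optimized = padd(sp, pscale(dc, best_v))
```

With \(v = 1\) the polynomial collapses to \(8a + 4b + 2c + 1\), matching the minimal polynomial derived above.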

The example shows that the backtracking method works in principle, but at a huge cost: Backtracking potentially explores all possible combinations of assigning or not assigning don’t cares for atomic blocks with don’t cares, since backtrack points are stored again in line 10 of Algorithm 3 after successful as well as unsuccessful don’t care optimizations. In the example this leads to \(2^{n+1}\) rewritings for atomic blocks and \(2^{n+1} - 1\) unsuccessful don’t care optimizations, before we finally backtrack to \(SP_{m-1}\) where we do the relevant don’t care optimization.

Fig. 9

Subcircuit contained in the circuit of Example 2

Our goal is to come up with a don’t care optimization method which is robust against situations like the one illustrated in Example 2 where we have many blocks with don’t cares, but only a few of those don’t cares are really useful for minimizing the sizes of polynomials. As we will show in Sect. 8, we run into such situations when we verify restoring dividers using the method from [40].

6 Don’t care computation and optimization

6.1 Don’t care computation for extended atomic blocks

This section is motivated by [15, 38] which combine several gates and atomic blocks into fanout-free cones, compute polynomials for the fanout-free cones first and use those precomputed polynomials for “macro-gates” formed by the fanout-free cones during backward rewriting. Whereas in [15, 38] the purpose of forming those fanout-free cones is avoiding peaks in polynomial sizes during backward rewriting without don’t care optimization, the motivation here is different: Here we aim at detecting more and better don’t cares.

First of all, we detect atomic blocks for fixed known functions like full adders and half adders, as already mentioned in Sect. 5.1. The result is a circuit consisting of non-trivial atomic blocks and the remaining gates. Now we want to combine those atomic blocks and remaining gates into “extended atomic blocks (EABs)”, which are fanout-free cones of atomic blocks and remaining gates. To do so, we compute a directed graph \(G = (V, E)\) whose nodes correspond to the non-trivial atomic blocks, the remaining gates, and the outputs. There is an edge from a node v to a node w iff some output of the atomic block / gate corresponding to v is connected to an input of the atomic block / gate / output node corresponding to w. We compute the coarsest partition \(\{P_1, \ldots , P_l\}\) of V such that for every set \(P_i\) and every \(v \in P_i\) with more than one successor, none of v’s successors lies in \(P_i\). We combine all gates / atomic blocks in \(P_i\) into an EAB \(ea_i\).
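One standard way to obtain such a partition in a DAG is a union-find pass that merges every node with exactly one successor into that successor’s set, so that every multi-fanout node ends its cone. The sketch below (graph and names are hypothetical) illustrates the idea; we do not claim it reproduces the paper’s exact implementation.

```python
# Fanout-free cone clustering: merge each node with its successor whenever
# it has exactly one successor. In a DAG this yields a partition in which
# every node with several successors has all of them outside its own set.

def fanout_free_cones(nodes, edges):
    parent = {v: v for v in nodes}

    def find(v):                          # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    succs = {v: [] for v in nodes}
    for v, w in edges:
        succs[v].append(w)

    for v in nodes:
        if len(succs[v]) == 1:            # single fanout: absorb v into its cone
            parent[find(v)] = find(succs[v][0])

    cones = {}
    for v in nodes:
        cones.setdefault(find(v), set()).add(v)
    return sorted(map(sorted, cones.values()))

# Hypothetical example: g3 has fanout 2 and therefore ends its cone.
nodes = ['g1', 'g2', 'g3', 'g4', 'g5', 'g6']
edges = [('g1', 'g3'), ('g2', 'g3'), ('g3', 'g4'),
         ('g3', 'g5'), ('g4', 'g6'), ('g5', 'g6')]
print(fanout_free_cones(nodes, edges))
```

The call yields two cones, `['g1', 'g2', 'g3']` and `['g4', 'g5', 'g6']`: g3 keeps its two successors g4 and g5 outside its own set, as required.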

Algorithm 5

Computation of satisfiability don’t cares.

The computation of satisfiability don’t cares at the inputs of EABs that result from the input constraint \(IC\) (for dividers according to Definition 1, \(IC = 0 \le R^{(0)} < D \cdot 2^{n-1}\)) is performed for EABs as described in [40] for atomic blocks. The algorithm is shown in Algorithm 5. First of all, an intensive simulation (taking \(IC\) into account) excludes candidates for satisfiability don’t cares: value combinations at the inputs of EABs that are seen during simulation are excluded, finally resulting in a set \(dc\_cand (ea_j)\) for each EAB \(ea_j\). This candidate set is the starting point of Algorithm 5 (line 1). Whereas in principle it is possible to prove those candidates to be actual satisfiability don’t cares by using SAT, preliminary experiments showed that for large dividers it was not possible to confirm a sufficient number of don’t care candidates by SAT due to lack of resources. Restricting the SAT solving to windows of manageable size did not succeed either, since it seems that for the divider designs the existing don’t cares cannot be confirmed by local reasoning.

However, it turned out in our experiments (see Sect. 8) that a series of BDD-based image computations [13] is suitable for deriving all satisfiability don’t cares at the inputs of EABs: We start with a BDD representing the input constraint \(IC\). Then we identify the first EAB \(ea_i\) in the topological order which has a non-empty set of don’t care candidates (computed by the preceding simulation), see line 4 of Algorithm 5. EABs \(ea_1, \ldots , ea_{i-1}\) form the first slice \(slice_1\) in the circuit. The output signals of \(slice_1\) are exactly the signals connecting the slice with EABs \(ea_i, \ldots , ea_m\), i.e., by construction, the inputs of \(ea_i\) are outputs of \(slice_1\). We use BDD-based image computation to compute the BDD for the image \(\chi _1\) of \(IC\) under \(slice_1\) (line 5). Afterwards we evaluate \(\chi _1\) with respect to the don’t care candidates at the inputs of \(ea_i\). If the evaluation results in constant 0 for some candidate, then the candidate is not included in the image and it is confirmed as a don’t care (line 7). Then we continue with the next EAB \(ea_j\) with a non-empty set of don’t care candidates, choose \(ea_i, \ldots , ea_{j-1}\) to form the next slice, compute the image \(\chi _2\) of \(\chi _1\) under this slice, and so on. We repeat the procedure until all don’t care candidates have been classified as real don’t cares or not.
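For intuition, the slice-by-slice confirmation of don’t care candidates can be mimicked by explicit enumeration instead of BDDs (feasible only for tiny examples; the toy slice and all names below are our own, not from the paper). A candidate value combination at the inputs of an EAB is a confirmed satisfiability don’t care iff it does not occur in the image of the care set under the preceding slice.

```python
from itertools import product

def image(slice_fn, care_set):
    # Exact image of the care set under a slice. With BDDs this is a
    # symbolic image computation; here it is explicit enumeration.
    return {slice_fn(*vals) for vals in care_set}

def confirmed_dcs(candidates, img):
    # Candidates never occurring in the image are real don't cares.
    return {c for c in candidates if c not in img}

# Toy input constraint IC: exclude a = b = 1.
care = {(a, b) for a, b in product((0, 1), repeat=2) if not (a and b)}

# Toy slice feeding the next EAB: a half adder (sum, carry).
half_adder = lambda a, b: (a ^ b, a & b)

img = image(half_adder, care)          # {(0, 0), (1, 0)}
cands = {(1, 0), (1, 1)}               # candidates left over by simulation
print(confirmed_dcs(cands, img))       # only (1, 1) is confirmed
```

The candidate (1, 0) occurs in the image and is rejected, whereas (sum, carry) = (1, 1) can never occur under the constraint and is confirmed as a don’t care.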

If we apply the method to the optimized non-restoring divider in Fig. 3, the EABs below the dashed line are shown as dashed boxes. The numbers of satisfiability don’t cares at the inputs of the dashed boxes (after constant propagation!) are shown at the right sides of the boxes just above the full adders. For the first EAB, e.g., the number of don’t cares is 9, whereas for the atomic block (full adder) included in the EAB the number is only 2. At first sight, it is not clear that more don’t cares really help during don’t care based optimization, but we will show in Sect. 8 that this is definitely the case and that the use of extended atomic blocks is essential for a successful verification of large dividers.

For the BDD-based image computations we just use a static variable order. We choose it such that the bits \(r_i\) and \(d_i\) as well as \(r^{(0)}_{i+n-1}\) and \(d_i\) are arranged side by side and bits with higher indices are evaluated first, resulting in a linear-sized BDD for the input constraint \(IC\) (as can be seen in Fig. 10). Other variables corresponding to internal signals are ordered according to [30], extended to the case that the relative order of the previously mentioned variables has already been fixed. This static ordering keeps the intermediate BDD sizes occurring during the image computation steps in a manageable range (see Sect. 8 for detailed experimental data).

Fig. 10

BDD representing the input constraint with linear size

Fortunately, the don’t care computation from Algorithm 5 can be easily extended to a verification of the second verification condition (vc2). We will come back to this fact in Sect. 7.

6.2 Delayed don’t care optimization

In this section we introduce Delayed don’t care optimization (DDCO). DDCO is based on the observation that don’t care optimization as introduced in [40] is a local optimization that does not take its global effects into account. If backtracking goes back to a backtrack point with don’t cares, then it backtracks to a situation where backward rewriting for an (extended) atomic block with don’t cares at its inputs has taken place and the inputs of this block have been brought into the polynomial. The optimization locally minimizes the size of the polynomial using those don’t cares immediately, and the results of the optimization do not depend on rewriting steps which take place in the future. However, it is obvious that the future sizes of polynomials do depend on the future substitutions during backward rewriting, and therefore a local don’t care optimization may go in the wrong direction. For that reason we propose a delayed don’t care optimization that takes into account steps performed after the rewriting of the block for which the don’t cares are defined. Before we introduce DDCO, we illustrate the effect with an example.

Example 3

Consider the polynomial

$$\begin{aligned} p&= x_1 x_4 x_6 + x_2 x_4 x_6 + x_3 x_4 x_6 \\&\quad - x_1 x_2 x_4 x_6 - x_1 x_3 x_4 x_6 - x_2 x_3 x_4 x_6 + x_1 x_2 x_3 x_4 x_6 \\&\quad + x_1 x_5 x_6 + x_2 x_5 x_6 + x_3 x_5 x_6 \\&\quad - x_1 x_2 x_5 x_6 - x_1 x_3 x_5 x_6 - x_2 x_3 x_5 x_6 + x_1 x_2 x_3 x_5 x_6 \\&\quad - x_1 x_4 x_5 x_6 - x_2 x_4 x_5 x_6 - x_3 x_4 x_5 x_6 \\&\quad + x_1 x_2 x_4 x_5 x_6 + x_1 x_3 x_4 x_5 x_6 + x_2 x_3 x_4 x_5 x_6 - x_1 x_2 x_3 x_4 x_5 x_6 \end{aligned}$$

with size 21. Assume that the valuation \((x_1, x_2, x_3, x_4, x_5) = (0, 0, 0, 1, 1)\) is a don’t care. By using the don’t care optimization method from [40] which was already illustrated in Example 2, we arrive at a polynomial

$$\begin{aligned} q&= p + v x_4 x_5 - v x_1 x_4 x_5 - v x_2 x_4 x_5 - v x_3 x_4 x_5 + v x_1 x_2 x_4 x_5 \\&\quad + v x_1 x_3 x_4 x_5 + v x_2 x_3 x_4 x_5 - v x_1 x_2 x_3 x_4 x_5 \end{aligned}$$

with a new integer variable v. Since there is no pair of terms in q with the same monomial, \(v=0\) leads to the polynomial with the smallest number of terms. For every \(v \ne 0\), q has size 29 instead of 21. This shows that a local don’t care optimization with don’t care \((x_1, x_2, x_3, x_4, x_5) = (0, 0, 0, 1, 1)\) does not help in this example. Now assume that we perform a replacement of \(x_6\) by \(x_4 \cdot x_5\) in the polynomial q, resulting in

$$\begin{aligned} q'&= v x_4 x_5 + (1-v) x_1 x_4 x_5 + (1-v) x_2 x_4 x_5 + (1-v) x_3 x_4 x_5 \\&\quad + (v-1) x_1 x_2 x_4 x_5 + (v-1) x_1 x_3 x_4 x_5 + (v-1) x_2 x_3 x_4 x_5 \\&\quad + (1-v) x_1 x_2 x_3 x_4 x_5 \end{aligned}$$

Here it is easy to see that choosing \(v=1\) reduces \(q'\) to \(q'' = x_4 x_5\). I.e., performing local don’t care optimization before rewriting with \(x_6 = x_4 \cdot x_5\) does not help (it keeps all 21 terms of p and still leaves a polynomial with 7 terms after the rewriting step), whereas don’t care optimization after the rewriting step reduces the polynomial to a single term.
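Example 3 can be replayed mechanically. The following sketch (an illustration, not the paper’s implementation) expands p from its Boolean function, builds the don’t care term, and compares choosing v before and after the substitution \(x_6 = x_4 \cdot x_5\); a brute-force search over small integers stands in for the ILP call.

```python
def padd(p, q):
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + c
        if r[m] == 0:
            del r[m]
    return r

def pmul(p, q):
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = m1 | m2                       # x^2 = x for Boolean variables
            r[m] = r.get(m, 0) + c1 * c2
    return {m: c for m, c in r.items() if c != 0}

pscale = lambda p, k: {m: k * c for m, c in p.items()} if k else {}
ONE = {frozenset(): 1}
V = lambda x: {frozenset([x]): 1}
NOT = lambda p: padd(ONE, pscale(p, -1))
OR = lambda p, q: padd(padd(p, q), pscale(pmul(p, q), -1))

# p = (x1 v x2 v x3) & (x4 v x5) & x6, expanded into 21 terms.
p = pmul(pmul(OR(OR(V('x1'), V('x2')), V('x3')),
              OR(V('x4'), V('x5'))), V('x6'))

# Don't care (x1,...,x5) = (0,0,0,1,1): term v * (1-x1)(1-x2)(1-x3) * x4 * x5.
dc = pmul(pmul(NOT(V('x1')), pmul(NOT(V('x2')), NOT(V('x3')))),
          pmul(V('x4'), V('x5')))

def subst_x6(poly):                           # rewriting step x6 := x4 * x5
    r = {}
    for m, c in poly.items():
        m2 = (m - {'x6'}) | {'x4', 'x5'} if 'x6' in m else m
        r[m2] = r.get(m2, 0) + c
    return {m: c for m, c in r.items() if c != 0}

best_v = lambda q, dc: min(range(-2, 3),
                           key=lambda v: len(padd(q, pscale(dc, v))))

v_local = best_v(p, dc)               # optimizing before the rewriting step
v_delayed = best_v(subst_x6(p), dc)   # optimizing after the rewriting step
q2 = padd(subst_x6(p), pscale(dc, v_delayed))
```

Locally the best choice is \(v=0\) (the don’t care is useless); after the substitution, \(v=1\) collapses the polynomial to the single term \(x_4 x_5\).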

By generalizing Example 3 we can show the following lemma:

Lemma 8

Delayed don’t care optimization can be exponentially better than local don’t care optimization (even for a delay by only one rewriting step).

Proof

(Sketch) Consider a generalization of Example 3 from 6 to n variables, i.e., the polynomial \(p_n\) for the Boolean function \((\bigvee _{i=1}^{n-3} x_i) \wedge (x_{n-2} \vee x_{n-1}) \wedge x_n.\) It has \(3 \cdot (2^{n-3} - 1)\) terms.

Assume the don’t care \((x_1, \ldots , x_{n-3}, x_{n-2}, x_{n-1}) = (0, \ldots , 0, 1, 1)\). The don’t care optimization method from [40] generates a polynomial \(q_n\) with an additional integer variable v and \(2^{n-3}\) additional terms, leading to \(4 \cdot 2^{n-3} - 3\) terms in \(q_n\). Since there are no terms with shared monomials in \(q_n\), local don’t care optimization does not help and setting \(v=0\) leads to the polynomial with the smallest number of terms (which is again \(p_n\)).

Replacing \(x_n\) with \(x_{n-2} \cdot x_{n-1}\) leads to a polynomial \(q'_n\) with \(2^{n-3}\) terms, and after an optimal (delayed) don’t care assignment choosing \(v=1\) we arrive at the polynomial \(q''_n = x_{n-2} \cdot x_{n-1}\) with only one term. \(\square\)

Algorithm 6 shows an integration of DDCO into backward rewriting. In contrast to Algorithm 3, it does not use backtracking and it always “delays” don’t care optimization by d EAB rewriting steps. In the while loop from lines 2 to 16, don’t care terms with fresh integer variables \(v^{(i)}_j\) are immediately added to the polynomial \(SP_{i-1}\) for each don’t care of the current EAB \(ea_i\) (line 6), but those don’t cares may only be used with a delay of d EAB rewritings, i.e., in the iteration replacing \(ea_i\) only don’t cares coming from \(ea_{i+d}\) may be used. Therefore, younger don’t care variables are temporarily assigned to 0 in line 8, leading to a polynomial \(SP_{i-1}^{ tmp }\). Now the size of \(SP_{i+d}\) (which is the polynomial before rewriting with \(ea_{i+d}\)) is compared to the size \(dc0\_size\) of \(SP_{i-1}\) where the don’t care variables from \(ea_{i+d}\) are assigned to 0 as well (i.e., they are not used). If \(dc0\_size\) did not increase too much compared to the size of \(SP_{i+d}\) (“too much” is specified by a monotonically increasing function \(increase\)), then the don’t care variables from \(ea_{i+d}\) are permanently assigned to 0 (lines 11 and 12) in the current as well as all previous polynomials containing those variables. Otherwise, the known ILP based don’t care optimization is used and its results are inserted into \(SP_{i-1}\) and again also in all previous polynomials containing the don’t care variables from \(ea_{i+d}\) (lines 14 to 16).
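The control flow of Algorithm 6 can be sketched in simplified form. The callback signatures and bookkeeping below are ours, not the paper’s: each pending don’t care polynomial is kept beside the main polynomial with its variable still open, is rewritten along with it, and after a delay of d steps its variable is either fixed to 0 or handed to the optimizer. We replay the toy data of Example 3 and deliberately pass a strict `increase` bound so that the optimization branch is taken.

```python
from itertools import combinations

def ddco(sp, eabs, d, increase, rewrite, dc_term, optimize, size):
    # Schematic DDCO loop: no backtracking, don't care decisions delayed
    # by d rewriting steps (simplified bookkeeping compared to Algorithm 6).
    pending, size_before = [], {}
    for i, ea in enumerate(eabs):
        size_before[i] = size(sp)
        sp = rewrite(sp, ea)
        pending = [(j, rewrite(dcp, ea)) for j, dcp in pending]  # keep in sync
        if (dcp := dc_term(ea)) is not None:
            pending.append((i, dcp))                  # fresh dc variable v_i
        while pending and pending[0][0] <= i - d:     # delay elapsed for v_j
            j, dcp = pending.pop(0)
            if size(sp) > increase(size_before[j]):   # blowup: use the dc
                sp = optimize(sp, dcp)
            # otherwise v_j is fixed to 0 permanently (dc term dropped)
    return sp

# Toy data of Example 3: p = (x1 v x2 v x3)(x4 v x5) x6, don't care term
# (1-x1)(1-x2)(1-x3) x4 x5; polynomials as {frozenset(vars): coeff}.
xs = ('x1', 'x2', 'x3')
or3 = {frozenset(s): (-1) ** (k + 1)
       for k in (1, 2, 3) for s in combinations(xs, k)}
p = {m | t | {'x6'}: c * ct for m, c in or3.items()
     for t, ct in [(frozenset({'x4'}), 1), (frozenset({'x5'}), 1),
                   (frozenset({'x4', 'x5'}), -1)]}
dc = {frozenset(s) | {'x4', 'x5'}: (-1) ** k
      for k in range(4) for s in combinations(xs, k)}

def rewrite(poly, ea):                # ea 'sub': backward step x6 := x4 * x5
    if ea != 'sub':
        return poly
    r = {}
    for m, c in poly.items():
        m2 = (m - {'x6'}) | {'x4', 'x5'} if 'x6' in m else m
        r[m2] = r.get(m2, 0) + c
    return {m: c for m, c in r.items() if c != 0}

def optimize(sp, dcp):                # brute-force stand-in for the ILP solver
    def with_v(v):
        r = dict(sp)
        for m, c in dcp.items():
            r[m] = r.get(m, 0) + v * c
        return {m: c for m, c in r.items() if c != 0}
    return with_v(min(range(-2, 3), key=lambda v: len(with_v(v))))

dc_term = lambda ea: dc if ea == 'dcblock' else None
res = ddco(p, ['dcblock', 'sub'], 1, lambda s: 0,
           rewrite, dc_term, optimize, len)
```

With d = 1, the don’t care introduced by the first block is only decided after the substitution step, where the optimizer picks \(v = 1\) and `res` collapses to the single term \(x_4 x_5\).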

Algorithm 6

Rewriting with DDCO.

7 Discharging verification condition (vc2)

Verifying condition (vc1) from Definition 1 is not sufficient to prove that a circuit correctly implements a divider. We have to prove as well that for the remainder \(0 \le R < D\) holds (condition (vc2)). Unfortunately, backward rewriting starting with a polynomial for \(0 \le R < D\) fails, since already the polynomial representation for \(0 \le R < D\) has exponential size. However, the good news is that the don’t care computation of Algorithm 5 in Sect. 6.1 can be easily extended to a verification method for (vc2).

Assume that \(ea_{last}\) is the topologically last EAB with a non-empty set of don’t care candidates. Then, in the last iteration of the loop from lines 3 to 8, Algorithm 5 computes an image \(\chi _{last}\) of a slice that does not contain \(ea_{last}\). Now we extend Algorithm 5 by performing a final image computation step which computes the image \(\chi _{outputs}\) of \(\chi _{last}\) for the slice formed by all remaining EABs \(\{ea_{last}, \ldots , ea_l\}\) after the slice considered in the last step. It is easy to see that \(\chi _{outputs}\) is then the image of IC computed at the outputs of the whole circuit; it has just been computed (step by step) by a series of image computations for the different slices considered in Algorithm 5 and its extension. \(\chi _{outputs}\) tells us which value combinations can be observed at the outputs if inputs satisfying the input constraint IC are applied. Therefore it only has to be checked whether \(\chi _{outputs}\) implies \(0 \le R < D\). (Note that \(0 \le R < D\) has a small BDD representation similar to the representation of IC shown in Fig. 10.)
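Verification conditions (vc1) and (vc2) themselves are easy to state operationally. The following sketch (our own behavioral Python model of restoring division in the spirit of Algorithm 1, not the gate-level circuit) checks both conditions exhaustively for n = 4 under the input constraint \(0 \le R^{(0)} < D \cdot 2^{n-1}\).

```python
def restoring_divide(r0, d, n):
    # Behavioral restoring division: n-1 quotient bits, shift-and-subtract.
    r, q = r0, 0
    for i in reversed(range(n - 1)):
        t = r - (d << i)
        if t >= 0:                  # subtraction succeeded: quotient bit 1
            r, q = t, q | (1 << i)
        # else: "restore", i.e. keep r unchanged, quotient bit 0
    return q, r

n = 4
for d in range(1, 2 ** (n - 1)):        # divisor with sign bit 0
    for r0 in range(d * 2 ** (n - 1)):  # input constraint IC
        q, r = restoring_divide(r0, d, n)
        assert r0 == q * d + r          # (vc1)
        assert 0 <= r < d               # (vc2)
```

The exhaustive loop confirms both conditions for every constrained input; the point of the paper is, of course, to establish them symbolically for bit widths where such enumeration is hopeless.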

8 Experimental results

Our experiments have been carried out on one core of an Intel Xeon CPU E5-2643 with 3.3 GHz and 62 GiB of main memory. The run time of all experiments was limited to 24 CPU hours. All run times in Tables 1 and 2 are given in CPU seconds. A “TO” entry indicates a time out, i.e. exceeding the time limit. Similarly a “MO” entry indicates a memory out, i.e. exceeding the available memory. We used the ILP solver Gurobi [17] for solving the ILP problems for don’t care optimization of polynomials. For image computations we used the BDD package CUDD 3.0.0 [42]. Benchmarks and binaries are available for download at [21].

Table 1 Verifying dividers with non-SCA tools for comparison, times in CPU seconds
Table 2 Verifying dividers with old and new SCA methods, times in CPU seconds

In our experiments we consider verification of three different types of divider benchmarks (Col. 1 in Tables 1, 2, and 3) with different bit widths n (Col. 2 in Tables 1 to 3; the bit width n is defined as explained in Definition 1, i.e., the divisor has length n and the dividend has length \(2n-1\), both with sign bit 0). First we verify non-restoring dividers “non-res\(_1\)” as seen in Fig. 3 (with the gray full adder included), which were also used in [40]. Second, we consider further optimized non-restoring dividers “non-res\(_2\)” that omit the gray full adder shown in Fig. 3. Last, we look into verification of restoring dividers as shown in Fig. 2. Note that we did not make use of any hierarchy information during verification, but only used the flat gate-level netlist (numbers of gates are shown in Col. 3 of Table 3) and employed heuristics for detecting atomic blocks as well as for finding good substitution orders [25, 26].

Table 3 Detailed verification statistics of our new tool

8.1 Comparison with Non-SCA tools

We begin with six experiments for comparison where, in contrast to our approach, we need a “golden specification” circuit for the divider which we compare to the corresponding implementations. As golden specification circuit we choose a simple restoring divider without any optimizations which implements Algorithm 1. Now we check the equivalence of the divider circuits with the golden specification by building a miter circuit and additionally restricting counterexamples to the allowed range \(0 \le R^{(0)} < D \cdot 2^{n-1}\) of inputs. Results are shown in Table 1. For the first four experiments we use SAT solvers for equivalence checking. For this, we translate the miter circuit (conjoined with the input constraint) into Conjunctive Normal Form (CNF). We consider two methods for building the CNFs. First, we directly use a Tseitin transformation [45] (see Cols. 3 and 5, “Tseitin”). Second, we try to improve the encoding by preprocessing the miter circuits with the logic synthesis tool ABC [2, 3] before translating them into CNF by Tseitin transformation. More precisely, we used the ABC commands “strash; drwsat; &get” to obtain more optimized CNFs (see Cols. 4 and 6, “opt.”). We used two different SAT solvers to solve the corresponding satisfiability problems: MiniSat 2.2.0 [14] and SBVA-CaDiCaL, the winner of the SAT Competition 2023 [1]. The results for MiniSat 2.2.0 are shown in Cols. 3 and 4, the results for SBVA-CaDiCaL in Cols. 5 and 6. The run times (in CPU seconds) in Cols. 3 to 6 show, for larger instances, an advantage of SBVA-CaDiCaL over MiniSat 2.2.0, whereas the results for the two different encodings are mixed. In summary, however, the results show an exponential run time behavior of the SAT-based approaches, with timeouts for dividers with bit widths larger than 16.
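For illustration, the Tseitin transformation of a small gate list can be written down directly. The clause patterns for AND and XOR below are the standard ones; the circuit, variable numbering, and function names are our own. A real miter for divider equivalence checking is built the same way, only over many more gates and with additional clauses for the input constraint.

```python
from itertools import product

def tseitin(gates):
    # gates: list of (out, op, a, b); variables are positive integers,
    # a negative literal -v denotes the negation of variable v.
    clauses = []
    for y, op, a, b in gates:
        if op == 'AND':     # y <-> a & b
            clauses += [[-a, -b, y], [a, -y], [b, -y]]
        elif op == 'XOR':   # y <-> a ^ b
            clauses += [[-a, -b, -y], [a, b, -y], [a, -b, y], [-a, b, y]]
    return clauses

def holds(clauses, assign):
    # assign: dict variable -> bool; a clause holds if some literal is true.
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

# Half adder: s = a ^ b, c = a & b (variables: a=1, b=2, s=3, c=4).
gates = [(3, 'XOR', 1, 2), (4, 'AND', 1, 2)]
cnf = tseitin(gates)

# Sanity check: for every input, the unique consistent assignment of the
# Tseitin variables matches the circuit's outputs.
for a, b in product((False, True), repeat=2):
    models = [dict(zip((1, 2, 3, 4), (a, b, s, c)))
              for s, c in product((False, True), repeat=2)
              if holds(cnf, {1: a, 2: b, 3: s, 4: c})]
    assert models == [{1: a, 2: b, 3: a ^ b, 4: a and b}]
```

The per-gate clauses make the CNF linear in the circuit size, which is exactly why the SAT-based flow scales in formula size even though, as the table shows, solving times for dividers still grow exponentially.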

In the next experiment we considered the combinational equivalence checking (CEC) approach of ABC [2, 3]. Since it is based on And-Inverter Graph (AIG) rewriting via structural hashing, simulation, and SAT, the equivalence checking between two designs is reduced to finding equivalent internal AIG nodes. The results are similar to those obtained with MiniSat 2.2.0 and SBVA-CaDiCaL: ABC cannot verify the dividers with bit widths larger than 8, see Col. 7.

In a sixth experiment we used a leading commercial verification tool for formal property and equivalence checking. As Col. 8 shows, the commercial tool is also able to verify the 16-bit dividers; for the restoring dividers it even verifies the 32-bit divider in about 15 CPU hours, but it does not finish within the time limit for larger benchmarks.

8.2 Comparison of old and new SCA methods

Next we ran three experiments to compare our old tools and methods (discussed in Sect. 5) with our new methods presented in Sect. 6. The run time results are shown in Table 2; more detailed statistics on the benchmarks and on the verification with our new tool can be found in Table 3. From Col. 3 in Table 2 we can see that the method from [40] performs very well for the verification of the non-res\(_1\) dividers. Col. 4 of Table 2 (“#bt”) shows how many backtrack operations were actually performed. For the non-res\(_2\) benchmarks the method exceeds the available memory for 16 bits and larger; for the restoring ones, already for 8 bits. As already shown by our analysis from Sect. 5 (see Fig. 8), equivalence/antivalence computation and don’t care optimizations on atomic blocks as used in [40] are not strong enough to avoid exponential blowups of polynomials for the non-res\(_2\) dividers. For restoring dividers the situation is similar.

In the next experiment we evaluate our new approach of using EABs for don’t care computation instead of atomic blocks as used in [40] (at first without DDCO). For non-res\(_1\) dividers (where the method from [40] already performed very well) this approach is somewhat slower than the original method, see Cols. 3 and 5 in Table 2. The reason is that using EABs instead of atomic blocks as in [40] leads to more blocks where don’t cares are applicable, whereas the number of don’t care optimizations which are really necessary stays the same. This can be seen in Cols. 4 and 6 of Table 2, which compare the numbers of performed backtracks. The version with EABs performs additional backtracks to backtrack points where optimization does not help, and it has to store a larger number of backtrack points. This even leads to running out of available memory for the 512-bit instance of non-res\(_1\). On the other hand, the use of EABs alone already makes it possible to verify the non-res\(_2\) dividers up to 256 bits in about 2 hours. Since don’t care optimizations on atomic blocks as used in [40] are not strong enough to avoid exponential blowups for the non-res\(_2\) dividers (as already mentioned above), using EABs is indispensable. However, the approach is not able to verify restoring dividers with bit widths larger than 64, see Col. 5 in Table 2, due to increasing run times and memory consumption. This can be explained by the larger number of EABs with non-empty don’t care sets for restoring dividers compared to non-restoring dividers. These numbers are given in Col. 4 (“#EABs with DCs”) of Table 3. They grow only linearly for non-restoring dividers, but quadratically for restoring dividers. More EABs with non-empty don’t care sets lead to an increased memory consumption for storing more backtrack points and to increased run times caused by extensive backtracking. The effect occurring here has already been illustrated in Example 2 of Sect. 5.3, where we have to perform an exponential number of unsuccessful backtracks before finally arriving at the relevant don’t care optimization. For the 64-bit non-res\(_2\) divider, e.g., the approach needs less than 50 seconds with 205 backtracks, whereas the corresponding restoring divider only finishes in about 15 minutes with 3047 backtracks (Cols. 5, 6 in Table 2).

Col. 7 of Table 2 shows that those difficulties can be overcome by using our novel DDCO method. It turned out that already the simplest possible parameter choice of \(d=1\) and \(increase (size) = size+1\) in Alg. 6 is successful. We were even able to verify the 256-bit restoring divider in less than 9.5 CPU hours, and both 512-bit instances of non-res\(_1\) and non-res\(_2\) could be verified in about 7.5 hours. Comparing the numbers of EABs with non-empty don’t care sets (Col. 4, “#EABs with DCs”) with the actual numbers of don’t care optimizations performed (Col. 5, “#DC opt.”) in Table 3, we observe that, in particular for restoring dividers, DDCO performs don’t care optimizations only for a small fraction of the EABs with non-empty don’t care sets. The effect is especially visible for larger instances; for the 256-bit restoring divider, e.g., this fraction is less than 1%.

Col. 6 of Table 3 gives the peak polynomial sizes during backward rewriting in our new approach, counted in numbers of monomials. It can be observed that these peak sizes grow quadratically with the bit width. This shows that our methods are really successful in keeping the polynomial sizes small, since already the specification polynomial is quadratic in n. Figure 11 visualizes this by displaying the progress of the polynomial size during backward rewriting for the 32-bit non-res\(_2\) benchmark. Finally, Col. 7 of Table 3 shows the maximal BDD sizes occurring during the image computations described in Algorithm 5. The results show that an extension of the static variable ordering heuristics according to [30] was successful in keeping the intermediate BDD sizes within a manageable range and that the slice-based image computation for don’t care computation was feasible for the non-restoring dividers non-res\(_1\) and non-res\(_2\) as well as for the restoring dividers.

Fig. 11

Growth behaviour of polynomials for 32-bit non-res\(_2\) divider with our new method

Note that the run times in Table 2 as well as the BDD sizes in Table 3 include discharging verification condition (vc2).

In summary, the presented results show that our new method is able to successfully verify not only the divider benchmarks from [40], but also new divider architectures for which the previous approach fails.

9 Conclusions and future work

We analyzed weaknesses of previous approaches that enhanced backward rewriting in an SCA approach with forward information propagation, and we presented two major contributions to overcome those weaknesses. The first contribution is the usage of Extended Atomic Blocks to enable stronger don’t care computations. The second one is the new method of Delayed Don’t Care Optimization, which has two benefits: First, it performs don’t care optimizations in a more global rewriting context instead of seeking only local optimizations of polynomials; second, it effectively minimizes the number of don’t care optimizations compared to considering all possible combinations of using/not using don’t cares of EABs which can potentially occur in a backtracking approach. We showed that our new method is able to verify large divider designs as well as different divider architectures. For the future, we believe that the general approach of combining backward rewriting with forward information propagation will be a key concept to verify further divider architectures as well as other arithmetic circuits at the gate level.