1 Notation and main result

Let \({\mathbb {F}}_{{\mathcal {N}}}\) denote the set of normalized precision-p base-\(\beta \) floating-point numbers

$$\begin{aligned} {\mathbb {F}}_{{\mathcal {N}}}:= \{\pm m \beta ^e \;\;\text{ with }\;\; \beta ^{p-1} \leqslant m \leqslant \beta ^p-1 \;\;\text{ and }\;\; E_{\min } \leqslant e \leqslant E_{\max } \}, \end{aligned}$$
(1.1)

and denote by

$$\begin{aligned} {\mathbb {F}}_{{\mathcal {D}}}:= \{\pm m \beta ^{E_{\min }} \;\;\text{ with }\;\; 1 \leqslant m < \beta ^{p-1} \} \end{aligned}$$
(1.2)

the set of denormalized numbers. Then \({\mathbb {F}}:= {\mathbb {F}}_{{\mathcal {N}}} \cup {\mathbb {F}}_{{\mathcal {D}}} \cup \{0\}\) is the set of all precision-p base-\(\beta \) floating-point numbers. Set \({\mathbb {F}}^*:= {\mathbb {F}}\cup \{-\infty ,\infty \}\) and let an arithmetic on \({\mathbb {F}}^*\) following the IEEE 754 standard [7, 8] be given. That means in particular that in RoundToNearest all floating-point operations have minimal error, bounded by the relative rounding error unit \(\textbf{u}:= \frac{1}{2} \beta ^{1-p}\). Moreover, different rounding modes are available, also with best possible result.

In [19] we introduced the “unit in the first place” (ufp) which is defined by

$$\begin{aligned} 0 \ne r \in {\mathbb {R}}\quad \Rightarrow \quad \textrm{ufp}(r):= \beta ^{\lfloor \log _{\beta }|r| \rfloor } \end{aligned}$$

and \(\textrm{ufp}(0):=0\). For all real \(r \in {\mathbb {R}}\) it is the place value of the left-most nonzero digit in the base-\(\beta \) representation of r.
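For illustration, the definition can be evaluated exactly with rational arithmetic, avoiding the pitfalls of a floating-point logarithm near powers of \(\beta \); the following Python sketch (function name ours) scales by \(\beta \) instead of computing \(\lfloor \log _{\beta }|r| \rfloor \) directly:

```python
from fractions import Fraction

def ufp(r, beta=10):
    """ufp(r) = beta**floor(log_beta |r|) for r != 0, ufp(0) = 0,
    evaluated by exact scaling instead of a floating-point logarithm."""
    if r == 0:
        return Fraction(0)
    x = abs(Fraction(r))
    u = Fraction(1)
    while x >= u * beta:   # increase the unit while beta^(e+1) <= |r|
        u *= beta
    while x < u:           # decrease the unit while |r| < beta^e
        u /= beta
    return u

print(ufp(Fraction(42)))        # 10
print(ufp(Fraction(-3, 1000)))  # 1/1000
print(ufp(Fraction(5), beta=2)) # 4
```

Note that ufp is independent of any precision p; only the base \(\beta \) enters.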

In contrast, the often used “unit in the last place” (ulp) depends on the precision of the floating-point format in use. For a nonzero finite base-\(\beta \) string it is the magnitude of its least significant digit, or in other words, the distance between the floating-point number and the next floating-point number of greater magnitude [6]. There are several other definitions of the unit in the last place, in particular for real \(r \notin {\mathbb {F}}\), cf. [2, 12, 13]. We use the definition above, namely \(\textrm{ulp}(r) = \beta ^e\) for \(r \in {\mathbb {F}}_{{\mathcal {N}}}\) according to (1.1), \(\textrm{ulp}(r) = \beta ^{E_{\min }}\) for \(r \in {\mathbb {F}}_{{\mathcal {D}}}\), and \(\textrm{ulp}(0)=0\). All definitions have in common that they depend not only on the base \(\beta \) but also on the precision of the floating-point format in use.

We invented the unit in the first place in [19] because it proved very helpful, if not mandatory, in formulating the involved correctness proofs of our new floating-point algorithms for accurate summation and dot products. We developed a small collection of rules based on ufp, so that no further knowledge of the many properties of IEEE 754 floating-point arithmetic is necessary to follow the proofs.

The main difference between the definitions of \(\textrm{ulp}\) and \(\textrm{ufp}\) is the separation of base and precision: first, ufp is defined for a general real number and depends only on the base \(\beta \); second, precision-related results use ufp together with the relative rounding error unit, i.e., the precision p. That separation was useful for formulating our proofs in [19] and subsequent papers.

There are simple algorithms to compute the predecessor, successor, unit in the first place, unit in the last place etc. in binary arithmetic [3, 5, 10, 11, 13, 14], but apparently no method is known to compute the unit in the first place in a base-\(\beta \) arithmetic. Jean-Michel Muller [15] proposed a method based on the results in [9], however, it needs up to \(\log _2(\beta )\) iterations. We are interested in a flat, loop-free algorithm with few operations.
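In the binary case such flat, loop-free computations are indeed readily available; for binary64 (\(\beta =2\), \(p=53\)), Python's standard library exposes them directly (math.ulp and math.nextafter require Python 3.9+), a small sketch:

```python
import math

x = 42.0
# ufp: frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1,
# so the unit in the first place is 2**(e-1)
m, e = math.frexp(x)
ufp_x = math.ldexp(1.0, e - 1)        # 2**5 = 32

# ulp of a binary64 number (precision p = 53)
ulp_x = math.ulp(x)                   # 2**(5 - 52) = 2**-47

# predecessor / successor: one rounding step with nextafter
pred_x = math.nextafter(x, -math.inf)
succ_x = math.nextafter(x, math.inf)

print(ufp_x, ulp_x, x - pred_x, succ_x - x)
```

Since 42 is not a power of 2, the gaps to predecessor and successor both equal the ulp here.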

Recently we wrote a toolbox for an IEEE 754 precision-p base-\(\beta \) arithmetic with specifiable exponent range [18] as part of INTLAB [16], the Matlab/Octave toolbox for reliable computing. As part of this we present in this note a simple algorithm to compute the unit in the first place for a precision-p base-\(\beta \) arithmetic with \(p \geqslant 1\) and \(\beta \geqslant 2\). The algorithm works correctly in the underflow range, where numbers close to overflow are treated by scaling. That algorithm requires a directed rounding, i.e., RoundToZero, RoundUp or RoundDown; we could not construct a simple algorithm in RoundToNearest.

In addition, as a reply to suggestions by the referees, we present some additional algorithms to compute ufp and ulp. Those require a specific directed rounding mode and/or access to the predecessor/successor of a floating-point number. Since these algorithms are fairly obvious and the proofs of correctness are trivial, we relegated them to the appendix.

We formulate our algorithm to compute the unit in the first place in the rounding mode RoundToZero and call the corresponding mapping \(\textrm{fl}_{\diamond }: {\mathbb {R}}\rightarrow {\mathbb {F}}\). It follows that the result of a floating-point operation with positive real result x is \(\max \{f \in {\mathbb {F}}: f \leqslant x\}\), and that operations cannot cause overflow.
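The behavior of \(\textrm{fl}_{\diamond }\) can be illustrated with Python's decimal module, which emulates a precision-p base-10 arithmetic with a selectable rounding mode (this is merely an illustration, not the flbeta toolbox of [18]):

```python
from decimal import Decimal, Context, ROUND_DOWN

# A precision-3 base-10 context; ROUND_DOWN chops toward zero (RoundToZero)
ctx = Context(prec=3, rounding=ROUND_DOWN)

# fl(x) for positive x is max{f in F : f <= x}: 2.347 is chopped to 2.34
print(ctx.plus(Decimal("2.347")))                      # 2.34
# the exact product 1.23 * 4.56 = 5.6088 is chopped to 5.60
print(ctx.multiply(Decimal("1.23"), Decimal("4.56")))  # 5.60
# toward zero, not downward: a negative result is chopped upward
print(ctx.plus(Decimal("-2.347")))                     # -2.34
```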

The predecessor and successor of \(x \in {\mathbb {R}}\) in \({\mathbb {F}}^*\) is defined by

$$\begin{aligned} \textrm{pred}(x)&:= \max \{ g \in {\mathbb {F}}^*: g < x \}, \\ \textrm{succ}(x)&:= \min \{ g \in {\mathbb {F}}^*: x < g \}, \end{aligned}$$

respectively. In precision-p base-\(\beta \) arithmetic we have

$$\begin{aligned} E_{\min } < k \leqslant E_{\max }&\;\Rightarrow \;&\textrm{pred}(\beta ^k) = (1-\beta ^{-p}) \beta ^k \end{aligned}$$
(1.3)
$$\begin{aligned} 0 <f \in {\mathbb {F}}_{{\mathcal {N}}} \;\text{ and }\; f \ne \textrm{ufp}(f)&\;\Rightarrow \;&\textrm{pred}(f) = f - \beta ^{1-p} \textrm{ufp}(f) \end{aligned}$$
(1.4)
$$\begin{aligned} 0 <f \in {\mathbb {F}}_{{\mathcal {N}}}&\;\Rightarrow \;&\textrm{succ}(f) = f + \beta ^{1-p} \textrm{ufp}(f) \end{aligned}$$
(1.5)

Note that (1.5) includes the case \(p=1\), \(\beta =2\) for which \({\mathbb {F}}\) is the set of powers of 2, i.e., \(f = \textrm{ufp}(f)\) for all \(f \in {\mathbb {F}}\), and \(\textrm{succ}(f) = f + \beta ^{1-p}\textrm{ufp}(f) = 2f\). Among the properties of ufp [19] is

$$\begin{aligned} 0 \ne f \in {\mathbb {F}}&\;\Rightarrow \;&\textrm{ufp}(f) \leqslant |f| \leqslant \beta (1-\beta ^{-p}) \cdot \textrm{ufp}(f). \end{aligned}$$
(1.6)
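For binary64 (\(\beta =2\), \(p=53\)) the relations (1.3)–(1.5) can be checked directly; a small Python check using math.nextafter (Python 3.9+) for predecessor and successor:

```python
import math

# (1.3)-(1.5) checked in IEEE 754 binary64, i.e. beta = 2, p = 53
p = 53
f = 1.0                                     # a power of beta: f = ufp(f)
assert math.nextafter(f, -math.inf) == (1 - 2.0 ** -p) * f        # (1.3)
assert math.nextafter(f, math.inf) == f + 2.0 ** (1 - p) * f      # (1.5)

g = 1.5                                     # g != ufp(g) = 1
assert math.nextafter(g, -math.inf) == g - 2.0 ** (1 - p) * 1.0   # (1.4)
assert math.nextafter(g, math.inf) == g + 2.0 ** (1 - p) * 1.0    # (1.5)
print("(1.3)-(1.5) verified for binary64")
```

All products and sums above are exactly representable, so the comparisons are not affected by rounding.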

Next we present in Fig. 1 our algorithm to compute \(\textrm{ufp}(f)\) for \(f \in {\mathbb {F}}\) in precision-p base-\(\beta \) arithmetic and RoundToZero or RoundDown. It is obvious how to adapt the algorithm for RoundUp. We assume that subrealmin, the smallest positive denormalized floating-point number equal to \(\beta ^{E_{\min }}\), is available. Overflow is easily avoided by proper scaling, but we omit that technical detail. Note that in a practical implementation, the constants p1 and phi in lines 2 and 3 of Algorithm ufp would be stored rather than calculated, and the extra input parameters p and beta would be omitted.

Fig. 1

Algorithm ufp in RoundToZero or RoundDown
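The proof below determines the steps of Algorithm ufp: line 2 computes \(p_1 = \textrm{pred}(1)\) (plausibly as \(\textrm{fl}_{\diamond }(1-\texttt {subrealmin})\)), line 3 computes \(\varphi = \beta ^{p-1}+1\), then \(q = \textrm{fl}_{\diamond }(\varphi |f|)\), \(r = \textrm{fl}_{\diamond }(p_1 q)\) and \(S = \textrm{fl}_{\diamond }(q-r)\). The following Python sketch simulates a precision-p base-\(\beta \) RoundToZero arithmetic with exact rationals (unbounded \(E_{\max }\), so the overflow scaling is omitted) and exercises this reconstruction; all names are ours:

```python
from fractions import Fraction

def fl_down(x, p, beta, Emin):
    """RoundToZero for x >= 0: the largest precision-p base-beta float <= x,
    with gradual underflow (spacing beta**Emin below the normalized range).
    Emax is taken as unbounded, so the overflow scaling of the note is omitted."""
    if x == 0:
        return Fraction(0)
    e = 0                                       # find e with beta^(e-1) <= x < beta^e
    while Fraction(beta) ** e <= x:
        e += 1
    while Fraction(beta) ** (e - 1) > x:
        e -= 1
    step = Fraction(beta) ** max(e - p, Emin)   # unit in the last place at x
    return (x // step) * step                   # chopping = rounding toward zero

def ufp_alg(f, p, beta, Emin):
    """Sketch of Algorithm ufp as reconstructed from the proof of Theorem 1.1:
    p1 = pred(1), phi = beta^(p-1)+1, q = fl(phi*|f|), r = fl(p1*q), S = fl(q-r)."""
    fl = lambda x: fl_down(x, p, beta, Emin)
    p1 = fl(1 - Fraction(beta) ** Emin)      # line 2: 1 - subrealmin -> pred(1)
    phi = fl(Fraction(beta) ** (p - 1) + 1)  # line 3: exact, since <= beta^p
    q = fl(phi * abs(Fraction(f)))           # line 4
    r = fl(p1 * q)                           # line 5
    return fl(q - r)                         # line 6

# precision-3 decimal: ufp(42) = 10; a denormalized input also works
assert ufp_alg(Fraction(42), 3, 10, -20) == 10
assert ufp_alg(Fraction(5, 10**20), 3, 10, -20) == Fraction(1, 10**20)
```

In this simulation the case \(p=1\), \(\beta =2\) of Theorem 1.1 can be tried as well, e.g. ufp_alg(Fraction(5), 1, 2, -10) yields 4.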

Theorem 1.1

Let S be the result of Algorithm ufp applied to \(f \in {\mathbb {F}}\), where \(E_{\min } \leqslant -1 < p \leqslant E_{\max }\). Suppose that all operations are executed in precision-p base-\(\beta \) floating-point arithmetic following the IEEE 754 standard with \(p \geqslant 1\) and \(\beta \geqslant 2\) in RoundToZero or RoundDown, and that \(|f| < \beta ^{E_{\max }-p+1}\). Then S is equal to \(\textrm{ufp}(f)\).

Remark 1.1

The usual problems in the denormalized range are avoided because \(q \in {\mathbb {F}}_{{\mathcal {N}}}\), so that the multiplication in line 5 is in the normalized range. The result of the final subtraction may be in the denormalized range but is error-free because of Sterbenz’ lemma [21].

Proof

The result is correct for \(f=0\), so henceforth we assume \(f \ne 0\). We first verify that the used constants p1 and phi are in \({\mathbb {F}}\). The rounding RoundToZero or RoundDown implies that \(p_1\) in line 2 is the predecessor of 1, and (1.3) and \(E_{\min } \leqslant -1\) yield \(p_1 = 1-\beta ^{-p}\). Moreover, \(\varphi \in {\mathbb {F}}\) follows by \(\beta ^{p-1}+1 \leqslant \beta ^p \leqslant \beta ^{E_{\max }}\). Note that this includes the case \(\varphi =2\) for \(p=1\).

The input f is used only in line 4, and since \(\textrm{ufp}(f)=\textrm{ufp}(|f|)\) we may henceforth assume without loss of generality that \(f > 0\). The monotonicity of the rounding, (1.6) and (1.5) imply

$$\begin{aligned} \varphi f&\leqslant (\beta ^{p-1}+1)\beta (1-\beta ^{-p})\cdot \textrm{ufp}(f) = ( \beta ^p + \beta - 1 - \beta ^{1-p}) \cdot \textrm{ufp}(f) \\ &< (1+\beta ^{1-p})\beta ^p \textrm{ufp}(f) = \textrm{succ}(\beta ^p\textrm{ufp}(f)), \end{aligned}$$

so that the rounding mode implies \(q = \textrm{fl}_{\diamond }(\varphi f) \leqslant \beta ^p\textrm{ufp}(f)\). Therefore,

$$\begin{aligned} \beta ^{p-1}\textrm{ufp}(f) \leqslant \textrm{ufp}(q) \leqslant \beta ^p\textrm{ufp}(f). \end{aligned}$$
(1.7)

Hence q is always in the normalized range \({\mathbb {F}}_{{\mathcal {N}}}\) and \(f < \beta ^{E_{\max }-p+1}\) yields \(\textrm{ufp}(f) \leqslant \beta ^{E_{\max }-p}\) and \(q \leqslant \beta ^p\textrm{ufp}(f) \leqslant \beta ^{E_{\max }}\).

We distinguish two cases. First, assume \(\textrm{ufp}(q) = \beta ^p\textrm{ufp}(f)\), which implies that \(q = \beta ^p\textrm{ufp}(f)\) is a power of \(\beta \). Then \(q \geqslant \beta ^p \beta ^{E_{\min }} > \beta ^{E_{\min }}\) and (1.3) yield

$$\begin{aligned} r:= \textrm{fl}_{\diamond }((1-\beta ^{-p}) q) = \textrm{pred}(q) = (1-\beta ^{-p}) q \end{aligned}$$

and therefore \(S = \textrm{fl}_{\diamond }(q-r) = \textrm{fl}_{\diamond }(\beta ^{-p}q) = \textrm{fl}_{\diamond }(\textrm{ufp}(f)) = \textrm{ufp}(f)\). According to (1.7), it remains to consider the second case

$$\begin{aligned} \textrm{ufp}(q) = \textrm{ufp}( \textrm{fl}_{\diamond }((\beta ^{p-1}+1) f)) = \beta ^{p-1}\textrm{ufp}(f). \end{aligned}$$
(1.8)

Note that the case \(p=1\), \(\beta =2\) belongs to the first case \(\textrm{ufp}(q) = \beta ^p\textrm{ufp}(f)\). Next, \(\beta ^{p-1}f \in {\mathbb {F}}_{{\mathcal {N}}} \) and (1.5) give

$$\begin{aligned} q&= \textrm{fl}_{\diamond }((\beta ^{p-1}+1)f) = \textrm{fl}_{\diamond }((1+\beta ^{1-p})\beta ^{p-1}f) \\ &\geqslant \textrm{fl}_{\diamond }(\beta ^{p-1}f + \beta ^{1-p}\textrm{ufp}(\beta ^{p-1}f)) = \textrm{succ}(\beta ^{p-1}f)\\ &\geqslant \textrm{succ}(\beta ^{p-1}\textrm{ufp}(f)) = \textrm{succ}(\textrm{ufp}(q)). \end{aligned}$$

The monotonicity of the rounding, (1.6), \(q > \textrm{ufp}(q)\) and (1.4) yield

$$\begin{aligned} q&= \textrm{fl}_{\diamond }(q) > \textrm{fl}_{\diamond }((1-\beta ^{-p})q) =: r \\ &\geqslant \textrm{fl}_{\diamond }(q - \beta ^{1-p}(1-\beta ^{-p})\textrm{ufp}(q)) \geqslant \textrm{fl}_{\diamond }(q - \beta ^{1-p}\textrm{ufp}(q)) \\ &= \textrm{pred}(q), \end{aligned}$$

and therefore \(r = \textrm{pred}(q) = q - \beta ^{1-p}\textrm{ufp}(q) = q - \textrm{ufp}(f)\). Hence \(S = \textrm{fl}_{\diamond }(q-r) = \textrm{fl}_{\diamond }(\textrm{ufp}(f)) = \textrm{ufp}(f)\). The theorem is proved. \(\square \)

Algorithm ufp will be part of the flbeta toolbox in INTLAB. Executable INTLAB code, which is almost identical to the one given in Fig. 1, is shown in Fig. 2.

Fig. 2

Algorithm ufp in executable INTLAB code

Here flbeta is a user-defined data type, where the precision \(p \geqslant 1\), the base \(\beta \geqslant 2\) as well as the exponent range \((E_{\min },E_{\max })\) can be specified through initialization by flbetainit. As in every operator concept, an operation is executed in flbeta-arithmetic if at least one of the operands is of type flbeta. The flbeta toolbox respects the rounding mode; in line 2 it is switched to RoundToZero using the internal Matlab command feature.

Called as in line 3, without input and with one output argument, p = flbetainit returns the precision p in use. The constructor flbeta(m,e) generates the flbeta constant \(m\beta ^e\). Otherwise the code is self-explanatory.

Finally we want to mention that the flbeta toolbox was very useful for testing with different precisions p, different bases \(\beta \) and exponent ranges \(E_{\min },E_{\max }\). Frankly speaking, we found Algorithm ufp experimentally when playing around with different possibilities. However, we did not find a simple algorithm in the nearest rounding RoundTiesToEven.

We close the main part of this note with some open problems. As mentioned, we did not succeed in finding a simple algorithm to compute ufp solely in rounding to nearest. Here “simple” means few operations and no loop.

Problem 1.1

Given a precision-p base-\(\beta \) arithmetic following IEEE 754, find a simple algorithm to compute the unit in the first place (ufp) in rounding to nearest.

The problem is solved [17] in binary for \(p \geqslant 1\).

Problem 1.2

Given a precision-p base-\(\beta \) arithmetic following IEEE 754, find a simple algorithm to compute the unit in the last place (ulp) in rounding to nearest.

Concerning units of a floating-point number, there is a third quantity of interest, namely, the magnitude of the least nonzero digit in a finite base-\(\beta \) representation. Historically [15], Shewchuk [20] used this quantity implicitly to define his “nonoverlapping expansions”; it appears in [4] with the notation \(\omega (f)\), and in [1] the notation \(\textrm{uls}(f)\) (unit in the least significant place) is used. For example, in a precision-3 decimal arithmetic and \(f = 42\) we have \(\textrm{ufp}(f) = 10\), \(\textrm{ulp}(f) = 0.1\) and \(\textrm{uls}(f) = 1\).
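All three units can be computed exactly from a mantissa-exponent pair; a Python sketch with integer arithmetic (exponent range ignored, function name ours) reproducing the example above:

```python
from fractions import Fraction

def units(m, e, p, beta=10):
    """For f = m * beta**e with integer 0 < m < beta**p, return
    (ufp, ulp, uls) as exact rationals; the exponent range is ignored."""
    assert 0 < m < beta ** p
    # ufp: shift the mantissa down to a single leading digit
    d, E = m, e
    while d >= beta:
        d //= beta; E += 1
    ufp = Fraction(beta) ** E
    # ulp: normalize f = M * beta**q with beta**(p-1) <= M <= beta**p - 1
    M, q = m, e
    while M < beta ** (p - 1):
        M *= beta; q -= 1
    ulp = Fraction(beta) ** q
    # uls: strip trailing zero digits of the mantissa
    s, u = m, e
    while s % beta == 0:
        s //= beta; u += 1
    uls = Fraction(beta) ** u
    return ufp, ulp, uls

# the example from the text: precision-3 decimal, f = 42
assert units(42, 0, p=3) == (10, Fraction(1, 10), 1)
```

For 42 written as 4.20e1, the trailing zero is not significant for uls, which is why the mantissa, not the normalized string, is stripped.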

Problem 1.3

Given a precision-p base-\(\beta \) arithmetic following IEEE 754, find a simple algorithm to compute the unit in the least significant place (uls) in any rounding mode.