1 Notation and introduction

We assume a precision-p binary floating-point arithmetic with rounding to nearest according to the IEEE 754 standard [10] to be given and denote the set of floating-point numbers by \(\mathbb {F}\). Then \(\mathbb {F}\) is symmetric, i.e., \(\mathbb {F}= -\mathbb {F}\), and there is a smallest and largest positive normalized floating-point number \(\textrm{realmin}\) and \(\textrm{realmax}\), respectively. We define \(\mathcal {P}:= [\textrm{realmin},\textrm{realmax}]\) and call \(\mathcal {N}:= -\mathcal {P} \cup \{0\} \cup \mathcal {P}\) the normalized range.

To be more precise, for “rounding to nearest” we assume RoundTiesToEven which means that a real number being the midpoint of two adjacent floating-point numbers is rounded to the one with even mantissa. Calling that rounding function \(\textrm{fl}: \mathbb {R}\rightarrow \mathbb {F}\) it follows that \(\textrm{fl}(a \circ b)\) is the floating-point result of \(a \circ b\) for \(a,b \in \mathbb {F}\) and \(\circ \in \{+,-,\times ,/\}\).

In the computation of the Euclidean norm of a vector, intermediate results may lie outside \(\mathcal {N}\) even though the final result is in \(\mathcal {N}\). That is taken care of by case distinctions and normalization, see [1, 3, 20]. We therefore assume throughout this note, without loss of generality, that neither over- nor underflow occurs, i.e., all intermediate results are in \(\mathcal {N}\).

For \(\textbf{u}:= 2^{-p}\) denoting the relative rounding error unit [7] the refined error estimate [4, 7, 23, 31]

$$\begin{aligned} x \in \mathcal {N}: \quad \max \{|x-f|: f \in \mathbb {F}\} \le \frac{\textbf{u}}{1+\textbf{u}} |x| \end{aligned}$$
(1)

holds true, and the same constant \(\frac{\textbf{u}}{1+\textbf{u}}\) bounds the relative error of every floating-point operation.

Many of our results are also true for a precision-p floating-point arithmetic with general base \(\beta \) and \(\textbf{u}= \frac{1}{2}\beta ^{1-p}\). Since we target MATLAB implementations, we restrict our attention to binary arithmetic.

Throughout this note \(\Vert \cdot \Vert \) denotes the Euclidean, i.e., \(\ell _2\)-norm. The result of a floating-point evaluation of an expression is denoted by \(\textrm{float}(\cdot )\), where parentheses are respected but otherwise any order of evaluation may be used. Hence \(\textrm{float}(a \circ b)=\textrm{fl}(a \circ b)\) is the floating-point result of \(a \circ b\) for \(a,b \in \mathbb {F}\) and \(\circ \in \{+,-,\times ,/\}\). For example, \(s:= \textrm{float}(\Vert x\Vert )\) denotes a floating-point approximation of the Euclidean norm of \(x \in \mathbb {F}^n\) using any order of summation. Standard error estimates [7] yield

$$\begin{aligned} |s - \Vert x\Vert | \le \gamma _{n+1} \Vert x\Vert \qquad \text{ where }\;\; \gamma _k:= \frac{k\textbf{u}}{1-k\textbf{u}} \end{aligned}$$

provided that \((n+1)\textbf{u}< 1\). In [11] we proved a refined estimate without restriction on n.

Lemma 1

Let \(x \in \mathbb {F}^n\) and \(s:= \textrm{float}(\sqrt{\sum _{i=1}^n x_i^2})\) computed in any order. Then

$$\begin{aligned} |s - \Vert x\Vert | \le \left( \frac{n}{2}+1\right) \textbf{u}\Vert x\Vert \end{aligned}$$

is true without restriction on \(n \in \mathbb {N}\).

The bound is basically sharp, but practical experience and probabilistic arguments [8, 9, 26] suggest that in practice the error for the Euclidean norm and for summation hardly exceeds \(\sqrt{n}\, \textbf{u}\Vert x\Vert \).

Recently there has been interest [6, 20] in algorithms computing a faithful approximation of the Euclidean norm, meaning that there is no other floating-point number between the computed and the true real result. Both approaches are based on error-free transformations and some kind of double-double arithmetic [2], where the latter was already considered in [5]. The computed result is thus equal to the rounded-to-nearest result or to one of its neighbors. If the true result is a floating-point number, that will be the result of the algorithms.

Both approaches [6, 20] are devoted to the computation of the Euclidean norm. In [17] we introduced a novel pair arithmetic cpair and proved sufficient conditions under which, for a general arithmetic expression composed of \(\{+,-,\times ,/,\sqrt{}\}\), the result computed using cpair is faithfully rounded. As a by-product it includes the Euclidean norm. One difference to the well-known double-double pair arithmetic [2], which is intrinsically used in [6, 20], is that a final error-free transformation is omitted. That speeds up the algorithms in cpair significantly. While there is not much penalty in the accuracy of the computed result, it bears the advantage that, in contrast to [2], the higher order part is equal to the ordinary floating-point result. In that sense cpair is a floating-point arithmetic together with an error term.

In this note we give some new algorithms for the computation of a faithful rounding of the Euclidean norm as well as of the rounded-to-nearest result. All algorithms are given in executable MATLAB code [21]. We take particular care to design fast algorithms that reduce the interpretation overhead. In particular, we avoid loops as they may slow down the performance significantly.

This note is organized as follows. The next section recalls error-free transformations and some improvements, together with error estimates ensuring a faithfully rounded result. In Sect. 3 a vectorized error-free vector transformation is given; we recall recent sharp error estimates for summation and present the first two of our new algorithms to approximate \(\Vert x\Vert \). Those are based on relative splittings and adapt methods presented in [25]. In Sect. 4 another two new algorithms are presented using absolute splitting as in [28, 29], again with sufficient conditions for a faithfully rounded result. In Sect. 5 an algorithm is presented computing the nearest approximation of \(\Vert x\Vert \) together with a proof of correctness. To our knowledge it is the first of its kind. The generation of ill-conditioned test examples, i.e., floating-point vectors x with \(\Vert x\Vert \) very close to a switching point, is addressed in Sect. 6. The note closes with computational results on the computing time and accuracy of all algorithms, and a conclusion.

2 Error-free transformations and previous algorithms

It has long been known [5, 13, 14, 22] that the sum and the product of two floating-point numbers can be expressed as the sum \(x+y\) of two floating-point numbers, and that x and y can be computed using a few pure floating-point operations. This was used implicitly by Neumaier, who wrote the remarkable paper [24] as a bachelor student; otherwise it was basically known to experts. The methods received wide attention when I coined the term “error-free transformations” in [25], with numerous papers following thereafter.

For this note we need only the error-free transformations for sum and product; for details of other error-free transformations see, e.g., [23]. Consider the two algorithms in Fig. 1.

Fig. 1: Error-free sum
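Since the code of Fig. 1 is not reproduced here, the following sketch shows the two classical algorithms of the figure. It is a reconstruction following [13, 22, 25] and need not be identical to the figure line by line:

function [x,y] = TwoSum(a,b)
  x = a + b;
  z = x - a;                   % from this row on all operations are error-free
  y = (a - (x-z)) + (b - z);   % a + b = x + y exactly

function [x,y] = FastTwoSum(a,b)   % requires |a| >= |b|
  x = a + b;
  y = (a - x) + b;                 % 3 instead of 6 operations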

We display all algorithms in executable MATLAB code; since some longer algorithms appear later, we decided to add line numbers. The following is true [18, 22, 23, 25].

Lemma 2

Let \(a,b \in \mathbb {F}\) be given and x, y be the result of Algorithm TwoSum applied to a, b. Then

$$\begin{aligned} x + y = a + b \quad \text{ and }\quad \textrm{fl}(x+y) = x. \end{aligned}$$
(2)

If \(|a|\ge |b|\), then (2) is also true for the result of Algorithm FastTwoSum.

The assumptions for Algorithm FastTwoSum can be weakened [18, 23], but we do not need that here. One might use Algorithm FastTwoSum together with an “if”-statement, thereby reducing the number of operations from 6 to 3; however, that is often slower [25] than applying Algorithm TwoSum.

The proof of correctness [23] relies on the fact that all operations from row 3 on are error-free, i.e., cannot cause a rounding error.

The key to the error-free transformation of multiplication is to split [5] both factors into a sum of two floating-point numbers such that the product of the addends does not cause a rounding error. The Algorithm Split for the binary64 format can be implemented as follows (Fig. 2).

Fig. 2: Error-free split of a floating-point number for 53-bit binary arithmetic
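Again the figure is not reproduced; the following is a sketch of Algorithm Split for binary64, a reconstruction of Veltkamp's splitting [5], vectorized with .* so that it applies to vector input as well:

function [x,y] = Split(a)
  factor = 134217729;     % 2^27+1, the factor referred to in line 2
  c = factor.*a;
  x = c - (c - a);        % high part with at most 26 significant bits
  y = a - x;              % low part: a = x + y exactly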

In precision-p the factor in line 2 is to be replaced [5] by \(2^{\lceil p/2 \rceil }+1\). For various splitting methods and many details see [12]. For the calculation of the Euclidean norm we need only squares, so we add a specialized method for that (Fig. 3).

Fig. 3: Error-free product and square
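A sketch of the two algorithms of Fig. 3 follows; it is a reconstruction based on the classical formulas [14, 23], using the notation Aa, Bb for the inputs as explained below:

function [P,p] = TwoProduct(Aa,Bb)
  P = Aa.*Bb;
  [A,a] = Split(Aa);                        % line 3: A + a = Aa
  [B,b] = Split(Bb);
  p = a.*b - (((P - A.*B) - a.*B) - A.*b);  % Aa.*Bb = P + p exactly

function [P,p] = TwoSquare(Aa)
  P = Aa.*Aa;
  [A,a] = Split(Aa);
  p = a.*a - ((P - A.*A) - 2*(A.*a));       % multiplication by 2 is error-free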

Since the input is split into two parts we use, for example, the notation Aa to indicate that A+a = Aa in line 3, and similarly for Bb in Algorithm TwoProduct.

Lemma 3

Let \(a,b \in \mathbb {F}^n\) be given and \(P,p \in \mathbb {F}^n\) be the results of Algorithm TwoProduct applied to a, b. Then

$$\begin{aligned} P_i + p_i = a_i \cdot b_i \quad \text{ for } \text{ all }\;\; i \in \{1,\ldots ,n\}. \end{aligned}$$
(3)

In binary arithmetic the results \(P,p \in \mathbb {F}^n\) of Algorithm TwoSquare applied to \(a \in \mathbb {F}^n\) satisfy

$$\begin{aligned} P_i + p_i = a_i^2 \quad \text{ for } \text{ all }\;\; i \in \{1,\ldots ,n\}. \end{aligned}$$
(4)

Furthermore, for both algorithms \(|p_i| \le \textbf{u}|P_i|\) for all \(i \in \{1,\ldots ,n\}\).

Proof

The first result (3) is well-known [14, 23]; the proof relies on the fact that all operations in lines 3–5 do not cause a rounding error. That proves (4) as well because multiplication by 2 is error-free. The last estimate is a well-known property [23] of Algorithm TwoProduct. \(\square \)

For given \(x \in \mathbb {F}^n\), previous approaches [6, 20] borrow from the double-double pair arithmetic [2] to calculate a pair (Tt) such that \(T+t\) is an accurate approximation of the sum of squares \(\sum _{i=1}^{n} x_i^2\). Another candidate for a pair arithmetic is the cpair arithmetic [17]. Both are implemented as toolboxes dd and cpair in INTLAB [27], the MATLAB toolbox for Reliable Computing.

In [6] TwoProduct is used to compute a pair approximation for \(x_i^2\); in [20] FMA instructions are used. While the FMA is part of the floating-point standard [10] and implemented on many computers, it is not available in MATLAB. Therefore some of our algorithms in this note avoid it.

Given (T, t) it remains to compute a good floating-point approximation of \(\sqrt{T+t}\). In [6] just sqrt(T) is used, ignoring the lower order part t. In [20] the algorithm of our cpair arithmetic [17] is used, adapted to one output \(P+p\) rather than the pair (P, p) (Fig. 4).

Fig. 4: Accurate square root of \(T+t\) for a given pair (T, t)
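Fig. 4 is not reproduced; the following sketch shows the principle of Algorithm AccSqrt, namely one correction step for \(\textrm{fl}(\sqrt{T})\) based on an error-free square. It is a reconstruction of the method in [17, 20] and may differ from the published code in detail:

function res = AccSqrt(T,t)
  P = sqrt(T);
  [H,h] = TwoSquare(P);              % P^2 = H + h, error-free
  p = (((T - H) - h) + t)/(2*P);     % correction (T + t - P^2)/(2P)
  res = P + p;                       % res = fl(P+p), cf. Lemma 4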

If (T, t) is such that the correction t is below the last bit of T, i.e., \(\textrm{fl}(T+t) = T\), then the result of AccSqrt is almost always equal to \(\sqrt{T}\), at most one bit apart. In [20, Theorem 3.6] the following error estimate is proved.

Lemma 4

Let \(T,t \in \mathbb {F}\) be such that \(\textrm{fl}(T+t) = T\), and assume \(\textbf{u}\le 2^{-5}\). Let P, p be the final values in Algorithm AccSqrt when applied to the pair [T, t]. Then

$$\begin{aligned} |P + p - \sqrt{T+t}| \le \frac{25}{8} \textbf{u}^2 \sqrt{T+t}. \end{aligned}$$

The theorem estimates the error of \(P+p\) rather than that of \(\textrm{fl}(P+p)\), otherwise the additional rounding error \(\textbf{u}\) would spoil the result.

Now we can display the Algorithms normG from [6] and normL from [20]. Recall that the latter uses an FMA instruction to calculate P, p in line 2, that is, \(\texttt {P(i)} = \texttt {x(i)}.^{\wedge }{} \texttt {2}\) and \(p(i) = \textrm{FMA}(x(i),x(i),-P(i))\) inside the loop. Since the FMA instruction is not available in MATLAB, we replaced the computation in the loop by [P,p] = TwoSquare(x), splitting the whole vector x without a loop. In that respect later time comparisons may be more fair (Fig. 5).

Fig. 5: Algorithms by Graillat et al. [6] and Lefèvre et al. [20]
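Fig. 5 is not reproduced; the following schematic shows the common structure of both algorithms as we use them in our tests. It is a reconstruction of the principle only, not the authors' exact code; the summation cascades in [6] and [20] differ in detail:

function res = normG(x)              % schematic reconstruction
  [P,p] = TwoSquare(x);              % replaces the FMA loop, see above
  S = P(1); s = p(1);
  for i=2:length(x)
    [S,e] = TwoSum(S,P(i));          % compensated summation of the P(i)
    s = s + (e + p(i));              % collect all lower order parts
  end
  res = sqrt(S);                     % normL instead: res = AccSqrt(S,s)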

The summation scheme in Algorithms normG and normL is slightly different, but the main improvement is in the last line: Algorithm normG ignores the lower order part, whereas normL uses our Algorithm AccSqrt in Fig. 4 to compute the square root approximation of the pair \(S+s\). As we will see in Sect. 7, that often produces a nearest rounding.

An alternative is to use the double-double and the cpair toolbox directly (Fig. 6):

Fig. 6: Algorithms using double-double and cpair arithmetic

The goal is to guarantee a faithfully rounded approximation of \(\Vert x\Vert \) or even the rounded-to-nearest result. In [6] it is proved that, if computed in binary64, the result is a faithfully rounded approximation of \(\Vert x\Vert \) if \(n < (24\textbf{u}+\textbf{u}^2)^{-1}\), corresponding to \(n \lesssim 3.7 \cdot 10^{14}\). Our cpair arithmetic comes with similar conditions for general arithmetic expressions. Applied to the Euclidean norm, the result is faithful for \(n \le (\beta \textbf{u})^{-\frac{1}{2}}\) when using base-\(\beta \) arithmetic, corresponding to \(n \lesssim 8.3 \cdot 10^{7}\) in binary64. In [20] we did not find an explicit limit for n, but the error estimates suggest that it should be a little larger than that for Algorithm normG.

In order to prove a faithful rounding for our algorithms to be presented we use the following criterion [17, Lemma 5.3]. That is a specialized version; the original allows for a much more general computer arithmetic.

Lemma 5

Let \(r,\delta \in \mathbb {R}\) and assume \(|\delta |<\frac{\textbf{u}}{2-\textbf{u}}|r|\). Then \(\textrm{fl}(r)\) is a faithful rounding of \(r+\delta \).

In a typical application a pair (Tt) with \(r:= T+t\) is an approximation to some real quantity q. If \(|r-q| < \frac{\textbf{u}}{2-\textbf{u}}|r|\), then \(\textrm{fl}(r)\) is a faithful rounding of q. An application is the following criterion that \(\textrm{fl}(T+t)\) is a faithful rounding of \(q:= \sqrt{x}\).

Lemma 6

Let \(T,t \in \mathbb {F}\) with \(T+t>0\) be given, and let \(0 \le q \in \mathbb {R}\). Assume

$$\begin{aligned} |T+t-q^2| \le \alpha q^2 \end{aligned}$$
(5)

for some \(\alpha \in \mathbb {R}\) with \(\alpha < 1\). Let \(r \in \mathbb {R}\) be such that

$$\begin{aligned} |r-\sqrt{T+t}| \le \beta \sqrt{T+t} \end{aligned}$$
(6)

for some \(\beta < 1\). Then

$$\begin{aligned} \left( 1-\beta \right) ^{-1}\left( \beta + \frac{\alpha }{2(1-\alpha )}\right) < \frac{\textbf{u}}{2-\textbf{u}} \end{aligned}$$
(7)

implies that \(\textrm{fl}(r)\) is a faithful rounding of q.

Proof

Note that

$$\begin{aligned} |\sqrt{x}-\sqrt{y}| = \frac{|x-y|}{\sqrt{x}+\sqrt{y}} \end{aligned}$$

for positive \(x,y \in \mathbb {R}\). We show

$$\begin{aligned} |\sqrt{T+t} - q| \le \frac{\alpha }{2(1-\alpha )} \sqrt{T+t}. \end{aligned}$$
(8)

We distinguish two cases. First, suppose \(T+t \le q^2\). Then (5) gives

$$\begin{aligned} |\sqrt{T+t} - q| = \frac{|T+t-q^2|}{\sqrt{T+t}+q} \le \frac{\alpha q^2}{2\sqrt{T+t}} \le \frac{\alpha }{2(1-\alpha )} \sqrt{T+t}. \end{aligned}$$

Second, suppose \(T+t > q^2\). Then using again \(q^2 \le \frac{T+t}{1-\alpha }\) and (5) give

$$\begin{aligned} |\sqrt{T+t} - q| \le \frac{\alpha q^2}{2q} \le \frac{\alpha }{2\sqrt{1-\alpha }} \sqrt{T+t} \le \frac{\alpha }{2(1-\alpha )} \sqrt{T+t} \end{aligned}$$

which proves (8). Hence (6) yields

$$\begin{aligned} |r-q| \le |r-\sqrt{T+t}| + |\sqrt{T+t}-q| \le \left( \beta + \frac{\alpha }{2(1-\alpha )} \right) \sqrt{T+t}, \end{aligned}$$

and \(r \ge \left( 1-\beta \right) \sqrt{T+t}\) together with Lemma 5 implies the result. \(\square \)

In our applications \(\alpha \) is very small. A sufficient criterion that AccSqrt(T,t) is a faithful rounding of \(\sqrt{x}\) follows.

Corollary 1

Let \(T,t \in \mathbb {F}\) with \(\textrm{fl}(T+t)=T\), assume \(\textbf{u}\le 2^{-8}\), and let res be the result of Algorithm AccSqrt applied to [T, t]. If \(0 \le x \in \mathbb {R}\) satisfies

$$\begin{aligned} |T+t-x| \le \frac{31}{32}\textbf{u}x, \end{aligned}$$

then res is a faithful rounding of \(\sqrt{x}\).

Proof

Let [P, p] be the final values in Algorithm AccSqrt when applied to the pair [T, t], so that \(res = \textrm{fl}(P+p)\). Lemma 4 shows that (6) is true for \(r:= P+p\) and \(\beta := \frac{25}{8}\textbf{u}^2\). Moreover, (5) is true by assumption for \(\alpha := \frac{31}{32}\textbf{u}\) and \(q:= \sqrt{x}\). Hence (7) is true if, and only if,

$$\begin{aligned} d:= (2-\textbf{u})(2(1-\alpha )\beta +\alpha ) - 2(1-\alpha )(1-\beta )\textbf{u}< 0. \end{aligned}$$

Using \(\textbf{u}\le 2^{-8}\) yields \(64d = -4\textbf{u}+ 862\textbf{u}^2-775\textbf{u}^3 \le \left( -4 + 862 \cdot 2^{-8}\right) \textbf{u}<0\), and Lemma 6 finishes the proof. \(\square \)

3 Faithful rounding of \(\Vert x\Vert \) based on relative splitting of x

The Algorithms TwoProduct and TwoSquare as in Fig. 3 apply to vector input as well. As a consequence we obtain the following lemma.

Lemma 7

For \(a,b \in \mathbb {F}^n\) the output [P, p] of Algorithm TwoProduct as in Fig. 3 applied to a, b satisfies \(\sum _{i=1}^n (P_i+p_i) = \sum _{i=1}^n a_ib_i\), and the output [P, p] of Algorithm TwoSquare applied to a satisfies \(\sum _{i=1}^n (P_i+p_i) = \sum _{i=1}^n a_i^2\). Furthermore, \(P_i \ge 0\) and \(|p_i| \le \textbf{u}P_i\) for all \(i \in \{1,\ldots ,n\}\).

Thus one way to approximate \(\Vert x\Vert \) is to compute vectors \(P,p \in \mathbb {F}^n\) with \(P+p = \Vert x\Vert ^2\) and apply some accurate summation algorithm. Both [6] and [20] follow that scheme. Note that the vectors P, p are computed based on a relative splitting of x; later we will use an absolute splitting.

In [25] efficient summation algorithms are developed based on TwoSum. First, q = VecSum(p) transforms an input vector p into a vector q without changing its sum S but with the property that \(q_{1\ldots n-1}\) is small in absolute value and \(q_n = \textrm{float}(\sum _{i=1}^n p_i)\). The error estimates in [25] imply that \(res = \textrm{float}(\sum _{i=1}^n q_i)\) is a very good approximation of the true sum S.

Before continuing, we need to estimate the error of ordinary floating-point summation. To that end the traditional Wilkinson-type estimate \(\gamma _{n-1}\) can be used. However, new and optimal bounds are available. The following sharp bound was shown in [16, Theorem 5].

Lemma 8

For \(p \in \mathbb {F}^n\) denote \(S = \textrm{float}(\sum _{i=1}^n p_i)\) for summation in any order, and denote by \(\delta _i\) the errors in the \(n-1\) nodes of the evaluation tree. Hence \(\sum _{i=1}^n p_i = S + \sum _{i=1}^{n-1} \delta _i\). Suppose \(n \le 1+\frac{1}{2} \textbf{u}^{-1}\). Then

$$\begin{aligned} \left| S - \sum _{i=1}^n p_i\right| \le \sum _{i=1}^{n-1} |\delta _i| \le \varphi _{n-1} \sum _{i=1}^n |p_i| \qquad \text{ with }\quad \varphi _k:= \frac{k \textbf{u}}{1+k\textbf{u}}, \end{aligned}$$
(9)

and that bound is sharp, as shown by the input vector \(p = (1,\textbf{u},\ldots ,\textbf{u})^T\).

The Algorithm VecSum is realized by a loop in [25]. In MATLAB we face some interpretation overhead, so loops should be avoided where possible. That has been done in TwoSquare, and next we give a new, loop-free version of VecSum, see Fig. 7.

Fig. 7: Error-free vector transformation
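Fig. 7 is not reproduced; the following is a sketch of a loop-free realization based on cumsum, assuming (as is true in MATLAB for vectors) that cumsum accumulates sequentially. The output matches Lemma 9, i.e., S together with \(s \in \mathbb {F}^{n-1}\):

function [S,s] = FastVecSum(p)
  % loop version (VecSum in [25]): for i=2:n, [p(i),p(i-1)] = TwoSum(p(i),p(i-1)); end
  q = cumsum(p);                % q(i) = fl(q(i-1) + p(i))
  a = q(1:end-1);               % vectorized TwoSum of a and b
  b = p(2:end);
  x = q(2:end);                 % x = fl(a + b)
  z = x - a;
  s = (a - (x-z)) + (b - z);    % error terms: sum(p) = q(end) + sum(s)
  S = q(end);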

It is easily verified that Algorithms VecSum and FastVecSum produce identical results. The error analysis follows by Lemma 8.

Lemma 9

For given \(p \in \mathbb {F}^n\) let [Ss] be the output of Algorithm FastVecSum. Suppose \(n \le 1+\frac{1}{2} \textbf{u}^{-1}\). Then \(s \in \mathbb {F}^{n-1}\), \(\sum _{i=1}^n p_i = S + \sum _{i=1}^{n-1} s_i\) and

$$\begin{aligned} \sum _{i=1}^{n-1} |s_i| \le \varphi _{n-1} \sum _{i=1}^n |p_i| \qquad \text{ with }\quad \varphi _k:= \frac{k \textbf{u}}{1+k\textbf{u}}, \end{aligned}$$
(10)

and that bound is sharp, as shown by the input vector \(p = (1,\textbf{u},\ldots ,\textbf{u})^T\).

We mention that (10) is true [15, Theorem 2.1] without restriction on n when replacing \(\varphi _k\) by \(\frac{k \textbf{u}}{1+\textbf{u}}\).

Our first algorithm is based on Algorithm Sum2 in [25], which in turn is equivalent to Algorithm IV in [24] (Fig. 8).

Fig. 8: Algorithm normSum2
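Fig. 8 is not reproduced; following the proof of Theorem 1 below, in particular \(T + t = S + \textrm{fl}(\sigma _s+\sigma _p)\) computed in line 4, Algorithm normSum2 may be sketched as follows (a reconstruction):

function res = normSum2(x)
  [P,p] = TwoSquare(x);                 % x.^2 = P + p elementwise
  [S,s] = FastVecSum(P);                % sum(P) = S + sum(s), error-free
  [T,t] = FastTwoSum(S,sum(s)+sum(p));  % line 4: T + t = S + fl(sigma_s+sigma_p)
  res = AccSqrt(T,t);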

Theorem 1

Let res be the result of Algorithm normSum2 applied to \(x \in \mathbb {F}^n\). Suppose \(n \le \sqrt{\frac{31}{32}}\textbf{u}^{-1/2}\) and \(\textbf{u}\le 2^{-8}\). Then res is a faithful rounding of \(\Vert x\Vert \).

Proof

We will prove

$$\begin{aligned} \left| T + t - \sum _{i=1}^{n} x_i^2\right| \le \frac{31}{32}\textbf{u}\sum _{i=1}^{n} x_i^2 \end{aligned}$$
(11)

for the scalars [T, t] computed in line 4 of Algorithm normSum2 in order to apply Corollary 1. We know

$$\begin{aligned} \sum _{i=1}^n x_i^2 = \sum _{i=1}^n (P_i + p_i) \quad \text{ and }\quad |p_i| \le \textbf{u}P_i\quad \text{ for } \text{ all }\; i \in \{1,\ldots ,n\} \end{aligned}$$
(12)

by Lemma 7, so that Lemma 9 implies

$$\begin{aligned} \sum _{i=1}^n P_i = S + \sum _{i=1}^{n-1} s_i \quad \text{ and }\quad \sum _{i=1}^{n-1} |s_i| \le \varphi _{n-1} \sum _{i=1}^n P_i. \end{aligned}$$

Denote the floating-point sum sum(s) by \(\sigma _s\), and correspondingly the floating-point sum sum(p) by \(\sigma _p\). Note that \(s \in \mathbb {F}^{n-1}\) and \(p \in \mathbb {F}^n\). Then Lemma 8 gives

$$\begin{aligned} \left| \sigma _s - \sum _{i=1}^{n-1} s_i\right| \le \varphi _{n-2} \sum _{i=1}^{n-1} |s_i| \le \varphi _{n-2}\varphi _{n-1} \sum _{i=1}^{n} P_i \end{aligned}$$

and

$$\begin{aligned} \left| \sigma _p - \sum _{i=1}^{n} p_i\right| \le \varphi _{n-1} \sum _{i=1}^{n} |p_i| \le \varphi _{n-1}\textbf{u}\sum _{i=1}^{n} P_i. \end{aligned}$$

Furthermore, \(T + t = S + \textrm{fl}(\sigma _s+\sigma _p)\). Hence, using \(S + \sum _{i=1}^{n-1} s_i + \sum _{i=1}^{n} p_i = \sum _{i=1}^{n} x_i^2\),

$$\begin{aligned} \begin{array}{rcl} \left| T + t - \sum _{i=1}^{n} x_i^2\right| &{}\le &{} |S + \sigma _s+\sigma _p - \sum _{i=1}^{n} x_i^2| + \textbf{u}|\sigma _s+\sigma _p| \\ &{}\le &{} \left( \varphi _{n-2}\varphi _{n-1}+\varphi _{n-1}\textbf{u}\right) \sum _{i=1}^{n} P_i + \textbf{u}|\sigma _s+\sigma _p|. \end{array} \end{aligned}$$

Hence

$$\begin{aligned} |\sigma _s| \le \left( 1+\varphi _{n-2}\right) \sum _{i=1}^{n-1} |s_i| \le \left( 1+\varphi _{n-2}\right) \varphi _{n-1} \sum _{i=1}^{n} P_i \end{aligned}$$

and

$$\begin{aligned} |\sigma _p| \le \left( 1+\varphi _{n-1}\right) \sum _{i=1}^{n} |p_i| \le \left( 1+\varphi _{n-1}\right) \textbf{u}\sum _{i=1}^{n} P_i \end{aligned}$$

and a calculation shows

$$\begin{aligned} \begin{array} {rcl} \left| T + t - \sum _{i=1}^{n} x_i^2\right| &{}\le &{} \left( \varphi _{n-1}\left( \varphi _{n-2}+\textbf{u}+\textbf{u}+\varphi _{n-2}\textbf{u}+\textbf{u}^2 \right) + \textbf{u}^2 \right) \sum _{i=1}^{n} P_i\\ &{}=&{} \left( \varphi _{n-1}\left( (1+\textbf{u})\varphi _{n-2}+2\textbf{u}+\textbf{u}^2 \right) + \textbf{u}^2 \right) \sum _{i=1}^{n} P_i\\ &{}\le &{} \left( \varphi _{n-1}\left( n+\textbf{u}\right) \textbf{u}+ \textbf{u}^2\right) \sum _{i=1}^{n} P_i \\ &{}\le &{} (n^2 - n + 1)\textbf{u}^2 \sum _{i=1}^{n} P_i. \end{array} \end{aligned}$$

For \(n=1\) the left hand side in (11) is zero and the result is faithful by Corollary 1. For \(n \ge 2\), we use (12) to see

$$\begin{aligned} \sum _{i=1}^n P_i = \left| \sum _{i=1}^n (x_i^2-p_i)\right| \le \sum _{i=1}^n x_i^2 + \sum _{i=1}^n |p_i| \le \sum _{i=1}^n x_i^2 + \textbf{u}\sum _{i=1}^n P_i, \end{aligned}$$
(13)

and again by Corollary 1 the result is faithful if \(32(n^2-n+1)\textbf{u}\le 31(1-\textbf{u})\), and a computation shows that this is true because \(n \le \sqrt{\frac{31}{32}}\textbf{u}^{-1/2}\). \(\square \)

Algorithm VecSum is an error-free vector transformation, so as in [25] we may apply it a second time, thus further diminishing the condition number of the sum (Fig. 9).

Fig. 9: Algorithm normSum3
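Again the figure is not reproduced; following the proof of Theorem 2 below, Algorithm normSum3 applies the error-free vector transformation a second time (a reconstruction):

function res = normSum3(x)
  [P,p] = TwoSquare(x);                 % x.^2 = P + p elementwise
  [Q,q] = FastVecSum(P);                % sum(P) = Q + sum(q)
  [S,s] = FastVecSum([p(:);q(:)]);      % second transformation, s in F^(2n-2)
  [T,t] = FastTwoSum(Q,S+sum(s));       % T + t = Q + fl(S + sigma_s)
  res = AccSqrt(T,t);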

Theorem 2

Let \(x \in \mathbb {F}^n\) be given and apply Algorithm normSum3 to x. Suppose \(n \le (\frac{17}{4}\textbf{u}^2)^{-1/3}\) and \(\textbf{u}\le 2^{-8}\). Then res is a faithful rounding of \(\Vert x\Vert \).

Proof

We proceed as in the proof of Theorem 1 and show that the scalars [Tt] in Algorithm normSum3 satisfy

$$\begin{aligned} \left| T + t - \sum _{i=1}^{n} x_i^2\right| \le \frac{31}{32}\textbf{u}\sum _{i=1}^{n} x_i^2. \end{aligned}$$
(14)

The quantities in Algorithm normSum3 are scalars Q, S, T and t as well as vectors \(P, p \in \mathbb {F}^{n}\), \(q \in \mathbb {F}^{n-1}\) and \(s \in \mathbb {F}^{2n-2}\). As before \(\sum _{i=1}^n x_i^2 = \sum _{i=1}^n (P_i + p_i)\) with \(|p_i| \le \textbf{u}P_i\) for all \(i \in \{1,\ldots ,n\}\). Furthermore, Lemma 9 implies \(\sum _{i=1}^n P_i = Q + \sum _{i=1}^{n-1} q_i\) and \(\sum _{i=1}^{n} p_i + \sum _{i=1}^{n-1} q_i = S + \sum _{i=1}^{2n-2} s_i\) as well as \(\sum _{i=1}^{n-1} |q_i| \le \varphi _{n-1} \sum _{i=1}^n P_i\) and \(\sum _{i=1}^{2n-2} |s_i| \le \varphi _{2n-2} \left( \sum _{i=1}^{n} |p_i|+\sum _{i=1}^{n-1} |q_i|\right) \). Denote the floating-point sum sum(s) by \(\sigma _s\). Then

$$\begin{aligned} \left| \sigma _s - \sum _{i=1}^{2n-2} s_i\right| \le \varphi _{2n-3} \sum _{i=1}^{2n-2} |s_i|. \end{aligned}$$

Furthermore, \(T + t = Q + \textrm{fl}(S + \sigma _s)\). Hence

$$\begin{aligned} \begin{array}{rcl} \left| T + t - \sum \nolimits _{i=1}^{n} x_i^2\right| &{}\le &{} |Q + S + \sigma _s - \sum \nolimits _{i=1}^{n} x_i^2| + \textbf{u}|S + \sigma _s|\\ {} &{}=&{} |\sigma _s - \sum \nolimits _{i=1}^{2n-2} s_i | + \textbf{u}|S + \sum \nolimits _{i=1}^{2n-2} s_i + \sigma _s - \sum \nolimits _{i=1}^{2n-2} s_i|\\ {} &{}\le &{} (1+\textbf{u})|\sigma _s-\sum \nolimits _{i=1}^{2n-2} s_i| + \textbf{u}|\sum \nolimits _{i=1}^{n} p_i + \sum \nolimits _{i=1}^{n-1} q_i |\\ {} &{}\le &{} \varphi _{2n-3}(1+\textbf{u}) \sum \nolimits _{i=1}^{2n-2} |s_i| + \textbf{u}|\sum \nolimits _{i=1}^{n} p_i + \sum \nolimits _{i=1}^{n-1} q_i |\\ {} &{}\le &{} \left( \varphi _{2n-3}\varphi _{2n-2}(1+\textbf{u}) + \textbf{u}\right) \left( \sum \nolimits _{i=1}^{n} |p_i|+\sum \nolimits _{i=1}^{n-1} |q_i|\right) \\ &{}\le &{} \left( \varphi _{2n-3}\varphi _{2n-2}(1+\textbf{u}) + \textbf{u}\right) \left( \textbf{u}+ \varphi _{n-1}\right) \sum \nolimits _{i=1}^{n} P_i \\ {} &{}=:&{} \varPhi \sum \nolimits _{i=1}^{n} P_i \\ \end{array} \end{aligned}$$

and using (13) yields

$$\begin{aligned} \left| T + t - \sum _{i=1}^{n} x_i^2\right| \le (1-\textbf{u})^{-1} \varPhi \sum _{i=1}^{n} x_i^2. \end{aligned}$$

The factor \(\varPhi \) is monotonically increasing in n. A direct computation for \(\textbf{u}\in \{2^{-e}: 8 \le e \le 53 \}\) and the maximal value \(n:= \lfloor (\frac{17}{4}\textbf{u}^2)^{-1/3} \rfloor \) verifies

$$\begin{aligned} (1-\textbf{u})^{-1} \varPhi \le \frac{31}{32}\textbf{u}. \end{aligned}$$

Hence (14) is true and Corollary 1 finishes the proof. \(\square \)

The error of floating-point summation in Lemma 9 is sharp but, as mentioned after Lemma 1, highly overestimated in practice: we hardly find cases with relative error exceeding \(\sqrt{n}\textbf{u}\) unless we look for them. In particular it seems unlikely that the worst case bound (10) is attained for all summations in Algorithms normSum2 or normSum3.

Theorems 1 and 2 prove that Algorithms normSum2 and normSum3 compute a faithfully rounded result if the vector length n satisfies \(\frac{32}{31}n^2 \textbf{u}\le 1\) or \(4n^3 \textbf{u}^2 \le 1\), respectively. These are sufficient criteria, but in practice the results are faithful for much larger n.

A rough estimate of this limit under practical assumptions, i.e., when replacing \(\varphi _k\) in Lemma 8 by \(\sqrt{k}\textbf{u}\), suggests a faithfully rounded result for \(n \lesssim \textbf{u}^{-1}\) for Algorithm normSum2 and \(n \lesssim \frac{1}{4}\textbf{u}^{-4/3}\) for Algorithm normSum3. In other words, in practical applications it suffices to use Algorithm normSum2, and we can always expect a faithfully rounded result.

4 Faithful rounding of \(\Vert x\Vert \) based on absolute splitting of x

The error-free transformation of \(\Vert x\Vert ^2\) into (Pp) with \(P+p=\Vert x\Vert ^2\), as used in [6] and [20] and our algorithms up to now, is based on a relative splitting of the \(x_i\), i.e., each \(x_i\) is transformed into a sum of two floating-point numbers.

Once the vectors P, p are available, any good summation algorithm may be applied. An alternative to a relative splitting of the \(x_i\) was first proposed in [32]. A constant \(\sigma \) larger in absolute value than all summands is chosen. The split of the vector x into a pair of vectors (r, s) with respect to \(\sigma \) is such that \(x = r+s\), all bits of r reside in the same range, and sum(r) is error-free. The same principle can be applied successively.

In [32] the splitting was performed using scaling and integer rounding, and no analysis was given. In [28] we pursued that principle in Algorithm AccSum with an efficient implementation and thorough error analysis. Based on that Algorithm AccDot is presented in [28] for the accurate computation of a dot product \(x^Ty\). Basically, it first splits \(x^Ty = r+s\) as in TwoProduct and then applies AccSum. That algorithm can be used for \(\Vert x\Vert \) as well.

In the following we split the input vector x into vectors q, b directly such that sum(q.*q) is error-free. That avoids the costly splitting \(\Vert x\Vert ^2 = P + p\) by Algorithm TwoSquare. The Algorithm normExtract is presented in Fig. 10. Note that M, in contrast to Algorithm AccSum in [28], is not a power of 2.

Fig. 10: Algorithm normExtract
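Fig. 10 is not reproduced; based on the proof of Theorem 3 below, Algorithm normExtract may be sketched as follows (a reconstruction; the line numbers referred to in the proof are indicative):

function res = normExtract(x)
  x = abs(x(:));                          % line 2
  n = length(x); u = 2^(-53);
  M = (4/(2-(n+8)*u)/sqrt(u))*norm(x);    % splitting constant, not a power of 2
  q = (M + x) - M;                        % line 5: high parts, sum of q.^2 error-free
  b = x - q;                              % line 6: low parts, x = q + b exactly
  S = sum(q.*q);                          % error-free by construction
  s = sum((q+x).*b);                      % x.^2 - q.^2 = (q+x).*b
  [T,t] = FastTwoSum(S,s);
  res = AccSqrt(T,t);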

As for Algorithm normSum2, the bound on the dimension n guaranteeing that the approximation res is a faithful rounding of \(\Vert x\Vert \) is very conservative. We present this algorithm because it is very fast and, as explained at the end of the previous section, we can expect a faithful result up to \(n \lesssim 79\) million. That may be sufficient in most practical applications.

To that end we need “ufp” as introduced in [28], the unit in the first place

$$\begin{aligned} 0 \ne r \in \mathbb {R}\quad \Rightarrow \quad \textrm{ufp}(r):= 2^{\lfloor \log _2|r| \rfloor } \end{aligned}$$

and \(\textrm{ufp}(0):=0\). Compared to the often used “ulp”, the unit in the last place, it bears the advantage that it is independent of a floating-point format and applies to real numbers as well. The following properties are proved in [28]. For \(\sigma =2^k, k\in \mathbb {Z}, r\in \mathbb {R}\) we have

$$\begin{aligned} r \ne 0 \quad \Rightarrow \quad \textrm{ufp}(r) \le |r| < 2\,\textrm{ufp}(r) \end{aligned}$$
(15)
$$\begin{aligned} \sigma '=2^m, \; m\in \mathbb {Z}, \; \sigma '\ge \sigma \quad \Rightarrow \quad \textbf{u}\sigma '\mathbb {Z}\subseteq \textbf{u}\sigma \mathbb {Z} \end{aligned}$$
(16)
$$\begin{aligned} f \in \mathbb {F}\; \text{ and } \; |f|\ge \sigma \quad \Rightarrow \quad \textrm{ufp}(f)\ge \sigma \end{aligned}$$
(17)
$$\begin{aligned} f \in \mathbb {F}\quad \Rightarrow \quad f \in 2\textbf{u}\cdot \textrm{ufp}(f) \mathbb {Z} \end{aligned}$$
(18)
$$\begin{aligned} r\in \textbf{u}\sigma \mathbb {Z}\cap \mathcal {N}, \; |r|\le \sigma \quad \Rightarrow \quad r\in \mathbb {F} \end{aligned}$$
(19)
$$\begin{aligned} a,b\in \mathbb {F}, \; a\ne 0 \quad \Rightarrow \quad \textrm{fl}(a+b)\in \textbf{u}\cdot \textrm{ufp}(a)\mathbb {Z} \end{aligned}$$
(20)
$$\begin{aligned} \tilde{r}:=\textrm{fl}(r)\in \mathbb {F}\quad \Rightarrow \quad |\tilde{r}-r| \le \textbf{u}\cdot \textrm{ufp}(r) \le \textbf{u}\cdot \textrm{ufp}(\tilde{r}). \end{aligned}$$
(21)

Note that, if \(b \ne 0\), \(\textrm{fl}(a+b)\in \textbf{u}\cdot \textrm{ufp}(b)\mathbb {Z}\) in (20) holds as well.

Theorem 3

Let \(x \in \mathbb {F}^n\) be given and apply Algorithm normExtract to x. Suppose \(n \le \frac{11}{59}\textbf{u}^{-1/3}\) and \(\textbf{u}\le 2^{-8}\). Then res is a faithful rounding of \(\Vert x\Vert \).

Proof

As in the previous proofs we will show that the scalars [T, t] in Algorithm normExtract satisfy

$$\begin{aligned} \left| T + t - \sum _{i=1}^{n} x_i^2\right| \le \frac{31}{32}\textbf{u}\sum _{i=1}^{n} x_i^2. \end{aligned}$$
(22)

Henceforth we assume \(x_i \ge 0\) as justified by line 2 of Algorithm normExtract. Denote by \(\hat{x}:= \texttt {norm(x)}\) MATLAB’s floating-point approximation to \(\Vert x\Vert \). Then Lemma 1 with \(\beta = (\frac{n}{2}+1)\textbf{u}\) shows

$$\begin{aligned} (1-\beta )\Vert x\Vert \le \hat{x} \le (1+\beta )\Vert x\Vert . \end{aligned}$$
(23)

Note that \(4(n+2)\textbf{u}\le 16n\textbf{u}\le 1\) implies \(1-2(n+2)\textbf{u}\ge 1/2\), so that \(\textrm{float}(1-2(n+2)\textbf{u}) = 1-2(n+2)\textbf{u}\) by Sterbenz’ lemma [31], and a calculation using (1) yields for all \(i \in \{1,\ldots ,n\}\)

$$\begin{aligned} \begin{array} {rcl} M = \textrm{float}(\varphi \hat{x}) &{}\ge &{} \frac{4\hat{x}}{(1+\textbf{u})^3(2-(n+8)\textbf{u})\sqrt{\textbf{u}}} \ge \frac{4\hat{x}}{(2-(n+2)\textbf{u})\sqrt{\textbf{u}}} = \frac{2\hat{x}}{(1-\beta )\sqrt{\textbf{u}}} \\ &{}\ge &{} 2\Vert x\Vert /\sqrt{\textbf{u}} \\ {} &{}\ge &{} 32\Vert x\Vert \ge x_i, \end{array} \end{aligned}$$
(24)

where \(\varphi := 4/(2-(n+8)\textbf{u})/\sqrt{\textbf{u}}\). Lines 5 and 6 of Algorithm normExtract are similar to Algorithm FastTwoSum in Fig. 1. More precisely, the code for FastTwoSum(M,x) is identical to

N = M + x;
qs = M - N;
b = qs + x;

where q in line 5 of Algorithm normExtract is equal to -qs, and b in line 6 is the same. By (24), Lemma 2 for Algorithm FastTwoSum is applicable, so that there is no rounding error when subtracting M in line 5, i.e., \(q_i = \textrm{fl}(M+x_i)-M\). Using \(\textrm{ufp}(M+x_i) \le 2\textrm{ufp}(M)\) by (24) that implies

$$\begin{aligned} x_i = q_i + b_i \quad \text{ and }\quad |b_i| \le 2\textbf{u}\cdot \textrm{ufp}(M) \end{aligned}$$
(25)

and \(q_i \le x_i+\textbf{u}\cdot \textrm{ufp}(M+x_i)\) for all \(i \in \{1,\ldots ,n\}\). We distinguish three cases to show

$$\begin{aligned} q_i \le 2x_i. \end{aligned}$$
(26)

First, if \(2\textbf{u}\cdot \textrm{ufp}(M) \le x_i\), then \(\textrm{ufp}(M+x_i) \le 2\textrm{ufp}(M)\) proves (26). Second, if \(\textbf{u}\cdot \textrm{ufp}(M) \le x_i < 2\textbf{u}\cdot \textrm{ufp}(M)\), then \(\textrm{ufp}(M+x_i) = \textrm{ufp}(M)\), which proves (26) as well. Third and finally, if \(x_i < \textbf{u}\cdot \textrm{ufp}(M)\), then \(\textrm{fl}(M+x_i) = M\) and \(q_i=0\). Thus (26), (24) and (15) yield

$$\begin{aligned} \sum _{i=1}^n q_i^2 \le 4\Vert x\Vert ^2 \le \textbf{u}M^2 < 4\textbf{u}\cdot \textrm{ufp}(M)^2. \end{aligned}$$

Now (20) and (16) yield \(q_i \in \textbf{u}\cdot \textrm{ufp}(M+x_i)\mathbb {Z}\subseteq 2\textbf{u}\cdot \textrm{ufp}(M)\mathbb {Z}\). Hence \(q_i^2 \in \textbf{u}\cdot 4\textbf{u}\cdot \textrm{ufp}(M)^2 \mathbb {Z}\), and (19) shows that the floating-point sum of all \(q_i^2\) is error-free, i.e., \(S = \sum _{i=1}^n q_i^2\). For \(c_i:= \textrm{float}((q_i+x_i)b_i)\) we see by (1) that

$$\begin{aligned} |c_i-(q_i+x_i)b_i| \le \left( \left( 1+\frac{\textbf{u}}{1+\textbf{u}}\right) ^2-1\right) |q_i+x_i||b_i| \le 2\textbf{u}|q_i+x_i||b_i| \end{aligned}$$

and, if \(n \ge 3\),

$$\begin{aligned} \left| \textrm{float}\left( \sum _{i=1}^n c_i\right) - \sum _{i=1}^n c_i\right| \le \frac{(n-1)\textbf{u}}{1+(n-1)\textbf{u}} \sum _{i=1}^n |c_i| \le (n-1)\textbf{u}\sum _{i=1}^n |q_i+x_i||b_i|. \end{aligned}$$
(27)

Moreover, using (23) and \((1+\frac{\textbf{u}}{1+\textbf{u}})^{3} \le 1+3\textbf{u}\),

$$\begin{aligned} M = \textrm{float}(\varphi \hat{x}) \le \frac{4(1+3\textbf{u})}{(2-(n+8)\textbf{u})\sqrt{\textbf{u}}} \left( 1 + (\frac{n}{2}+1) \textbf{u}\right) \Vert x\Vert =: \gamma \Vert x\Vert . \end{aligned}$$
(28)

Thus \(s = \textrm{float}(\sum _{i=1}^n c_i)\), \(x_i = q_i + b_i\), (26) and (25) yield

$$\begin{aligned} \begin{array} {rcl} |T + t - \Vert x\Vert ^2| &{}=&{} |S + s - \Vert x\Vert ^2| = |\sum \nolimits _{i=1}^n q_i^2 + s - \Vert x\Vert ^2| \\ {} &{}=&{} \left| s - \sum \nolimits _{i=1}^n (q_i+x_i)b_i\right| \\ {} &{}\le &{} (n-1)\textbf{u}\sum \nolimits _{i=1}^n |q_i+x_i||b_i| \\ {} &{}\le &{} (n-1)\textbf{u}\cdot 3\Vert x\Vert _1 \cdot 2\textbf{u}M \\ {} &{}\le &{} 6(n-1)\textbf{u}^2 \sqrt{n}\gamma \Vert x\Vert ^2.\\ \end{array} \end{aligned}$$

In order to show (22) we note that \(6(n-1)\textbf{u}^2 \sqrt{n}\gamma \le \frac{31}{32}\textbf{u}\) is equivalent to

$$\begin{aligned} 384(n-1)\sqrt{n\textbf{u}}(1+3\textbf{u})(2+(n+2)\textbf{u}) < 31(2-(n+8)\textbf{u}), \end{aligned}$$

which in turn is equivalent to \(\varPhi := \sum _{i=1}^{3} \alpha _i \textbf{u}^{i/2} + \alpha _5\textbf{u}^{5/2} < 62\) where

$$\begin{aligned} \begin{array} {rcl} \alpha _1 &{}=&{} 768(n-1)\sqrt{n}\\ \alpha _2 &{}=&{} 31n+248 \\ \alpha _3 &{}=&{} (384n^2+2688n-3072)\sqrt{n} \\ \alpha _5 &{}=&{} (1152n^2+1152n-2304)\sqrt{n}. \\ \end{array} \end{aligned}$$

Now \(\varPhi \) is monotonically increasing in n, and a direct computation using the maximal value \(n:= \lfloor \frac{11}{59}\textbf{u}^{-1/3} \rfloor \) shows

$$\begin{aligned} \varPhi < 61.9 + 12\textbf{u}^4 + 465\textbf{u}^6 + 18\textbf{u}^{10} + 93\textbf{u}^{12} \end{aligned}$$

and verifies (22) for \(\textbf{u}\le 2^{-8}\) and \(n \ge 3\). The case \(n=2\) follows by an extra factor \(\frac{1+2\textbf{u}}{1+\textbf{u}}\) in (27). Hence (22) is true and Corollary 1 finishes the proof. \(\square \)

The reason for the severe restriction of the vector length n to guarantee a faithfully rounded result is the estimate (27). As before, a rough estimate under the practical assumption \(\varphi _k \lesssim \sqrt{k}\textbf{u}\) in Lemma 8 suggests a faithfully rounded result for \(n \lesssim \frac{1}{12}\textbf{u}^{-1/2}\) for Algorithm normExtract.

The limit on the dimension for guaranteed faithful rounding is improved by the following Algorithm normExtract2, which introduces a second splitting (Fig. 11).

Fig. 11: Algorithm normExtract2

We show this algorithm as yet another example to compute the Euclidean norm faithfully; however, we refrain from giving a complete analysis. We just mention that the main errors occur in line 10, namely in the summation of \(2(q_i+r_i)c_i\) and \(c_i^2\). The following sums of the \(r_i^2\), the \(q_ir_i\) and the \(q_i^2\) are error-free.

5 Computation of the nearest rounding of \(\Vert x\Vert \)

The algorithms in the previous section adapt Algorithm AccSum in [28] to the computation of the Euclidean norm of a vector. In [29] we explored that principle by designing Algorithm NearSum to compute the rounded-to-nearest value of the sum of floating-point numbers, and Algorithm AccSign to compute the sign of the sum. Several other algorithms are given there as well, for example storing the result in an unevaluated vector, computing the rounded downward and upward results, and the treatment of vectors of huge length.

Next we derive Algorithm normNearest to compute the nearest value of the Euclidean norm of a vector. To that end we first present an adapted version of the Algorithm Transform derived in [28] (Fig. 12).

Fig. 12: Algorithm Transform

In our adaptation we rewrote the “repeat”- into a “while”-loop and omitted the output parameter \(\sigma \). Then Lemma 4.3 in [28] shows the following.

Lemma 10

Let tau1, tau2 and r be the result of Algorithm Transform applied to \(p \in \mathbb {F}^k\), and suppose \(k \le \frac{1}{2}\textbf{u}^{-1/2} - 2\). Then

$$\begin{aligned} \sum _{i=1}^k p_i = \tau _1 + \tau _2 + \sum _{i=1}^k r_i, \end{aligned}$$
(29)

and the MATLAB statement

res = tau1 + (tau2 + sum(r))

implies that res is a faithful rounding of \(\sum _{i=1}^k p_i\). Moreover,

$$\begin{aligned} \max _{1 \le i \le k}{|r_i|} \le 2^{-2M}\textbf{u}|\tau _1| \quad \text{ and }\quad |\tau _2| \le \textbf{u}|\tau _1|. \end{aligned}$$
(30)

If the constant \(\varPhi \) in line 3 is replaced by \(\varPhi = 2^M\), then \(\tau _1\) and \(\sum _{i=1}^k p_i\) have the same sign under the weaker assumption \(k \le \frac{1}{2}\textbf{u}^{-1} - 2\).

Proof

The definition of M in line 2 implies \(2^M \ge k+2 \ge 2^{M-1}\) and therefore

$$\begin{aligned} 2^{2M}\textbf{u}\le 4(k+2)^2\textbf{u}\le 1. \end{aligned}$$

Hence the assumptions of Lemma 4.3 in [28] are satisfied, and the assertions up to (30) follow. The last statement is implied by Theorem 4.2 in [29]. \(\square \)

The smaller the constant \(\varPhi \), the fewer “while”-loops are necessary in Algorithm Transform. As shown in [28] and [29], the chosen constants \(\varPhi = 2^{2M}\) for a faithful result and \(\varPhi =2^M\) for the sign are optimal.

Our Algorithm normNearest below needs the predecessor and successor of a floating-point number. The next Algorithm PredSucc combines Algorithms Pred and Succ from [30] (Fig. 13).

Fig. 13: Predecessor and successor of c

In Theorem 2.2 in [30] it is shown that Algorithms Pred and Succ compute the predecessor and successor of a floating-point number c provided that \(\textbf{u}\le \frac{1}{16}\), except for a tiny range near the smallest positive normalized floating-point number. To avoid that, we scale the input in line 2 so that, provided no overflow occurs, Algorithm PredSucc computes the predecessor and successor of c. Of course, proper scaling avoids overflow.
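Since Fig. 13 is not reproduced, here is a sketch of the construction. The constant \(\varphi = \textbf{u}(1+2\textbf{u})\) is from [30]; the scaling factor \(2^{106}\) is an assumption of this sketch, one way to move the input away from the critical range provided no overflow occurs:

function [pred,succ] = PredSucc(c)
  cs = c*2^106;                 % line 2: scale away from the subnormal range
  phi = 2^(-53)*(1+2^(-52));    % u*(1+2u) for binary64, cf. [30]
  e = phi*abs(cs);
  pred = (cs - e)*2^(-106);     % scaling by powers of 2 is error-free
  succ = (cs + e)*2^(-106);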

Now we can present our Algorithm normNearest in Fig. 14 to compute the nearest value of the Euclidean norm of a vector. It borrows from Algorithm NearSum in [29] and is adapted to our task.

Fig. 14: Algorithm for round to nearest Euclidean norm

Remark 1

There are obvious ways to improve Algorithm normNearest, for example by utilizing information obtained in the first transformation in line 3 in the following transformations in lines 5 and possibly 19, or by integrating the call in line 5 into that of line 3. Moreover, the transformation in Algorithm 3.3 in [29] with an extra parameter \(\varrho \), computing a faithful rounding of \(\varrho + \sum _{i=1}^n p_i\), could be used. We refrain from doing that to keep the code simple.

Remark 2

Algorithm Transform in line 3 transforms the input vector [S, s] into p. The number of “while”-loops is proportional to the condition number of the sum, i.e., to how close the true sum is to the midpoint of adjacent floating-point numbers.

Algorithm Transforms in lines 5 and possibly 19 is applied to the already transformed vector p; in all our examples we did not encounter more than 2 loops.

Theorem 4

Let \(x \in \mathbb {F}^n\) be given and apply Algorithm normNearest to x, where Algorithm Transforms in lines 5 and 19 is identical to Transform in Fig. 12 with the constant \(\varPhi \) in line 3 replaced by \(\varPhi = 2^M\). Suppose \(n \le \frac{1}{4}\textbf{u}^{-1/2} - 4\). Then the computed result res is equal to the Euclidean norm of x rounded to the nearest floating-point number, i.e., \(\texttt {res} = \textrm{fl}(\Vert x\Vert )\).

Proof

Line 2 in Algorithm normNearest and Lemma 3 imply \(\sum _{i=1}^n x_i^2 = \sum _{i=1}^n P_i + \sum _{i=1}^n p_i\), so that Lemma 10 shows that

$$\begin{aligned} \sum _{i=1}^n x_i^2 = \tau _1 + \tau _2 + \sum _{i=1}^n p_i \end{aligned}$$
(31)

and that f computed in line 4 is a faithful rounding of \(\Vert x\Vert ^2\). Thus \(\textrm{pred}(f)< \Vert x\Vert ^2 < \textrm{succ}(f)\). The vector argument of Transforms in line 5 is equal to

$$\begin{aligned} Q:= \tau _1 + \tau _2 + \sum _{i=1}^n p_i - f = \sum _{i=1}^n x_i^2 - f. \end{aligned}$$

Lemma 10 shows that the signs of Q and the computed \(\delta \) coincide. It follows that \(\Vert x\Vert ^2\in (\textrm{pred}(f),f)\) if \(\delta < 0\), \(\Vert x\Vert ^2 \in (f,\textrm{succ}(f))\) if \(\delta > 0\), and \(\Vert x\Vert ^2=f\) if \(\delta =0\). Thus lines 6–11 imply that \(\Vert x\Vert ^2\) is in the convex union of f and \(f_2\). Denote the pair \((f,f_2)\) by \((s_1,s_2) \in \mathbb {F}^2\) with \(s_1 \le s_2\), such that

$$\begin{aligned} s_1< \Vert x\Vert ^2 < s_2 \quad \text{ or }\quad s_1 = \Vert x\Vert ^2 = s_2 \end{aligned}$$
(32)

and \(s_2 \le \textrm{succ}(s_1) \le (1+2\textbf{u}) s_1\). Set \(g_i:= \textrm{fl}(\sqrt{s_i})\) for \(i \in \{1,2\}\). Then (1), \(\sqrt{1+2\textbf{u}} < 1+\textbf{u}\), the monotonicity of the rounding \(\textrm{fl}(\cdot )\) and \(\textrm{fl}((1+\textbf{u})x) \le \textrm{fl}((1+\textbf{u})^2 \textrm{fl}(x)) \le \textrm{succ}(\textrm{fl}(x))\) for \(x \in \mathbb {R}\) imply

$$\begin{aligned} g_2 = \textrm{fl}(\sqrt{s_2}) \le \textrm{fl}(\sqrt{(1+2\textbf{u})s_1}) \le \textrm{fl}((1+\textbf{u})\sqrt{s_1}) \le \textrm{succ}(\textrm{fl}(\sqrt{s_1})) = \textrm{succ}(g_1). \end{aligned}$$

Hence \(g_1\) and \(g_2\) are equal or adjacent floating-point numbers, and (32) yields

$$\begin{aligned} g_1 = \textrm{fl}(\sqrt{s_1}) \le \textrm{fl}(\Vert x\Vert ) \le \textrm{fl}(\sqrt{s_2}) = g_2. \end{aligned}$$

In other words, the nearest rounding of \(\Vert x\Vert \) is in \(\{g_1,g_2\}\). Thus, if \(g_1=g_2\), the nearest rounding is equal to \(g_1=g_2\) which is handled in line 15.

Otherwise, line 17 implies \(g_1^2 = R + r\). Then d, which is a power of 2 because it is half the distance between \(g_1\) and \(g_2\), is computed in line 18 without rounding error. Thus the product \(2 g_1 d\) is computed without error as well, and the sum of the vector argument of Transforms in line 19 is equal to

$$\begin{aligned} S:= \tau _1 + \tau _2 + \sum _{i=1}^n p_i - R - r - 2 g_1 d - d^2 = \sum _{i=1}^n x_i^2 - (g_1+d)^2. \end{aligned}$$

Note that the length of the vector argument is \(2n+6\), and the assumption on n verifies that Lemma 10 is applicable and implies \(\textrm{sign}(\texttt {Delta}) = \textrm{sign}(S)\). Now \(g_1+d\) is the midpoint between the adjacent floating-point numbers \(g_1\) and \(g_2\), and the result follows by \(\textrm{fl}(\Vert x\Vert ) \in \{g_1,g_2\}\). \(\square \)

We mention that the assumption \(n \le \frac{1}{4}\textbf{u}^{-1/2} - 4\) can be lifted to \(n \le \frac{1}{32}\textbf{u}^{-1}-64\) using the ideas in Algorithm AccSumHugeN in [29], but we refrain from exploring this.

We showed that the f computed in line 4 is a faithful rounding of \(\Vert x\Vert ^2\). As has been noted in [6], that does not imply that \(\textrm{fl}(\sqrt{f})\) is a faithful rounding of \(\Vert x\Vert \), but likely AccSqrt(f,delta) is.

6 Generation of ill-conditioned examples

A vector p is ill-conditioned with respect to the nearest rounding of \(\Vert p\Vert \) if a very small change of the input data changes the result. The closer \(\Vert p\Vert \) is to a switching point, the more difficult and ill-conditioned is the computation of the nearest rounding. For positive \(f \in \mathbb {F}\) its successor is \(\textrm{succ}(f) = f + 2\textbf{u}\cdot \textrm{ufp}(f)\), so that the switching point is \(\mu = f + \textbf{u}\cdot \textrm{ufp}(f) =: f + \delta \). Then \(\varepsilon = \frac{\delta '-\delta }{\delta }\) is the relative distance of \(\Vert p\Vert = f + \delta '\) to the switching point \(f + \delta \).

For given \(\varepsilon \) it is, in principle, not too difficult to generate a vector p with \(\Vert p\Vert \) having a relative distance \(\varepsilon \) to a switching point. To that end a multiple precision package may be helpful. However, when doing this we observed a severe influence on the timing. The mere presence of a call to the multiple precision package (of course, outside the loop to be measured) changed the measured computing time by a factor of 2 and more. Therefore, we wrote Algorithm GenVec, see Fig. 15. Using it ensured reliable computing times.

Fig. 15: Vector \(p \in \mathbb {F}^n\) with relative distance \(\varepsilon \) of \(\Vert p\Vert \) to a switching point

The challenge is to approximate the anticipated final result \(\Vert p\Vert \) near a switching point s “from below”: During a loop the vector norm must always stay below s. That is the principle of Algorithm GenVec, a nice example of our algorithms with absolute splitting used for faithful and nearest rounding.

The rationale is as follows. The output vector p is computed in K segments, each of length m. The initialization in line 4 ensures that the final vector length is n. The floating-point number f in line 6 or its successor \(f+1\) is the anticipated result of the nearest rounding of the final vector p to be generated with relative distance e to the switching point \(f + 0.5\). The initial vector p as in line 4 satisfies \(\Vert p\Vert ^2 = p_1 + p_2\) and \(f-\Vert p\Vert > 0\). Lines 7–8 yield \(f^2 = F_1+F_2\) and \(e\cdot f = ef_1 + ef_2\), so that

$$\begin{aligned} \sum S_i = f^2 + f + \frac{1}{4} + e\cdot f - \Vert p\Vert ^2 = \left( f + \frac{1}{2}\right) ^2 + e\cdot f - \Vert p\Vert ^2 =: T \end{aligned}$$

for the S in line 9. Here \(\sum S_i\) denotes the mathematical sum of all elements of S. Furthermore, lines 10–11 and Lemma 10 imply that sumS is a faithful rounding of \(\sum S_i\). The \(\varphi \) in line 15 satisfies \(\varphi \le 1-4\textbf{u}\) for reasonable values of n and e, so that ps in line 14 satisfies

$$\begin{aligned} ps = \textrm{float}\left( \varphi \sqrt{\texttt {sumS}}\right) \le \left( 1+\frac{\textbf{u}}{1+\textbf{u}}\right) ^2 (1-4\textbf{u}) \sqrt{\texttt {sumS}} < (1-2\textbf{u}) \sqrt{\texttt {sumS}}. \end{aligned}$$
(33)

In the for-loop the element ps is appended to the vector p and \(-ps^2 = -ps_1 - ps_2\) to the vector S, so that the sum \(T = \sum S_i\) changes into \(T - ps^2\). Since sumS is a faithful rounding of \(\sum S_i\), (33) implies that \(T - ps^2 > 0\).

At the end of every loop, sumS is always a faithful rounding of the sum \(\sum S_i\) by lines 19–20, and the construction implies that sumS decreases to \((1-\varphi ^2) \texttt {sumS}\) in each step. The starting value of sumS is about \(f^2\), and \(\varphi \) and K are chosen such that \(\texttt {sumS} \le |e|\) after finishing the loop.

After finishing the for-loop, sumS is a faithful rounding of \((f + \frac{1}{2})^2 + e\cdot f - \sum p_i^2\). Since \(f \ge 2^{52}\) we conclude that \(\Vert p\Vert ^2\) is very close to \((f + \frac{1}{2})^2 + e\cdot f\), hence

$$\begin{aligned} \Vert p\Vert \approx \sqrt{\left( f + \frac{1}{2}\right) ^2 + e\cdot f} \approx f + \frac{1}{2} + \frac{e}{2}. \end{aligned}$$
(34)

In the above setting \(\delta = \frac{1}{2}\) and \(\delta ' = \frac{1+e}{2}\), so that the relative distance of \(\Vert p\Vert \) is \(\frac{\delta '-\delta }{\delta } = e\). The “approximations” in (34) are very accurate.

Finally, if \(e<0\), then \(\Vert p\Vert \) is left of the switching point \(f+\frac{1}{2}\) and f is the nearest rounding; otherwise, as computed in line 24, the nearest rounding is \(f+1\). The random perturbation in line 22 may be useful for testing the generality of algorithms.

It is clear from the code that the elements of one segment are close together, and that the segments decay with the factor \(\varphi \). If the number of segments K is increased, then a better distribution of the vector elements of p is obtained, however at the cost of increased computing time.

7 Computational results

The following computational results are all performed using MATLAB Version 2020b on a Core i7 laptop. In all of the following examples the number of test cases is generally 1 million; for large dimensions it is chosen such that the computing time stays below 1 hour.

We start with some timing comparisons of variants of MATLAB implementations. For example, an alternative to Algorithm Split in Fig. 2 is the following (Fig. 16).

Fig. 16: Alternative splitting
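Fig. 16 is not reproduced. The alternative Split1 referred to in the text replaces the fixed factor of Split by an extraction constant computed via log2 and round; the following is a hypothetical sketch of such a variant, for timing illustration only:

function [x,y] = Split1(a)
  sigma = 2.^(round(log2(abs(a))) + 27);  % extraction constant via log2 and round
  x = (a + sigma) - sigma;                % high-order bits of a
  y = a - x;                              % a = x + y exactly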

The following Table 1 shows the computing time of Algorithm Split1 divided by that of Algorithm Split. We also compare sqr(a) vs. a.*a, TwoProduct vs. TwoSquare as in Fig. 3, and VecSum vs. FastVecSum as in Fig. 7.

Table 1 Time comparisons for different vector lengths

The original Algorithm Split is significantly faster than the simulation by log2 and round, so we use Algorithm Split. Similarly, Algorithm TwoSquare in Fig. 3 is some \(50\%\) faster than Algorithm TwoProduct, and the loop-free variant Algorithm FastVecSum in Fig. 7 is much faster than Algorithm VecSum in [25]. We use a.*a because the time seems the same as for sqr(a).

Before we come to timing comparisons, we give information about the accuracy of our algorithms and competitors. We start with possibilities to approximate \(\Vert x\Vert \) by the built-in MATLAB routines, where the obvious candidate is norm(x). We generate random test cases and display triples of numbers: the first and second are the percentages of nearest and faithful roundings, respectively, and the third is the percentage where the result is not faithful.

Table 2 Percentage of roundings: nearest/faithful/none

As can be seen in the third row of Table 2, the built-in function norm(x) is surprisingly accurate, more accurate than theory predicts [7,8,9].

We therefore perform tests on the same data using sqrt(sum(x.*x)) and an ordinary for-loop. Still sqrt(sum(x.*x)) is more accurate than expected; only the for-loop shows the anticipated behavior.

Details on the actual implementation of norm and sum are confidential, but the data in Table 2 suggests that some higher precision or compensating algorithms are used.

The essential difference between Algorithms normG [6] and normL [20] is the use of AccSqrt of [17] for the square root approximation in the last line of normL. To see the advantage, we use Algorithm normGacc which is identical to Algorithm normG except that

the last line res = sqrt(S); is changed into res = AccSqrt(S,s);

The following Table 3 shows the percentage of nearest roundings for random test cases with different dimensions.

Table 3 Percentage of nearest rounding of normG vs. normGacc

There was no case without faithful rounding, as proved in [6], and for the improved Algorithm normGacc we found cases where the rounding was not to nearest only for \(n = 10^7\), in fact some \(17 \%\).

Up to now we used random vectors produced by randn(n,1), for which it is not too difficult to calculate a nearest rounding of \(\Vert x\Vert \). That changes when the true result is close to the midpoint between two adjacent floating-point numbers, i.e., close to a “switching point”.

To that end we use Algorithm GenVec to generate vectors x of different dimensions with relative distance \(\varepsilon \) of \(\Vert x\Vert \) to a switching point. For each pair of dimension n and relative distance \(\varepsilon \), we display the percentage of nearest roundings in Table 4.

In all test cases and for all algorithms we did not encounter an example without faithful rounding. As already seen in Table 3, Algorithm normGacc generally outperforms the original normG in terms of accuracy. Algorithm normG is targeted at a faithfully rounded result S, not at minimizing the error of \(S+s\) versus \(\Vert x\Vert ^2\). Thus about half the results of Algorithm normG are nearest, the other half faithful but not nearest.

Algorithm normDD uses a general purpose double-double arithmetic, and Algorithm normCpair our pair arithmetic with computable error bounds. As Algorithms normGacc and normL are tailored methods but use the same principle, we expect similarly accurate results. Indeed, that can be seen in Table 4 for all test examples, including very small distance to a switching point. For a relative distance \(\varepsilon \) down to about \(10^{-14}\) the rounding is nearest. Similarly, as Algorithms normSum2 and normSum3 are based on similar principles, they show the same accuracy, with normSum3 being a little better.

Table 4 Percentage of nearest rounding for relative distance \(\varepsilon \) of \(\Vert x\Vert \) to switching point

Algorithms normExtract and normExtract2 are based on a different principle, namely an absolute splitting. As mentioned, this avoids the costly application of Algorithm TwoSquare. For moderate distance \(\varepsilon \) the rounding is nearest, including large vector lengths; for distance \(10^{-10}\) and below the accuracy is similar to normG with roughly a 50-50 chance of a nearest result. As we will see next, the slightly smaller number of nearest roundings is compensated by a much better performance.

The number of nearest cases improves a little bit with normExtract2, and the result of Algorithm normNearest is, of course, always rounded to nearest.

Next we present timing results for our algorithms and competitors for random vectors and for dimensions up to \(n = 10^7\). It is appropriate to use random vectors because the computing time of all algorithms except normNearest does not depend on the difficulty of the problem, only on the length of the input vector; times for normNearest for different \(\varepsilon \) are displayed separately.

It turns out that our new Algorithm normExtract is always the fastest. Therefore the following Table 5 shows the time ratio against normExtract. The timing for Algorithms normDD and normCpair is dominated by MATLAB’s interpretation overhead and in particular by the use of operator overloading. Therefore comparing the computing times hardly gives information on the performance of the algorithms and is omitted.

Table 5 Timing relative to normExtract for random vectors of length n

From the operation count it may surprise that normExtract is so much faster than normG and normL. Now normExtract is based on AccSum in [28], and Langlois [19] showed that it enjoys better instruction-level parallelism than other algorithms. The same applies to normSum2 and may explain its relatively good performance, and also to normSum3, where we see twice the computing time of normSum2, as expected. That is still faster than normG and normL. There is an exception for all algorithms, namely \(n=10^5\). We think this is due to unfortunate cache management, similarly for normNearest.

Algorithm normNearest is about as fast as normL, for medium size dimensions much faster, although it guarantees a nearest rounding of \(\Vert x\Vert \).

The time in seconds for 5000 calls in dimensions up to 1 million for all algorithms is shown in Fig. 17. The legend on the left is ordered by performance, from the slowest normG down to the fastest normExtract. All algorithms except Algorithm normNearest execute the same code independent of the difficulty of the problem; hence the computing time depends almost linearly on the dimension. For normNearest we see small zig-zags depending on the number of transformations.

Fig. 17: Timing for 5000 calls for random vectors of dimension up to 1 million

Finally we investigate whether the guarantee of nearest rounding causes a time penalty for Algorithm normNearest if \(\Vert x\Vert \) is very close to a switching point. As before we generate examples with relative distance \(\varepsilon \) to a switching point. The ratio of the computing time of Algorithm normNearest to that of normExtract is displayed in Table 6; the time for the other algorithms does not change because they are independent of the condition of the problem.

Table 6 Timing of normNearest/normExtract, relative distance \(\varepsilon \) of \(\Vert x\Vert \) to switching point

There is not much impact on the computing time of Algorithm normNearest in our examples despite the guarantee of nearest rounding of \(\Vert x\Vert \), even for a tiny relative distance \(\varepsilon = 10^{-100}\) to a switching point.

8 Summary

We may use a general purpose pair arithmetic such as double-double [2] or cpair [17] to calculate an accurate approximation of the Euclidean norm \(\Vert x\Vert \) of a vector. To that end we presented Algorithms normDD and normCpair in Fig. 6. Specialized algorithms based on a pair arithmetic have been presented in [6, 20] and are displayed as Algorithms normG and normL in Fig. 5.

In this note we developed Algorithms normSum2 and normSum3 in Sect. 3 based on relative splitting as the algorithms in [25]. The performance is significantly improved by FastVecSum, a vectorized version of VecSum in [25]. In addition, Algorithms normExtract and normExtract2, based on absolute splittings as in [28, 29], are presented in Sect. 4.

All algorithms mentioned so far compute a faithfully rounded result of \(\Vert x\Vert \), in many cases the nearest result. A first algorithm to provably compute the rounded to nearest result is presented as Algorithm normNearest.

The computing times of our new algorithms compare favorably to the competitors, where Algorithm normExtract is significantly faster than all others. Algorithm normNearest is also fast despite the guaranteed nearest rounding. That includes difficult cases where the true Euclidean norm \(\Vert x\Vert \) has a relative distance as small as \(\varepsilon = 10^{-100}\) to a switching point.