1 Introduction

The Arithmetic of Infinity was introduced by Y.D. Sergeyev with the aim of devising a new coherent computational environment able to handle finite, infinite and infinitesimal quantities, and to execute arithmetical operations with them. It is based on a positional numeral system with the infinite radix \({\textcircled {1}}\), called grossone and representing, by definition, the number of elements of the set of natural numbers \(\mathbb {N}\) (see, for example, Sergeyev (2008, 2009) and the survey paper Sergeyev (2017)). Similar to the standard positional notation for finite real numbers, a number in this system is recorded as

$$\begin{aligned} c_{{m}} {\textcircled {1}}^{p_{m}} \ldots c_{{1}} {\textcircled {1}}^{p_{1}} c_{{0}} {\textcircled {1}}^{p_{0}} c_{{-1}} {\textcircled {1}}^{p_{-1}} \ldots c_{{-k}} {\textcircled {1}}^{p_{-k}}, \end{aligned}$$

with the obvious meaning

$$\begin{aligned} c_{{m}} {\textcircled {1}}^{p_{m}} + \cdots + c_{{1}} {\textcircled {1}}^{p_{1}} +c_{{0}} {\textcircled {1}}^{p_{0}} + c_{{-1}} {\textcircled {1}}^{p_{-1}} + \cdots + c_{{-k}} {\textcircled {1}}^{p_{-k}}. \end{aligned}$$
(1)

The coefficients \(c_i\), called grossdigits, are real numbers while the grosspowers \(p_i\), sorted in decreasing order

$$\begin{aligned} p_{m}> \cdots> p_{1}> p_0=0> p_{-1}> \cdots > p_{-k}, \end{aligned}$$

may be finite, infinite or infinitesimal even though, for our purposes, only finite integer grosspowers will be considered.

Notice that, since \({\textcircled {1}}^0=1\) by definition, the set of real numbers and the related operations are naturally included in this new system. In this respect, the Arithmetic of Infinity should be perceived as a more powerful tool that improves the ability of observing and describing mathematical outcomes that the standard numeral system could not properly handle. In particular, the new system allows us to better inspect the nature of the infinite objects we are dealing with. For example, while \(\infty +1=\infty \) in the standard thinking, if, using the new methodology, we are in a position to specify the kind of infinity we are observing as, say, \({\textcircled {1}}\), then such an equality is better replaced with \({\textcircled {1}}+1>{\textcircled {1}}\). According to the principle that the part is less than the whole, this novel perception of infinity has proved successful in resolving a number of paradoxes involving infinities and infinitesimals, the most famous being Hilbert’s paradox of the Grand Hotel (see Sergeyev (2008, 2009)).

The Arithmetic of Infinity paradigm is rooted in three methodological postulates and its consistency has been rigorously recognized in Lolli (2015). Its theoretical and practical implications are formidable, also considering that the final goal is to make the new computing system available through a dedicated processing unit. The computational device that implements the Infinity Arithmetic has been called the Infinity Computer and is patented in the EU, the USA, and Russia (see, for example, Sergeyev (2010)). It is worth emphasizing that the Infinity Computer methodology is not related to non-standard analysis (see Sergeyev (2019)).

Among the many fields of research to which this new methodology has been successfully applied, we mention numerical differentiation and optimization (Cococcioni et al. 2020; De Cosmis and Leone 2012; Sergeyev 2011; Žilinskas 2012), the numerical solution of differential equations (Amodio et al. 2016; Iavernaro et al. 2019; Mazzia et al. 2016; Sergeyev 2013; Sergeyev et al. 2016), models for percolation and biological processes (Iudin et al. 2012; Vita et al. 2012), and cellular automata (D’Alotto 2015; Iudin et al. 2015). Finally, in Falcone et al. (2020) the authors introduce a Simulink-based simulator of the Infinity Computer, especially suitable for visual programming in control theory. For further references and applications see the survey Sergeyev (2017).

The aim of the present study is to devise a dynamic precision floating-point arithmetic by exploiting the computational platform provided by the Infinity Computer. In contrast with standard variable-precision arithmetics, here not only may the accuracy be dynamically changed during the execution of a given algorithm, but variables stored with different accuracies may also be combined through the usual algebraic operations. This strategy is explored and applied to the accurate solution of ill-conditioned/unstable problems (Brugnano et al. 2011; Iavernaro et al. 2006).

One interesting application is the possibility of handling ill-conditioned problems or even of implementing algorithms which are labeled as unstable in standard floating-point arithmetic. One example in this direction has been illustrated in Amodio et al. (2020). It consists in using iterative refinement to improve the accuracy of a computed solution to an ill-conditioned linear system until a prescribed input accuracy is achieved.

Observe that new architectures for high-performance computing allow the use of low-precision floating-point arithmetic available in hardware devices such as GPUs. Consequently, a new generation of numerical algorithms that work in mixed precision is being developed (see Dongarra et al. (2020) for a review). In this direction, Carson and Higham (2018) illustrate an example of the use of iterative refinement with the aid of three different machine precisions.

The paper is organized as follows. In the next section we highlight those features of the Infinity Computer that play a key role in setting up the variable-precision arithmetic. The latter is discussed in Sect. 3 together with a few illustrative examples. As an application in Numerical Analysis, in Sect. 4 we consider the problem of finding the zero of a nonlinear function affected by ill-conditioning issues. Finally, some conclusions are drawn in Sect. 5.

2 Background

As is the case with the standard floating-point arithmetic, the Infinity Computer handles both numbers and operations numerically (not symbolically). Consequently, it can efficiently sustain the massive amount of computation needed to solve a wide variety of real-life problems. On the other hand, a roundoff error proportional to the machine accuracy is generated during the representation of data (i.e., the coefficients \(c_i\) and \(p_i\) in (1)) and the execution of the basic operations. A more detailed description of how grossdigits are represented and how the floating-point operations are carried out will be given in the next section; here, for the sake of simplicity, we neglect these sources of error.

The grossnumbers that will be considered in the sequel are those that admit an expansion in terms of integer powers of \({\textcircled {1}}^{-1}\) and, thus, take the form

$$\begin{aligned} X =\sum _{j=0}^{T} c_j{\textcircled {1}}^{-j}, \end{aligned}$$
(2)

where T denotes the maximum order of infinitesimal appearing in X. For this special set, the arithmetic operations on the Infinity Computer follow the same rules defined for the polynomial ring. For example, given the two grossnumbers

$$\begin{aligned} X=x_0{\textcircled {1}}^0+x_1{\textcircled {1}}^{-1}, \quad Y=y_0{\textcircled {1}}^0+y_1{\textcircled {1}}^{-1}+y_2{\textcircled {1}}^{-2}, \end{aligned}$$
(3)

we get

$$\begin{aligned} X+Y&=(x_0+y_0){\textcircled {1}}^0+(x_1+y_1){\textcircled {1}}^{-1}+y_2{\textcircled {1}}^{-2},\\ X\cdot Y&=x_0y_0{\textcircled {1}}^0+(x_0y_1+x_1y_0){\textcircled {1}}^{-1}\\&\quad +(x_0y_2+x_1y_1){\textcircled {1}}^{-2}+x_1y_2{\textcircled {1}}^{-3}, \end{aligned}$$

and analogously for the division X/Y. Notice that, on the Infinity Computer, variables may coexist with different storage requirements. Setting aside the (negative) powers of \({\textcircled {1}}\) that, as we will see, need not be stored in our usage, the variable Y contains infinitesimal quantities up to order 2, thus requiring one extra record to store the grossdigit \(y_2\) compared with the variable X, which only contains a first-order infinitesimal. This circumstance also influences the computational complexity associated with each single floating-point operation. As a consequence of the different amount of memory allocated for storing grossnumbers, the global computational complexity of a given algorithm performed on the Infinity Computer cannot be estimated merely in terms of how many flops are executed, but should also take into account how many grossdigits are involved in each operation.
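To make these rules concrete, the following MATLAB fragment (a sketch for exposition only, not the Infinity Computer's internal representation; the numeric grossdigit values are arbitrary) stores a grossnumber of the form (2) as the vector of its grossdigits and performs the two operations above as polynomial operations.

```matlab
% Sketch: a grossnumber (2) stored as the vector of its grossdigits
% [c_0, c_1, ..., c_T], where the entry c_j multiplies grossone^(-j).
X = [2.0, 0.5];             % X = 2 + 0.5*grossone^(-1), cf. (3)
Y = [1.0, -3.0, 0.25];      % Y = 1 - 3*grossone^(-1) + 0.25*grossone^(-2)

% addition: pad with zeros and add componentwise, as in (5)
n = max(numel(X), numel(Y));
S = [X, zeros(1, n - numel(X))] + [Y, zeros(1, n - numel(Y))];

% multiplication: discrete convolution of the grossdigit vectors, as in (6)
P = conv(X, Y);             % grossdigits of grossone^0, ..., grossone^(-3)
```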

If X is chosen as in (2), we denote by \(X^{(q)}\) its section obtained by neglecting, in the sum, all the infinitesimals of order greater than q, that is

$$\begin{aligned} X^{(q)}= \sum _{j=0}^q c_j {\textcircled {1}}^{-j}. \end{aligned}$$
(4)

For example, choosing \(q=0\) and X and Y as in (3), we see that \(X^{(0)}+Y^{(0)} = x_0+y_0\) and \(X^{(0)}\cdot Y^{(0)} = x_0y_0\) would resemble the floating-point addition and multiplication in standard arithmetic, respectively, while additional effort is needed if other powers of \({\textcircled {1}}^{-1}\) are successively involved. More precisely, the computational cost associated with a single operation on two grossnumbers will depend on how many infinitesimals are considered. Assuming \(q<p\) and denoting by \(d_j\) the grossdigits associated with Y, for the two sections \(X^{(q)}\) and \(Y^{(p)}\), the addition

$$\begin{aligned} X^{(q)}+Y^{(p)} = \sum _{j=0}^q (c_j+d_j) {\textcircled {1}}^{-j} + \sum _{j=q+1}^p d_j {\textcircled {1}}^{-j} \end{aligned}$$
(5)

requires \(q+1\) additions of grossdigits, while the multiplication

$$\begin{aligned} X^{(q)}\cdot Y^{(p)}&=\sum _{j=0}^q \sum _{i=0}^j c_id_{j-i} {\textcircled {1}}^{-j} + \sum _{j=q+1}^p \sum _{i=0}^q c_id_{j-i} {\textcircled {1}}^{-j} \nonumber \\&\quad + \sum _{j=p+1}^{p+q}\; \sum _{i=j-p}^q c_id_{j-i} {\textcircled {1}}^{-j} \nonumber \\&=\sum _{j=0}^{q+p}\quad \sum _{i=\max \{0,j-p\}}^{\min \{q,j\}} c_id_{j-i} {\textcircled {1}}^{-j}, \end{aligned}$$
(6)

amounts to \((q+1)(p+1)\) multiplications and \(qp-q(q-1)/2\) additions/subtractions of grossdigits. It is worth noticing that, since in both operations all the coefficients of \({\textcircled {1}}^{-j}\) may be independently calculated, there is ample room for parallelization. We will not consider this aspect in detail in the present study.
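As a concrete check of the multiplication count, the sketch below (illustrative grossdigit values; the names c, d, q, p follow the text, with \(q\le p\)) evaluates the product (6) with an explicit double loop and counts the grossdigit multiplications.

```matlab
% Sketch of the product (6) between the sections X^(q) and Y^(p),
% with the grossdigit multiplications counted explicitly.
c = [1.5, -0.25, 0.75];            % grossdigits of X^(q), so q = 2
d = [2.0, 0.5, -1.0, 0.125];       % grossdigits of Y^(p), so p = 3
q = numel(c) - 1;  p = numel(d) - 1;

prodDigits = zeros(1, q + p + 1);  % coefficients of grossone^0, ..., grossone^-(q+p)
nmult = 0;
for j = 0:(q + p)
    for i = max(0, j - p):min(q, j)
        prodDigits(j + 1) = prodDigits(j + 1) + c(i + 1)*d(j - i + 1);
        nmult = nmult + 1;
    end
end
disp(nmult)                        % equals (q+1)*(p+1) = 12
```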

3 A variable-precision representation of floating-point numbers on the Infinity Computer

Grossnumbers of the form (2) and their sections (4) form the basis of the new floating-point arithmetic, where numbers with different accuracies may be simultaneously represented and combined. The idea is to let \({\textcircled {1}}^{-1}\) and its powers act as machine infinitesimal quantities when related to the classical floating-point system. These infinitesimal entities, if suitably activated or deactivated, may be conveniently exploited to increase or decrease the required accuracy during the flow of a given computation. This strategy may be used to automatically detect ill-conditioning issues during the execution of a code that solves a given problem, and to change the accuracy accordingly, in order to optimize the overall computational effort under the constraint that the resulting error in the output solution should meet a given input tolerance. A formal introduction of the new dynamic precision arithmetic is discussed hereafter.

3.1 Machine numbers and their storage in the Infinity Computer

Let t and T be two given non-negative integers and \(N=(T+1)(t+1)-1\). The set of machine numbers we are interested in is given by

$$\begin{aligned} {\mathbb {F}}= \left\{ X\in {\mathbb {R}}~\big |~ X=\pm \beta ^p\sum _{i=0}^N d_i \beta ^{-i} \right\} \cup \{0\}, \end{aligned}$$
(7)

where \(\beta \ge 2\) denotes the base of the numeral system, the integer p is the exponent ranging in a given finite interval, and \(0\le d_i\le \beta -1\) are the significant digits, with \(d_0\not =0\) (normalization condition). Starting from \(d_0\), we group the digits \(d_i\) in \(T+1\) adjacent strings each of length \(t+1\):

$$\begin{aligned} X&= \pm \beta ^p \, \underbrace{d_0.d_1\cdots d_t}_{t+1}\underbrace{d_{t+1}\cdots d_{2t+1}}_{t+1}\cdots \nonumber \\&\quad \underbrace{d_{j(t+1)}\cdots d_{(j+1)(t+1)-1}}_{t+1} \cdots \underbrace{d_{T(t+1)}\cdots d_{(T+1)(t+1)-1}}_{t+1}\nonumber \\&= \pm \beta ^p \sum _{j=0}^T \beta ^{-j(t+1)} \sum _{i=0}^t d_{j(t+1)+i} \beta ^{-i}. \end{aligned}$$
(8)

The representation (8) of the numbers X in (7) suggests an interesting application of the Infinity Computer. Introducing the new symbol \(\bar{{\textcircled {1}}}\), called dark grossone, as

$$\begin{aligned} \bar{{\textcircled {1}}} = \beta ^{\,t+1}, \end{aligned}$$
(9)

and setting

$$\begin{aligned} c_j=\sum _{i=0}^t d_{j(t+1)+i} \beta ^{-i}, \end{aligned}$$
(10)

the number X in (8) may be rewritten as

$$\begin{aligned} X = \pm \beta ^p \sum _{j=0}^{T} c_j\, \bar{{\textcircled {1}}}^{-j}. \end{aligned}$$
(11)

Its section of index q is then given by

$$\begin{aligned} X^{(q)} = \pm \beta ^p \sum _{j=0}^{q} c_j\, \bar{{\textcircled {1}}}^{-j}. \end{aligned}$$
(12)

In analogy with the arithmetic of grossone, we continue referring to the coefficients \(c_j\) as grossdigits. We assume that a real number x is represented by a floating-point number X in the form (11) by truncating it, or rounding it to nearest even, after the digit \(d_N\). This is the highest attainable accuracy during the data representation phase but, in general, a lower accuracy (and hence faster execution times) will suffice while processing the data, which is achieved by involving sections of X of suitable indices q during the computations. In this respect, for \(q=0,1,2,\dots , T\), we can interpret the section \(X^{(q)}\) in (12) as the single, double, triple and, in general, multiple precision approximation of x, respectively. These attributes are specific to the variable-precision arithmetic epitomized by the sections \(X^{(q)}\) defined in (12) and should not be confused with the corresponding meaning in the IEEE-754 standard for binary floating-point arithmetic: only for the particular choice \(\beta =2\) and \(t=23\) may the two contexts exhibit some similarities.
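The grouping (8)–(10) is easy to emulate. The sketch below (illustrative values \(\beta =2\), \(t=3\), \(T=2\) and an arbitrary digit string) packs the first \(N+1\) significant digits of a number into the grossdigits \(c_0,\dots ,c_T\).

```matlab
% Sketch of the grouping (8)-(10): N+1 = (T+1)*(t+1) significant base-beta
% digits d_0,...,d_N are packed into T+1 grossdigits of t+1 digits each.
beta = 2;  t = 3;  T = 2;
d = [1 1 1 0 1 0 1 0 1 1 1 0];          % d_0, ..., d_N with N = 11
c = zeros(1, T + 1);
for j = 0:T
    for i = 0:t
        c(j + 1) = c(j + 1) + d(j*(t + 1) + i + 1)*beta^(-i);   % formula (10)
    end
end
% X = (+/-) beta^p * sum_j c(j+1) * darkgrossone^(-j), with darkgrossone = beta^(t+1)
```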

Echoing the symbol \({\textcircled {1}}\), the new symbol \(\bar{{\textcircled {1}}}\) emphasizes the formal analogy between a machine number and a grossnumber (compare (11) with (2) and (12) with (4)). This correspondence suggests that the computational platform provided by the Infinity Computer may be conveniently exploited to host the set \({\mathbb {F}}\) defined at (7) and to execute operations on its elements using a novel methodology. This is accomplished by formally identifying the two symbols, which means that, though they refer to two different definitions, they are treated in the same way in relation to the storage and execution of the basic operations. In accord with the features outlined in Sect. 2, the Infinity Computer will then be able to:

  (a) store floating-point numbers at different accuracy levels, by involving different infinitesimal terms, according to the need;

  (b) easily access sections of floating-point numbers as defined in (12);

  (c) perform computations involving numbers stored with different accuracies.

The affinity between the meanings of the two symbols goes even beyond what has been stated above. We have already observed that the case \(q=0\) in (12) resembles the standard set of floating-point numbers with \(t+1\) significant figures. This means that when the Infinity Computer works with numbers of the form \(X^{(0)}\) it precisely matches the structure designed following the principles of the IEEE 754 standard. In this mode, the operational accuracy is set at its minimum value and the upper bound on the relative error due to rounding (unit roundoff) is proportional to \(\bar{{\textcircled {1}}}^{-1}\). In other words, \(\bar{{\textcircled {1}}}^{-1}\) will be perceived as an infinitesimal entity which cannot be handled unless we let numbers in the form \(X^{(1)}\) come into play. This argument can then be iterated to involve \(\bar{{\textcircled {1}}}^{-i}\), \(i=2,\dots ,T\). Mimicking the same concept expressed by the use of \({\textcircled {1}}\), negative powers of \(\bar{{\textcircled {1}}}\) act like lenses to observe and combine numbers using different accuracy levels.

Remark 1

What about the role of \(\bar{{\textcircled {1}}}\) as an infinite-like quantity? Consider again the basic operational mode with numbers in the form \(X^{(0)}\). If we ask the computer to count integer numbers according to the scheme

[figure a: counting scheme]

it would stop at \(\bar{{\textcircled {1}}}\), yielding a further similarity with the definition of \({\textcircled {1}}\) in the Arithmetic of Infinity. Again, involving sections of higher index, the counting process could be safely continued. Anyway, to avoid possible sources of confusion that might arise from similarities such as the one highlighted above, we stress that the roles of \(\bar{{\textcircled {1}}}\) and \({\textcircled {1}}\), as well as their methodological and philosophical meanings, are completely different. For example, while the dark grossone is mainly intended to extend the capabilities of the standard floating-point arithmetic, the grossone methodology covers a wider spectrum of applications and, in particular, may successfully cope with situations where the infinite or infinitesimal nature of an object has to be explored, e.g., measuring infinite or infinitesimal sets.
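The stagnation described in the remark can be observed without the emulator by running the same experiment in IEEE binary32 arithmetic (\(\beta =2\), \(t=23\)), where the quantity playing the role of \(\bar{{\textcircled {1}}}\) is \(2^{24}\); a minimal MATLAB sketch:

```matlab
% Counting in single precision: the counter stops increasing exactly at
% 2^(t+1) = 2^24, the value playing the role of dark grossone here.
x = single(0);
while x + 1 > x            % adding 1 is no longer visible once x = 2^24
    x = x + 1;
end
disp(x)                    % 16777216
disp(2^24)                 % same value
```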

In conclusion, the role of \(\bar{{\textcircled {1}}}\) could be interpreted as an inherent feature of the machine architecture which, consistently with the Infinity Arithmetic methodology, could activate suitable powers of \(\bar{{\textcircled {1}}}\) to get, when needed, a better perception of numbers. The examples included in the sequel further elucidate this aspect.

3.2 Floating-point operations

We have seen that, through the formal identification of \(\bar{{\textcircled {1}}}\) with \({\textcircled {1}}\), it is possible to store the elements of \({\mathbb {F}}\) as if they were grossnumbers and, consequently, to take advantage of the facilities provided by the Infinity Computer in accessing their sections and performing the four basic operations on them, according to the rules described in Sect. 2 (see, for example, (5) and (6)). For these reasons, in the sequel we shall use \({\textcircled {1}}\) in place of \(\bar{{\textcircled {1}}}\) when working on the Infinity Computer, even though, due to the finite nature of \(\bar{{\textcircled {1}}}\), the result of a given operation may not be in the form (12), so that a normalization procedure has to be considered. Hereafter, we report a few examples in order to elucidate this aspect. In all cases, a binary base has been adopted for data representation.

Addition Set \(t=3\) and \(T=2\) (three grossdigits each with four significant digits), and consider the sum of the two floating-point normalized numbers:

$$\begin{aligned} X&= 2^0 \cdot 1.11010101110, \\ Y&= 2^{-3} \cdot 1.11111001011. \end{aligned}$$

Table 1 summarizes the procedure by a sequence of commented steps.

Table 1 Scheme of the addition of two positive floating-point numbers

First of all, the two numbers are stored in memory by distributing their digits along the powers \({\textcircled {1}}^0\), \({\textcircled {1}}^{-1}\) and \({\textcircled {1}}^{-2}\) (step (a)). Before summing the two numbers, an alignment is performed to make the two exponents equal (step (b)). Notice that shifting to the right the digits of the second number causes a redistribution of the digits along the three mantissas. Step (c) performs a polynomial-like sum of the two numbers. The contribution of each term has to be consistently redistributed (step (d)), in order to take into account possible carry bits, and the three mantissas accordingly updated (step (e)). Steps (f) and (g) conclude the computation by normalizing and rounding the result.

Subtraction As usual, floating-point subtraction between two numbers sharing the same sign is performed by inverting the sign bit of the second number, converting its mantissa to 2’s complement, and then performing the addition as outlined above. It is well known that subtracting two close numbers may lead to cancellation issues. We consider an example where the accuracy may be dynamically changed in order to overcome ill-conditioning issues. We assume we work with the arithmetic obtained by setting \(t=7\) and \(T=3\) (four grossdigits, each consisting of eight binary digits, i.e., a byte) with truncation.

Table 2 Avoiding cancellation issues when evaluating the function \(f(x,y,z)=x+y+z\) for the input data in Example 2
Table 3 Avoiding cancellation issues when evaluating the function \(f(x,y,z)=x+y+z\) for the input data in Example 3

Loss of accuracy, resulting from a subtraction between two numbers having the same sign, will be detected during the normalization phase, and estimated when shifting the mantissa by a given number of bits. For example, if X and Y are the floating-point representations of two close positive real numbers x and y, and we see that \(X^{(0)}-Y^{(0)}=0.0\cdots 0d_k\cdots d_t\), then normalization requires shifting the radix point k places to the right. On the other hand, when the operation is performed in single, double, triple and quadruple precision, the relative errors in the result

$$\begin{aligned} E_r^{(q)}=\frac{|(X^{(q)}-Y^{(q)})-(x-y)|}{|x-y|}, \quad q=0,\dots ,3, \end{aligned}$$

for the four considered sections of X and Y are estimated as

$$\begin{aligned} E_r^{(q)} \sim \beta ^{-(q+1)t+k}, \quad q=0,\dots ,3. \end{aligned}$$

Consequently, the size k of the shift required to normalize the result gives us information about the size of the relative error.

Consider the simple problem of evaluating the function \(f(x,y,z)=x+y+z\) that computes the sum of three real numbers, and assume that the user requires a single precision accuracy in the result. In the examples below, we discuss three different situations.

Example 1

The three real numbers

$$\begin{aligned} x&= 2^{-1} \cdot 1.0001100000010111111001001110110\cdots , \\ y&= 2^0\;\;\cdot 1.0010101010110010110101001101011\cdots ,\\ z&= 2^0\;\;\cdot 1.1011011010111011011011010111001\cdots , \end{aligned}$$

are represented on the Infinity Computer as

$$\begin{aligned} X&= 2^{-1} \cdot ({\textcircled {1}}^0 1.0001100 +{\textcircled {1}}^{-1} 0.0001011\\&\qquad + {\textcircled {1}}^{-2} 0.1101010 + {\textcircled {1}}^{-3} 0.1101011), \\ Y&= 2^0 \cdot ({\textcircled {1}}^0 1.0010101 + {\textcircled {1}}^{-1} 0.1011001 \\&\qquad + {\textcircled {1}}^{-2} 1.0110110 + {\textcircled {1}}^{-3} 0.1101011), \\ Z&= 2^{0} \cdot ({\textcircled {1}}^0 1.1011011 + {\textcircled {1}}^{-1} 0.1011101\\&\qquad + {\textcircled {1}}^{-2} 1.0110110 + {\textcircled {1}}^{-3} 1.0111001). \end{aligned}$$

Since we are adding positive numbers, no control on the accuracy is needed here, and the result is yielded as

$$\begin{aligned} f(X,Y,Z)\approx X^{(0)}+Y^{(0)}+Z^{(0)}= 2^1 \cdot 1.1011011, \end{aligned}$$

with a relative error \(E^{(0)}\approx 1.1 \cdot 2^{-10}\), as is expected in single precision.

Example 2

Given the three real numbers defined in the previous example, we want now to evaluate \(f(x,y,-z)\) again requiring an eight-bit accuracy in the result. Table 2 shows the sequence of steps performed to achieve the desired result.

Table 4 Scheme of the multiplication of two floating-point numbers

The computation in single precision, as in the previous example, is described in step (a): it leads to a clear cancellation phenomenon and, once this is detected, the accuracy is improved by letting the \({\textcircled {1}}^{-1}\) terms enter into play (step (b)). However, the relative error remains higher than the prescribed tolerance, and the accuracy needs to be improved by also considering the \({\textcircled {1}}^{-2}\) terms. The computation is then repeated at step (c) and the correct result is finally achieved. Notice that, in performing steps (b) and (c), one can evidently exploit the work already carried out in the previous step. The overall procedure thus requires 6 additions/subtractions of grossdigits, the same that would be needed by directly working with a 24-bit register which, for this case, is the minimum accuracy requirement to obtain eight correct bits in the result. This means that no extra effort is introduced during the steps. As a further remark, we stress again that a parallelization through the steps is also possible, even though we will not discuss this issue here.

Example 3

We want to evaluate \(f(x,-y,-z)\) requiring an eight-bit accurate result, now choosing

$$\begin{aligned} \begin{array}{rclcl} x &{} =&{} 2^0 &{} \cdot &{} 1.0010101101010111111001001110110\cdots , \\ y &{} =&{} 2^0 &{} \cdot &{} 1.0010100010110010110101001101011\cdots ,\\ z &{} =&{} 2^{-7} &{} \cdot &{} 1.0101000011011110010010110001010\cdots . \end{array} \end{aligned}$$

Table 3 shows the sequence of steps performed to achieve the desired result for this case.

When working in single precision, an accuracy improvement is already needed when subtracting the first two terms \(X^{(0)}\) and \(Y^{(0)}\) and, consequently, step (a) is stopped. At step (b), the difference \(x-y\) is evaluated in double precision which, on balance, assures an eight-bit accuracy in the result. However, a new cancellation issue emerges when subtracting \(Z^{(0)}\) from \(X^{(1)}-Y^{(1)}\), suggesting that the two terms need to be represented more accurately. This is done in step (c), evaluating \(x-y\) in triple precision and representing z in double precision. The overall procedure requires 5 additions/subtractions of grossdigits. This example, compared with the previous one, reveals the coexistence of variables combined with different precisions.

Summarizing the three examples above, we observe how the accuracy of the representation and combination of variables may be dynamically changed, in order to overcome a possible loss of significant figures in the result when evaluating a function. Of course, for this strategy to work, it is necessary that the input data be stored with high precision and that a technique to detect the loss of accuracy be available. In Sect. 4 we will illustrate this procedure applied to the accurate determination of zeros of functions [a further example may be found in Amodio et al. (2020)].

Concerning the computational complexity, it should be noticed that Example 1 reflects the normal situation where the use of the standard precision is enough to produce a correct result, while Examples 2 and 3 highlight less frequent events.

Multiplication Set \(t=3\) and \(T=2\) (three grossdigits each with four significant digits). Consider the product of the two floating-point normalized numbers

$$\begin{aligned} X= & {} 2^0 \cdot 1.01101111100, \\ Y= & {} 2^{0} \cdot 1.10111111101. \end{aligned}$$

Table 4 summarizes the procedure by a sequence of commented steps.

After expanding the input data along the negative powers of \({\textcircled {1}}\) for data storage (step (a)), the convolution product described in (6) is performed (step (b)). At step (c), the contribution of each term is redistributed, and a sum is then needed to update the mantissas (step (d)). Steps (e) and (f) conclude the computation by normalizing and rounding the result. Notice that step (e) may be carried out by applying the rules for the addition described in Table 1. Again, we stress that the terms in the convolution product, as well as in the subsequent sum, may be computed in parallel.

Table 5 Newton iteration to compute the reciprocal of \(Y=2^0\cdot 1010\) on the Infinity Computer

Division The division of two floating-point numbers X and Y is reduced to the multiplication of X by the reciprocal of Y. The latter, in turn, is obtained with the aid of the Newton–Raphson method applied to find the zero of the function \(f(Z) = 1/Z - Y\). Hereafter, without loss of generality, we assume \(Y>0\). Starting from a suitable initial guess \(Z_0\), the Newton iteration then reads

$$\begin{aligned} Z_{k+1}=Z_k+Z_k(1-Y Z_k). \end{aligned}$$
(13)

The relative error

$$\begin{aligned} E_k:=\frac{1/Y-Z_k}{1/Y}=1-YZ_k \end{aligned}$$

satisfies

$$\begin{aligned} E_{k+1}= 1-YZ_{k+1}=1-2YZ_k+(YZ_k)^2=E_k^2, \end{aligned}$$
(14)

which means that, as is expected in the presence of simple zeros, the sequence \(Z_k\) eventually converges quadratically to 1/Y, and the number of correct figures doubles at each iteration. This feature makes the division procedure extremely efficient in our context, since the required accuracy may be easily increased to an arbitrary level. In order to obtain such a good convergence rate starting from the very beginning of the sequence, the numerator X and denominator Y are scaled by a suitable factor \(\beta ^s\) so that \({\widehat{Y}}:=\beta ^s Y\) lies in the interval \([0.5,\,1]\). In the literature, the minimax polynomial approximation is often used to estimate the reciprocal of \({\widehat{Y}}\) (see, for example, Habegger et al. (2010)). For the linear approximating polynomial, the resulting initial guess is

$$\begin{aligned} Z_0=\frac{48}{17}-\frac{32}{17}{\widehat{Y}}, \end{aligned}$$

which assures an initial error \(E_0\le 1/17\). Taking into account the equality (14), the relative error at step k decreases as

$$\begin{aligned} E_k = E_0^{2^k} \le \left( \frac{1}{17} \right) ^{2^k}, \end{aligned}$$

and consequently, assuming \(\beta =2\), a q-bit accurate approximation is obtained by setting

$$\begin{aligned} k = \left\lceil \log _2 \frac{q+1}{\log _2 17}\right\rceil , \end{aligned}$$

where \(\lceil \cdot \rceil \) denotes the ceiling function. As an example, four iterations suffice to get an approximation with at least 32 correct digits. Table 5 shows the sequence generated by the scheme above applied to find, on the Infinity Computer, the reciprocal of the binary number \(Y=(1010)_2\) (i.e., the decimal number 10, whose reciprocal is 1/10), under the choice \(t=3\) and \(T=7\) (eight grossdigits each with four significant figures).
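For illustration, the reciprocal scheme just described (scaling into \([0.5,\,1]\), minimax initial guess, Newton steps (13)) can be sketched in plain double precision as follows; the variable names are ours and this is not the Infinity Computer code itself.

```matlab
% Reciprocal of Y by the Newton iteration (13), in plain double precision.
Y = 10;                                  % the example above: (1010)_2
s = -ceil(log2(Y));                      % scale so that Yhat lies in [0.5, 1]
Yhat = 2^s * Y;
Z = 48/17 - 32/17*Yhat;                  % minimax linear initial guess, |E_0| <= 1/17
q = 32;                                  % required number of correct bits
kmax = ceil(log2((q + 1)/log2(17)));     % iteration count given in the text
for k = 1:kmax
    Z = Z + Z*(1 - Yhat*Z);              % Newton step (13): the error squares
end
recipY = 2^s * Z;                        % undo the scaling: recipY ~ 1/Y = 0.1
```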

3.3 Implementation details

We have developed a MATLAB prototype emulating the Infinity Computer environment interfaced with a module that performs the suitable carrying, normalization and rounding processes, needed to ensure proper functioning of the resulting dynamic floating-point arithmetic.

The emulator represents input real numbers using a set of binary grossdigits, whose length and number are defined by the two input parameters t and T. This latter parameter is used to define the maximum available accuracy for storing variables. In accord with formulae such as (5) and (6), the actual accuracy used to execute a single operation will depend on the accuracy of the two operands but cannot exceed T.

At the moment, the emulator implements the four basic operations following the strategies described above, plus some simple functions. The vectorization issue, to speed up the execution time associated with each floating-point operation, has not yet been addressed, so that all operations between grossdigits are executed sequentially.

All computations reported in the present paper, including the results presented in the next section, have been carried out on an Intel i5 quad-core computer with 16GB of memory, running MATLAB R2019b.

4 A numerical illustration

As an application highlighting the potentialities of the dynamic precision arithmetic introduced above, we consider the problem of determining accurate approximations of the zeros of a function \(f:[a,b]\rightarrow {\mathbb {R}}\), in the case where this problem suffers from ill-conditioning issues.

The finite arithmetic representation of the function f introduces perturbation terms of different nature: analytical errors, errors in the coefficients or parameters involved in the definition of the function, or roundoff errors introduced during its evaluation.

From a theoretical point of view, these sources of errors may be accounted for by introducing a perturbation function g(x) and analyzing its effects on the zeros of the perturbed function \({\tilde{f}}(x):= f(x)+\varepsilon g(x)\), where the factor \(\varepsilon \) has the size of the unit roundoff. Under regularity assumptions on f, if \(\alpha \in (a,b)\) is a zero with multiplicity \(d>0\), it turns out that \({\tilde{f}}(x)\) admits a perturbed zero \(\alpha +\delta \alpha \), with the perturbing term \(\delta \alpha \) satisfying, to first approximation,

$$\begin{aligned} |\delta \alpha | \approx \varepsilon ^{1/d} \left( d!\, \left| \frac{g(\alpha )}{f^{(d)}(\alpha )} \right| \right) ^{1/d}. \end{aligned}$$
(15)

As an example, consider the polynomial

$$\begin{aligned} p(x)=x^5 -5x^4+10x^3 -10x^2 +5x -1 \end{aligned}$$
(16)

that admits \(\alpha =1\) as unique root with multiplicity \(d=5\) (indeed \(p(x)=(x-1)^5\)). For this problem, from formula (15) we get

$$\begin{aligned} \left| \frac{\delta \alpha }{\alpha }\right| = |\delta \alpha | \approx \varepsilon ^{1/d} |g(1)|^{1/d}. \end{aligned}$$
(17)

Working with 64-bit IEEE arithmetic, i.e., with a roundoff unit \(u=2^{-53}\), we expect the relative error to stagnate at a level proportional to \(u^{1/5} \approx 6.4 \cdot 10^{-4}>0.5 \cdot 10^{-3}\), so that, assuming \(|g(1)|^{1/d}\approx 1\), the approximation of the zero \(\alpha \) only contains 3 or 4 correct figures.

This is confirmed by the two plots in Fig. 1. They display the relative error \(|x_k-\alpha |/|\alpha |=|x_k-1|\) of the approximations to \(\alpha \) generated by applying the Newton method to the problem \(p(x)=0\), choosing \(x_0=2\) as initial guess:

$$\begin{aligned} x_{k+1}=x_k-\frac{p(x_k)}{p'(x_k)}. \end{aligned}$$
(18)

The solid line refers to the implementation of the iteration on the Infinity Computer using \(t=52\) and \(T=0\). This choice mimics the default double precision arithmetic in MATLAB, which uses a 64-bit register to store a normalized binary number, 52 bits being dedicated to the (fractional part of the) mantissa. As a matter of fact, the dashed line, obtained from the implementation of the scheme in standard MATLAB arithmetic, precisely overlaps the solid line as long as the error decreases, while the two lines slightly depart from each other when they reach the saturation level right below \(10^{-3}\), namely starting from step 32.
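For reference, the dashed-line experiment can be reproduced with a few lines of standard MATLAB (double precision, with p and p' evaluated by Horner's rule); this is only the plain floating-point counterpart of the run on the Infinity Computer.

```matlab
% Newton iteration (18) on the polynomial (16) in IEEE double precision.
c = [1 -5 10 -10 5 -1];           % coefficients of p(x) in (16), descending powers
x = 2;                            % initial guess x_0
err = zeros(1, 60);
for k = 1:60
    p = c(1);  dp = 0;
    for j = 2:numel(c)            % Horner's rule for p(x) and p'(x)
        dp = dp*x + p;
        p  = p*x + c(j);
    end
    x = x - p/dp;                 % Newton step (18)
    err(k) = abs(x - 1);          % relative error, since alpha = 1
end
semilogy(err)                     % the error stagnates near u^(1/5), below 1e-3
```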

Fig. 1

Relative error related to the sequence of approximations generated by the Newton method applied to the polynomial (16). Solid line: implementation on the Infinity Computer with \(t=52\) and \(T=0\). Dashed line: implementation in MATLAB double precision arithmetic

Fig. 2

Relative error corresponding to the sequence of approximations generated by the Newton method applied to the polynomial (16) on the Infinity Computer. Solid line: dynamic precision implementation. Dashed line: fixed precision implementation, for different accuracies

We want now to improve the accuracy of the approximation to the zero \(\alpha =1\) of (16) by exploiting the new computational platform. Hereafter, the 53-bit precision used above will be referred to as single precision. The dashed lines in Fig. 2 show the relative error reduction when the Newton method is implemented on the Infinity Computer by working with multiple fixed precision. From top to bottom, we can see the five saturation levels corresponding to the stagnation of the error at \(E_1\approx 6.8\cdot 10^{-4}\) in single precision, \(E_2\approx 3.7\cdot 10^{-7}\) in double precision, \(E_3\approx 2.0 \cdot 10^{-10}\) in triple precision, \(E_4\approx 1.5 \cdot 10^{-13}\) in quadruple precision, and \(E_5\approx 6.7 \cdot 10^{-18}\) in quintuple precision. These saturation values are consistently predicted by formula (17), after replacing \(\varepsilon \) with \(2^{-53k}\), for \(k=1,\dots ,5\).

Now suppose we want 53 correct binary digits in the approximation (i.e., about 15 or 16 correct decimal digits). From the discussion above, it turns out that we have to activate the quintuple precision, thus setting \(t=52\) and \(T=4\) (five grossdigits, each consisting of a 53-bit register). However, the computational effort may be significantly reduced if we increase the accuracy by involving new negative grosspowers only when they are really needed. In a dynamic usage of the accuracy, starting from \(x_0\), we can initially activate the single precision mode until we reach the first saturation level and, thereafter, switch to double precision until the second saturation level is reached, and so forth, until we get the desired accuracy in the approximation. Denoting by

$$\begin{aligned} \hbox {err}(k) =\left| \frac{x_{k}-x_{k-1}}{x_k} \right| \end{aligned}$$

the estimated error at step k, and by prec the current precision, initially set equal to 1, the points where an increase of the accuracy is needed may be automatically detected by employing a simple control scheme such as

[figure b: precision control scheme]

where \(s\le 1\) is a positive safety factor that we have set equal to 1. Notice that we have here considered the estimated (rather than the exact) relative error \(\hbox {err}(k)\) to reflect the more general context, where the value of \(\alpha \) is indeed unknown.
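Since the control scheme itself appears only in the omitted figure, the function below is a plausible reconstruction, under our naming, of the test just described: the working precision is increased by one unit whenever the estimated error fails to decrease by at least the factor s (and the maximum precision \(T+1\) has not yet been reached).

```matlab
function prec = updatePrecision(prec, errk, errkm1, T, s)
% Plausible sketch (not the authors' exact code) of the precision control:
% errk = err(k), errkm1 = err(k-1); raise prec when the error stagnates.
if prec < T + 1 && errk > s*errkm1
    prec = prec + 1;
end
end
```

In the Newton iteration above, such a test would be invoked once per step, right after the estimated error err(k) has been evaluated.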

The solid line in Fig. 2 shows the corresponding reduction of the error and we can see that the change of precision scheme described above works quite well for this example, since all saturation levels are correctly detected and overcome. At step 162 the error reaches its minimum value of \(2.2 \cdot 10^{-16}\) and the iteration could be stopped by the standard criterion \(\hbox {err}(k)<10^{-15}\) even though, for clarity, we have generated additional points to reveal the last saturation level corresponding to prec\(=T+1=5\).

Now, let us compare the computational cost of the dynamic implementation versus the fixed quintuple-precision one, considering that to reach the highest precision each mode requires 162 Newton iterations (see Fig. 2). On the basis of the formula reported right below (6), the dynamic implementation would take about \(2.4 \cdot 10^3\) grossdigit multiplications, while the fixed quintuple-precision implementation requires \(2.0\cdot 10^4\) grossdigit multiplications. It follows that the former mode would reduce the execution times by a factor of at least eight with respect to the latter. Actually, it does much better: the dynamic usage of variables and operations, understood as the ability to handle variables with different accuracies and to execute operations on them, makes the resulting arithmetic much more efficient than what emerged from the comparison above.

Table 6 The Horner method for evaluating p(x) in (16) at \(x=2^0\cdot 1.0000000000000000000000000000000000000000000001000010\)

In carrying out the computation above, for the dynamic precision mode we have assumed that all floating-point operations were executed with the current selected precision. For example, under this assumption, the computational effort per step of the two modes would become equivalent starting from step 139 onwards since, at that step, the dynamic mode activates the quintuple precision to overcome the threshold level \(E_4\) in Fig. 2.

There is, however, one fundamental aspect that we have not yet considered. In fact, to overcome the ill-conditioning of the problem, the higher precision is only needed during the evaluation of \(p(x_k)\) and \(p'(x_k)\) in (18), while the single 53-bit precision is enough to handle the sequence \(x_k\). In other words, to minimize the overall computational effort, we may improve the accuracy only in the part of the code that implements the Horner rule to evaluate the polynomial p(x) and its derivative.

Interestingly, we do not have to instruct the Infinity Computer to switch between single and quintuple precision: all is done automatically and naturally and, more importantly, even during the evaluation of \(p(x_k)\) and \(p'(x_k)\), the transition from single to quintuple precision is gradual, in that all the intermediate precisions are actually involved only when really needed, which makes the whole machinery much more efficient.

To better elucidate this aspect, we illustrate the sequence produced by the Horner rule to evaluate \(p(x_k)\) at step \(k=145\), where the quintuple precision is activated. The first column in Table 6 reports the five steps of the Horner method applied to evaluate the polynomial p(x) in (16) at the floating-point single-precision number \(x=x_{145}\) (its value is in the caption of the table). The variable p is initialized with the leading coefficient of the polynomial, but is allowed to store five grossdigits, each 53 bits long, to host floating-point numbers up to quintuple precision. From the table we see that, as the iteration scheme proceeds, new negative grosspowers appear in the values taken by the variable p. More precisely, at step k of the Horner scheme the variable p stores a k-fold precision floating-point number, for \(k=1,\dots ,5\).

The increase in precision by one unit at each step evidently arises from the product \(p\cdot x\), since x remains a single-precision variable and no rounding occurs. Let us better examine what happens at the last step. The product \(p\cdot x\) generates a quintuple-precision number whose expansion along negative grosspowers matches the number 1 up to \({\textcircled {1}}^{-3}\). Consequently, the last operation \(p-1\) only contains significant digits in the coefficient of \({\textcircled {1}}^{-4}\) so that, after normalization, p will again store a single-precision number that can be consistently combined inside formula (18).

In conclusion, the Horner procedure, though being enabled to operate in quintuple precision, actually involves lower precision numbers, except at the very last step. The five steps reported in Table 6 require 15 multiplications of grossdigits, with a clear saving of time, if we consider that the fixed quintuple-precision mode would require 125 multiplications of grossdigits. Comparing the execution times in MATLAB over 162 steps, we found out that the dynamic-precision implementation is about 1.75 times slower than the single-precision implementation (which however stagnates at level \(E_1\)) and about 19 times faster than the quintuple precision mode, thus confirming the expected efficiency.

5 Conclusions

We have proposed a variable-precision floating-point arithmetic able to simultaneously store numbers and execute operations with different accuracies. This feature allows one to dynamically change the accuracy during the execution of a code, in order to overcome inherent ill-conditioning issues associated with a given problem. In this context, the Infinity Computer has been recognized as a natural computational environment that can easily host such an arithmetic. The assumption that makes this paradigm work is the mutual replacement of the two symbols \({\textcircled {1}}\) (grossone) and \(\bar{{\textcircled {1}}}\) (dark grossone) during the flow of a given computation. The latter, defined as \(\bar{{\textcircled {1}}}=\beta ^{\,t+1}\), is evidently a finite quantity for our numeral system but, in many respects, its reciprocal behaves as an infinitesimal-like entity in the numeral system induced by a floating-point arithmetic operating with \(t+1\) significant figures. In the same spirit as the Infinity Computer, it turns out that negative powers of \(\bar{{\textcircled {1}}}\) may be used as “lenses” to increase or decrease the accuracy when needed. An emulator of this dynamic precision floating-point arithmetic has been developed in MATLAB, and an application to the accurate solution of (possibly ill-conditioned) scalar nonlinear equations has been discussed.