Summing $\mu(n)$: a faster elementary algorithm

We present a new elementary algorithm that takes time $O_\epsilon\bigl(x^{3/5} (\log x)^{8/5+\epsilon}\bigr)$ and space $O\bigl(x^{3/10} (\log x)^{13/10}\bigr)$ (measured bitwise) for computing $M(x) = \sum_{n \le x} \mu(n)$, where $\mu(n)$ is the Möbius function. This is the first improvement in the exponent of $x$ for an elementary algorithm since 1985. We also show that it is possible to reduce space consumption to $O(x^{1/5} (\log x)^{5/3})$ by the use of the sieve of Helfgott (Math. Comput. 89:333–350, 2020), at the cost of letting time rise to the order of $x^{3/5} (\log x)^2 \log\log x$.


Introduction
There are several well-studied sums in analytic number theory that involve the Möbius function. For example, Mertens [14] considered $M(x) = \sum_{n \le x} \mu(n)$, now called the Mertens function. Based on numerical evidence, he conjectured that $|M(x)| \le \sqrt{x}$ for all $x$. His conjecture was disproved by Odlyzko and te Riele [16]. Pintz [17] made their result effective, showing that there exists a value of $x < \exp(3.21 \times 10^{64})$ for which $|M(x)| > \sqrt{x}$. It is still not known when $|M(x)| > \sqrt{x}$ holds for the first time; Dress [2] has shown that it cannot hold for $x \le 10^{12}$, and Hurst has carried out a verification up to $10^{16}$ [6]. Isolated values of $M(x)$ have been computed in [2] and in subsequent papers.
The analytic approach involves additional complications and has not been implemented. Moreover, in the range explored to date ($x \le 10^{22}$), elementary algorithms are faster in practice, at least for computing $\pi(x)$.
Since 1996, all work on these problems has centered on improving the implementation, with no essential improvements to the algorithm or to its computational complexity. The goal of the present paper is to develop a new elementary algorithm that is more time-efficient and space-efficient than the algorithm in [1]. We show:

Main Theorem. We can compute $M(x)$ in $O\bigl(x^{3/5} (\log x)^{3/5} (\log\log x)^{2/5}\bigr)$ word operations, time $O\bigl(x^{3/5} (\log x)^{8/5} (\log\log x)^{7/5}\bigr)$, and space $O\bigl(x^{3/10} (\log x)^{13/10} (\log\log x)^{-3/10}\bigr)$.
Space here is measured in bits, and time is measured in bit operations. "Word operations" (henceforth "operations") means arithmetic operations ($+$, $-$, $\cdot$, $/$, $\sqrt{\ }$) on integers of absolute value up to $x^{O(1)}$, as well as memory operations (read and write) in arrays of such integers with indices up to $x^{O(1)}$. Some of the literature (including both [1] and earlier versions of the present paper) counts time in terms of word operations; some (e.g., [13]) makes it clear that it counts bit operations.
Ours is the first improvement in the exponent of $x$ since 1985. Using our algorithm, we have been able to extend the work of Hurst and Kuznetsov, computing $M(x)$ for $x = 2^n$, $n \le 75$, and for $x = 10^n$, $n \le 23$. We expect that professional programmers who have access to significant computer resources will be able to extend this range further.

Our approach
The general idea used in all of the elementary algorithms ([12], [1], etc.) is as follows. One always starts with a combinatorial identity to break $M(x)$ into smaller sums. For example, a variant of Vaughan's identity allows one to rewrite $M(x)$ as follows:
$$M(x) = 2 M(\sqrt{x}) - \sum_{n \le x} \sum_{\substack{m_1 m_2 n_1 = n \\ m_1, m_2 \le \sqrt{x}}} \mu(m_1)\mu(m_2).$$
Swapping the order of summation, one can write
$$M(x) = 2 M(\sqrt{x}) - \sum_{m_1, m_2 \le \sqrt{x}} \mu(m_1)\mu(m_2) \left\lfloor \frac{x}{m_1 m_2} \right\rfloor.$$
The first term can be easily computed in $O(\sqrt{x} \log\log x)$ operations and space $O(x^{1/4})$, or else, proceeding as in [5], in $O(\sqrt{x} \log x)$ operations and space $O(x^{1/6} (\log x)^{2/3})$. To handle the subtracted term, the idea is to fix a parameter $v \le \sqrt{x}$, and then split the sum into two sums: one over $m_1, m_2 \le v$ and the other with $\max(m_1, m_2) > v$. The difference between the approach taken in the present paper and those that came before it is that our predecessors take $v = x^{1/3}$ and then compute the sum for $m_1, m_2 \le v$ in $O(v^2)$ operations. We will take our $v$ to be a little larger, namely, about $x^{2/5}$. Because we take a larger value of $v$, we have to treat the case with $m_1, m_2 \le v$ with greater care than [1] et al. Indeed, the bulk of our work will be in Sect. 4, where we show how to handle this case.
Our approach in Sect. 4 roughly amounts to analyzing the difference between reality and a model that we obtain via Diophantine approximation, in that we show that this difference has a simple description in terms of congruence classes and segments. This description allows us to compute the difference quickly, in part by means of table lookups.

Alternatives
In a previous draft of our paper, we followed a route more closely related to the main ideas in papers by Galway [3] and by the first author [5]. Those papers succeeded in reducing the space needed for implementing the sieve of Eratosthenes (or the Atkin–Bernstein sieve, in Galway's case) down to about $O(x^{1/3})$. In particular, [5] provides an algorithm for computing $\mu(n)$ for all successive $n \le x$ in $O(x \log x)$ operations and space $O(x^{1/3} (\log x)^{2/3})$, building on an approach from a paper of Croot, Helfgott, and Tao [19] that computes $\sum_{n \le x} \tau(n)$ in about $O(x^{1/3})$ operations. That approach is in turn related to Vinogradov's take on the divisor problem [20, Ch. III, exer. 3-6] (based on Voronoï).
The total number of word operations taken by the algorithm in the previous version of our paper was on the order of $x^{3/5} (\log x)^{8/5}$. Thus, the current version is asymptotically faster. If an unrelated improvement present in the current version (Algorithm 23; see Sect. 3) were introduced in the older version, the number of word operations would be on the order of $x^{3/5} (\log x)^{6/5} (\log\log x)^{2/5}$. We sketch the older version of the algorithm in Appendix A.
Of course, we could use [5] as a black box to reduce space consumption in some of our routines, while leaving everything else as it is in the current version. Time complexity would increase slightly, while space complexity would be much reduced. More precisely: using [5] as a black box, and keeping everything else the same, we could compute $M(x)$ in $O(x^{3/5} \log x)$ word operations (and hence time $O(x^{3/5} (\log x)^2 \log\log x)$) and space $O(x^{1/5} (\log x)^{5/3})$. We choose to focus instead on the version of the algorithm reflected in the main theorem; it is faster but less space-efficient.

Notation and algorithmic conventions
For $x \in \mathbb{R}$, we write $\lfloor x \rfloor$ for the largest integer $\le x$, and $\{x\}$ for $x - \lfloor x \rfloor$. Thus, $\{x\} \in [0, 1)$ no matter whether $x < 0$, $x = 0$, or $x > 0$.
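Since $\{x\}$ must land in $[0, 1)$ even for negative $x$, an implementation in integer arithmetic has to adjust the truncated division that C++ provides. The following small helper is ours, purely for illustration (the paper's own code is not reproduced here):

#include <cstdint>

// floor(a/b) for b > 0: C++ '/' truncates toward zero, so we adjust
// when a < 0 and the division is inexact.
int64_t floor_div(int64_t a, int64_t b) {
    int64_t q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

// Numerator of {a/b} = a/b - floor(a/b) over the denominator b; lies in [0, b).
int64_t frac_num(int64_t a, int64_t b) {
    return a - floor_div(a, b) * b;
}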
We write $\log_b x$ to mean the logarithm of $x$ to base $b$, not $\log \log \cdots \log x$ ($\log$ iterated $b$ times).
We will count space in bits. We will assume that the time it takes to multiply two $n$-bit numbers ($n > 1$) is $O(n \log n)$, as shown by [7]. (This is a more than reasonable assumption in practice, even if we use older algorithms. In all of our experiments, $n \le 128$; we could consider $n = 192$ or $n = 256$, but much larger $n$ would correspond to values of $x$ so large that an algorithm running in time $> x^{3/5}$ would not be practical.) We will also assume that accessing $O(\log x)$ consecutive bits in an array of length $\le x$ takes time $O(\log x)$.
All of the pseudocode for our algorithms appears at the end of this paper. We will keep track of the space and number of (word) operations used by each function. Total time (measured in bit operations) will be bounded by the number of word operations times $O(\log x \log\log x)$, since all of our arithmetic operations will be on integers of size $x^{O(1)}$ (or rationals with numerator and denominator bounded by $x^{O(1)}$), and all of our arrays will be of size much smaller than $x$. Since it may not be immediately clear that we cannot hope for a factor of $O(\log x)$ rather than $O(\log x \log\log x)$, we will point out two bottlenecks where the factor is indeed $O(\log x \log\log x)$. This is so because of multiplications, square roots and divisions; addition and memory access only impose a factor of $O(\log x)$.

Preparatory work: identities
We will start from the identity
$$\mu(n) = 2 \mu(n) 1_{n \le u} - \sum_{\substack{m_1 m_2 n_1 = n \\ m_1, m_2 \le u}} \mu(m_1)\mu(m_2), \qquad u = \sqrt{x},\ n \le x. \qquad (2.1)$$
This identity is simply the case $K = 2$ of Heath-Brown's identity for the Möbius function: for all $K \ge 1$, $n \ge 1$, and $u \ge n^{1/K}$,
$$\mu(n) = \sum_{k=1}^{K} (-1)^{k-1} \binom{K}{k} \sum_{\substack{m_1 \cdots m_k n_1 \cdots n_{k-1} = n \\ m_1, \ldots, m_k \le u}} \mu(m_1) \cdots \mu(m_k).$$
(See [8, (13.38)]; note, however, that there is a typographical error under the sum there: $m_1 \cdots m_k n_1 \cdots n_k = n$ should be $m_1 \cdots m_k n_1 \cdots n_{k-1} = n$.) Alternatively, we can derive (2.1) immediately from Vaughan's identity for $\mu$: that identity would, in general, have a term consisting of a sum over all decompositions $m_1 m_2 n_1 = n$ with $m_1, m_2 > u$, but that term is empty because $u^2 \ge x$. We sum over all $n \le x$, and obtain
$$M(x) = 2 M(u) - \sum_{m_1, m_2 \le u} \mu(m_1)\mu(m_2) \left\lfloor \frac{x}{m_1 m_2} \right\rfloor. \qquad (2.2)$$
Before we proceed, let us compare matters to the initial approach in [1]: Lemma 2.1 in [1] provides an identity that plays the same role as (2.2). (The first term $2 M(\sqrt{x})$ on the right side of (2.2) is easy to compute by a segmented sieve, or, in less space, as in [5].) Thus, we will be able to focus on the other term on the right side of (2.2). We can write, for any $v \le u$,
$$\sum_{m_1, m_2 \le u} \mu(m_1)\mu(m_2) \left\lfloor \frac{x}{m_1 m_2} \right\rfloor = \sum_{m_1, m_2 \le v} \mu(m_1)\mu(m_2) \left\lfloor \frac{x}{m_1 m_2} \right\rfloor + \sum_{\substack{m_1, m_2 \le u \\ \max(m_1, m_2) > v}} \mu(m_1)\mu(m_2) \left\lfloor \frac{x}{m_1 m_2} \right\rfloor. \qquad (2.4)$$
In this way, computing $M(x)$ reduces to computing the two double sums on the right side of (2.4).
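For illustration, the identity (2.2) can be checked by brute force for small $x$, with $u = \lfloor \sqrt{x} \rfloor$. The following check is ours and is not part of the algorithms of this paper:

#include <cstdio>
#include <vector>

int main() {
    const long long X = 5000;
    // mu[n] via a simple sieve (fine for a test).
    std::vector<int> mu(X + 1, 1);
    std::vector<bool> comp(X + 1, false);
    for (long long p = 2; p <= X; p++) {
        if (comp[p]) continue;                 // p is prime
        for (long long m = p; m <= X; m += p) { if (m > p) comp[m] = true; mu[m] = -mu[m]; }
        for (long long q = p * p; q <= X; q += p * p) mu[q] = 0;
    }
    for (long long x = 1; x <= X; x += 61) {
        long long u = 0; while ((u + 1) * (u + 1) <= x) u++;   // floor(sqrt(x))
        long long M = 0, Mu = 0, dsum = 0;
        for (long long n = 1; n <= x; n++) M += mu[n];         // M(x)
        for (long long n = 1; n <= u; n++) Mu += mu[n];        // M(sqrt(x))
        for (long long m1 = 1; m1 <= u; m1++)
            for (long long m2 = 1; m2 <= u; m2++)
                dsum += mu[m1] * mu[m2] * (x / (m1 * m2));
        if (M != 2 * Mu - dsum) { std::printf("mismatch at x=%lld\n", x); return 1; }
    }
    std::puts("identity (2.2) verified for sampled x");
    return 0;
}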

The case of a large non-free variable
Let us work on the second sum in (2.4) first. It is not particularly difficult to deal with; there are a few alternative procedures that would lead to roughly the same number of operations, and several others for which the number of operations would be larger only by a factor of $\log x$. Clearly, the second sum can be expressed in terms of the partial sums
$$S(m) = \sum_{r \le m} D(r; x/r) = 1 + \sum_{x/u < r \le m} D(r; x/r), \qquad \text{where } D(r; a) = \sum_{\substack{b \mid r \\ b \le a}} \mu(b),$$
since $D(r; x/r) = \sum_{b \mid r,\, b \le x/r} \mu(b) = \sum_{b \mid r} \mu(b)$ for $r \le \sqrt{x} = u$. Thus, to compute the resulting expression, it makes sense to let $n$ take the values $\lfloor u \rfloor, \lfloor u \rfloor - 1, \ldots, \lfloor v \rfloor + 1$ in descending order; as $n$ decreases, $x/n$ increases, and we compute $D(r; x/r)$, and thus $S(x/n)$, for increasing values of $r$.

Computing all values of $\mu(a)$ for $v < a \le u$ using a segmented sieve of Eratosthenes takes $O(u \log\log u)$ operations and space $O(\sqrt{u})$. The main question is how to compute $D(r; x/r)$ efficiently for all $r$ in a given segment. Using a segmented sieve of Eratosthenes, we can determine the set of prime divisors of all $r$ in an interval of the form $[y, y + \ell]$, $\ell \ge \sqrt{y}$, in $O(\ell \log\log y)$ operations and space $O(\ell \log y)$. We want to compute the sum $D(r; x/r) = \sum_{b \mid r,\, b \le x/r} \mu(b)$ for all $r$ in that interval. The naive approach would be to go over all divisors $b$ of all integers $r$ in $[y, y + \ell]$; since those integers have about $\log y$ divisors on average, doing so would take $O(\ell \log y)$ operations. Fortunately, there is a less obvious way to compute $D(r; x/r)$ using, on average, $O(\log\log y)$ operations. We will need a simple lemma on the anatomy of integers.

For a random integer $n \in (N, 2N]$ and $a = a(n) \in [A, 2A]$, we wish to bound the expected number of divisors of $n$ of the relevant kind. Applying an upper bound sieve followed by partial summation (the term $O(1)$ comes from $\sum_{a/K < d \le z a/K} 1/d$) and then Mertens' theorem, one sees that the bound carries a product over primes of size $\ll 1/\log z$. The number of divisors $d' \mid n$ with $p \mid d' \Rightarrow z^{1/2} < p \le z$ depends only on the part of $n$ supported on the primes in $(z^{1/2}, z]$; writing $\log P_z(n) = \sum_{p \mid n,\, p \le z} \log p$ and estimating the corresponding conditional expectation for $p \le z \le N$, one obtains the bound we need.

Proof. Algorithm 23 computes $D(n; a)$ recursively: it calls itself to compute $D(n_0; a)$ and $D(n_0; a/p_r)$, where $n_0 = p_1 p_2 \cdots p_{r-1}$, and then returns $D(n; a) = D(n_0; a) - D(n_0; a/p_r)$. The contribution of $D(n_0; a)$ is that of divisors $d \mid n$ with $p_r \nmid d$, whereas the contribution of $-D(n_0; a/p_r)$ corresponds to that of divisors $d \mid n$ with $p_r \mid d$.
Here it is evident that the algorithm gives the correct output in cases (1)-(2), whereas case (3) follows from $D(n; a) = \sum_{d \mid n,\, d \le a} \mu(d) = \sum_{d \mid n} \mu(d) = 0$ for $a \ge n$ and $n > 1$. We can see the recursion as traversing a recursion tree, with leaves corresponding to cases in which the algorithm terminates. (In the study of algorithms, trees are conventionally drawn with the root at the top.) The total number of operations is proportional to the number of nodes in the tree. If the algorithm were written to terminate only for $n = 1$, the tree would have $2^r$ leaves; as it is, the algorithm is written so that some branches terminate long before reaching depth $r$. We are to bound the average number of nodes of the recursion tree for inputs $N < n \le 2N$ and $a = a(n) \in [A, 2A]$.
Say we are at the depth reached after taking care of all $p_i$ with $p_i > z$. The branches that have survived correspond to the divisors $d \mid n$ with $p \mid d \Rightarrow p > z$, $d \le 2A$ and $d > A/P_z(n)$; we are to compute $D(P_z(n); a/d)$. (If $d > 2A$, then $a/d < 1$, and so our branch has terminated by case (1) above. If $d \le A/P_z(n)$, then $a/d \ge P_z(n)$, and we are in case (3).) Now we continue running the algorithm until we take care of all $p_i$ with $p_i > z^{1/2}$. On each branch that survived up to depth $p > z$, the nodes between that depth and depth $p > z^{1/2}$ correspond to square-free divisors $d' \mid n$ such that $p \mid d' \Rightarrow z^{1/2} < p \le z$.
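The following minimal sketch (ours; it follows the recursive strategy just described, not the paper's Algorithm 23 line by line) may help fix ideas. It peels off the largest remaining prime first, precisely so that the expensive early branching happens on large primes:

#include <cstdint>
#include <vector>

// D(n; a) = sum of mu(d) over divisors d <= a of n = p_1 ... p_r
// (distinct primes, stored in increasing order). Cases (1)-(3) are the
// terminating cases discussed above.
long long D(const std::vector<uint64_t>& primes, std::size_t r, uint64_t a) {
    if (a == 0) return 0;                        // case (1): a < 1
    if (r == 0) return 1;                        // case (2): n = 1
    unsigned __int128 n = 1;
    bool big = false;                            // does n exceed a?
    for (std::size_t i = 0; i < r && !big; i++) {
        n *= primes[i];
        if (n > a) big = true;
    }
    if (!big) return 0;                          // case (3): a >= n, so D = 0
    const uint64_t p = primes[r - 1];            // largest prime first
    // Divisors without p, minus (since mu(p) = -1) divisors of the form p*d'.
    return D(primes, r - 1, a) - D(primes, r - 1, a / p);
}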
Returning to the computation of the second sum: as $a$ decreases and $m = x/a$ increases, we may (and should) discard values of $S(m)$ and $D(r; x/r)$ that we no longer need, so as to keep space usage down.
We have thus shown that we can compute the second sum in (2.4) in $O((x/v) \log\log x)$ operations and space $O(\sqrt{x/v} \log(x/v))$. (One may take a little more space, though no more than $O(\sqrt{x/v} \log(x/v))$, if one decides to parallelize this summation procedure.) It is easy to see that, if we use the algorithm in [5, Main Thm.] instead of the classical segmented sieve of Eratosthenes, we can accomplish the same task in $O((x/v) \log x)$ operations and space $O((x/v)^{1/3} (\log(x/v))^{5/3})$.

Bitwise time bottleneck. Since our operations are all on integers $\le x$, each of our (word) operations involves $O(\log x \log\log x)$ bit operations, and so it is clear that our procedure takes $O((x/v) (\log x) (\log\log x)^2)$ bit operations. The question is whether one could do a little better.
The segmented sieve of Eratosthenes for factorization takes only $O((x/v) \log x \log\log x)$ bit operations. (In the final step, multiply small factors before large ones.) However, our procedure for computing $D(n; a(n))$ does take time proportional to $(x/v) \log x (\log\log x)^2$ in total. The reason is the following. Recall that, to keep the number of operations low, Algorithm 23 uses (and multiplies integers by) large primes before small ones. For $a$ a fixed power of $N$, a positive proportion of integers $n \sim N$ have prime factors between $a^{1/3}$ and $a^{2/3}$ (say). Those prime factors are found early on; they correspond to the first two or three levels of the recursion tree in the proof of Prop. 3.2. Then every node further down in the recursion tree involves a multiplication by a number of size at least $a^{1/3}$. That multiplication takes $\gg \log a^{1/3} \log\log a^{1/3} \gg \log N \log\log N$ bit operations, and here $\log N \asymp \log x$. Thus, in our current algorithm, the number of bit operations is, in fact, on the order of $(x/v) \log x (\log\log x)^2$.
A few words on the implementation. See Algorithm 2.
Choice of $\ell$. The size $\ell$ of the segments used by the sieve is to be chosen at the outset: $\ell \approx \sqrt{x/v}$ for the classical segmented sieve, or $\ell \approx (x/v)^{1/3} (\log(x/v))^{2/3}$ for the improved segmented sieve in [5, Main Thm.].

Memory usage. It is understood that calls such as $F \leftarrow \mathrm{SegFactor}(a_0, \ell)$ will result in freeing or reusing the memory previously occupied by $F$. (In other words, "garbage collection" will be taken care of by either the programmer or the language.)

Parallelization. Most of the running time is spent in function SArr (Algorithm 4), which is easy to parallelize: we can let each processor sieve a block of length $\ell$. Other than that, the issue of computing an array of sums $S$ (as in Algorithm 4) in parallel is a well-known problem (prefix sums), for which solutions of varying practical efficiency are known. We follow a common two-level algorithm: first, we divide the array into as many blocks as there are processing elements; then (level 1) we let each processing element compute, in parallel, an array of prefix sums for each block, ending with the total of the block's entries; then we compute prefix sums of these totals to create offsets; finally (level 2), we let each processing element add its block's offset to all elements of its block.
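In code, the two-level scheme looks roughly as follows (a sketch of ours, with hypothetical names; OpenMP, which our implementation uses, makes the structure explicit):

#include <algorithm>
#include <vector>
#include <omp.h>

void parallel_prefix_sums(std::vector<long long>& a) {
    const int T = omp_get_max_threads();
    const std::size_t n = a.size();
    const std::size_t block = (n + T - 1) / T;
    std::vector<long long> total(T + 1, 0);     // total[t+1] = sum of block t
    #pragma omp parallel num_threads(T)
    {
        const int t = omp_get_thread_num();
        const std::size_t lo = std::min(n, (std::size_t)t * block);
        const std::size_t hi = std::min(n, lo + block);
        for (std::size_t i = lo + 1; i < hi; i++) a[i] += a[i - 1];  // level 1
        if (lo < hi) total[t + 1] = a[hi - 1];
        #pragma omp barrier
        #pragma omp single
        for (int k = 1; k <= T; k++) total[k] += total[k - 1];       // offsets
        for (std::size_t i = lo; i < hi; i++) a[i] += total[t];      // level 2
    }
}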

The case of a large free variable
We now show how to compute the first double sum on the right-hand side of (2.4). That double sum equals
$$\sum_{m \le v} \sum_{n \le v} \mu(m)\mu(n) \left\lfloor \frac{x}{mn} \right\rfloor.$$
Note that, in [1], this turns out to be the easy case. However, they take $v = x^{1/3}$, while we will take $v \approx x^{2/5}$. As a result, we have to take much greater care with the computation to ensure that the runtime does not become too large.

A first try
We begin by splitting $[1, v] \times [1, v]$ into neighborhoods $U$ around points $(m_0, n_0)$. For simplicity, we will take these neighborhoods to be rectangles of the form $I_x \times I_y$ with $I_x = [m_0 - a, m_0 + a)$ and $I_y = [n_0 - b, n_0 + b)$. (In Sect. 5, we will partition the two intervals $[1, v]$ into intervals of the form $[x_0, (1 + \eta) x_0)$ and $[y_0, (1 + \eta) y_0)$, with $0 < \eta \le 1$ a constant. We will then specify $a$ and $b$ for given $x_0$ and $y_0$, and subdivide $[x_0, (1 + \eta) x_0) \times [y_0, (1 + \eta) y_0)$ into rectangles of those dimensions.) Applying a local linear approximation to the function $(m, n) \mapsto x/mn$ on each neighborhood yields
$$\frac{x}{mn} = c_0 + c_x (m - m_0) + c_y (n - n_0) + \mathrm{ET}_{\mathrm{quad}}(m, n), \qquad (4.1)$$
where $c_0 = x/m_0 n_0$, $c_x = -x/m_0^2 n_0$, $c_y = -x/m_0 n_0^2$, and $\mathrm{ET}_{\mathrm{quad}}(m, n)$ is a quadratic error term, defined by
$$\mathrm{ET}_{\mathrm{quad}}(m, n) := \frac{x}{mn} - c_0 - c_x (m - m_0) - c_y (n - n_0), \qquad (4.2)$$
that is, a term whose size is bounded in terms of the second derivatives of $x/mn$ on $U$. The quadratic error term will be small provided that $U$ is small. We will show how to choose $U$ optimally at the end of this section. The point of applying the linear approximation is that it will ultimately allow us to separate the variables in our sum. The one complicating factor is the presence of the floor function. If we temporarily ignore both the floor function applied to (4.1) and the quadratic error term, we can see very clearly how the linear approximation helps us. To wit,
$$\sum_{m \in I_x} \sum_{n \in I_y} \mu(m)\mu(n) \frac{x}{mn} \quad \text{is approximately equal to} \quad \sum_{m \in I_x} \sum_{n \in I_y} \mu(m)\mu(n) \bigl( c_0 + c_x (m - m_0) + c_y (n - n_0) \bigr),$$
and the latter sum splits into products of simple sums over $m$ alone and over $n$ alone.

Handling the difference between reality and an approximation
Proceeding as above, we can compute the sum
$$S_0 = \sum_{m \in I_x} \sum_{n \in I_y} \mu(m)\mu(n) L_0(m, n), \qquad L_0(m, n) := c_0 + c_x (m - m_0) + c_y (n - n_0),$$
in $O(a + b)$ operations and space $O(\log \max(x_0, y_0))$, given arrays with the values of $\mu(m)$ and $\mu(n)$. The issue is that $S_0$ is not the same as
$$S_1 = \sum_{m \in I_x} \sum_{n \in I_y} \mu(m)\mu(n) L_1(m, n), \qquad L_1(m, n) := \lfloor L_0(m, n) \rfloor,$$
and it is certainly not the same as the sum we actually want to compute, namely,
$$S_2 = \sum_{m \in I_x} \sum_{n \in I_y} \mu(m)\mu(n) L_2(m, n), \qquad L_2(m, n) := \left\lfloor \frac{x}{mn} \right\rfloor.$$
From now on, we will write $L_0$, $L_1$, $L_2$ for these weights; here $m_0$, $n_0$ and $x$ are understood to be fixed. Our challenge will be to show that the weights $L_2 - L_1$ and $L_1 - L_0$ actually have a simple form, simple enough that $S_2 - S_1$ and $S_1 - S_0$ can be computed quickly. We approximate $c_y$ by a rational number $a_0/q$ with $q \le Q = 2b$ such that $\delta := c_y - a_0/q$ satisfies $|\delta| \le 1/qQ$. Thus, $|\delta (n - n_0)| \le b/qQ = 1/2q$ for all $n \in I_y$. We can find such an $a_0/q$ in $O(\log Q)$ operations using continued fractions (see Algorithm 9).
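For concreteness, here is a sketch of ours, in the spirit of the paper's Algorithm 9 (which is not reproduced here), finding such an approximation from the continued-fraction convergents of a rational $c_y = \mathrm{num}/\mathrm{den}$:

#include <cstdint>
#include <utility>

// Best rational approximation a0/q to num/den (den >= 1, Q >= 1) with q <= Q:
// the last continued-fraction convergent with denominator <= Q satisfies
// |num/den - a0/q| <= 1/(q*Q). (Assumes no intermediate overflow.)
std::pair<int64_t, int64_t> dioph_approx(int64_t num, int64_t den, int64_t Q) {
    int64_t p0 = 0, q0 = 1, p1 = 1, q1 = 0;     // convergents h_{-2}, h_{-1}
    while (den != 0) {
        int64_t a = num / den;                  // need floor, not truncation:
        if (num % den != 0 && ((num < 0) != (den < 0))) a--;
        const int64_t p2 = a * p1 + p0, q2 = a * q1 + q0;
        if (q2 > Q) break;                      // next convergent too coarse
        p0 = p1; q0 = q1; p1 = p2; q1 = q2;
        const int64_t r = num - a * den;
        num = den; den = r;
    }
    return {p1, q1};
}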
Write $r_0 = r_0(m)$ for the integer such that the absolute value of
$$\beta := \left\{ \frac{x}{m n_0} \right\} - \frac{r_0}{q}$$
is minimal (and hence $\le 1/2q$). If there are two such values, choose the greater one. Then $-1/2q \le \beta < 1/2q$. We will later make sure that we choose our neighborhoods $I_x \times I_y$ so that $|\mathrm{ET}_{\mathrm{quad}}(m, n)| \le 1/2b$, where $\mathrm{ET}_{\mathrm{quad}}(m, n)$ is defined by (4.2). We also know that $\mathrm{ET}_{\mathrm{quad}}(m, n) > 0$, since the function $(m, n) \mapsto x/mn$ is convex. We are of course assuming that $I_x \times I_y$ is contained in the first quadrant, so that $(m, n) \mapsto x/mn$ is well-defined on it.
The aforementioned notation will be used throughout this section.
It follows immediately from Lemmas 4.1 and 4.2 that $L_1 - L_0$ admits the explicit two-term expression (4.13). Note that the first term on the right side of (4.13) depends only on $n \bmod q$ (and on $a_0 \bmod q$ and $r_0$), and the second term depends only on $n \bmod q$, $\mathrm{sgn}(n - n_0)$ and $\mathrm{sgn}(\delta)$ (and not on $r_0$; hence it is independent of $m$). Given the values of $\mu(n)$ for $n \in I_y$, it is thus easy to make a table of the corresponding sums $\sigma_r$, indexed by $r \in \mathbb{Z}/q\mathbb{Z}$. The next lemma characterizes when $L_2(m, n) > L_1(m, n)$.

Proof. The question is whether $L_2(m, n) > L_1(m, n)$. Since
$$-1/2q \le \beta < 1/2q \quad \text{and} \quad |\delta (n - n_0)| \le 1/2q, \qquad (4.14)$$
we know that $|\beta + \delta (n - n_0)| \le 1/q$, the bound following from (4.14). Hence, $L_2(m, n) > L_1(m, n)$ if and only if an explicit inequality holds. This, in turn, is equivalent to the quadratic inequality (4.16): $c_2 n^2 + c_1 n + c_0 > 0$, where $c_0 = x/m$, $c_2 = -a_0/q$ and $c_1$ is an explicit coefficient. Since $a_0/q$ is a Diophantine approximation to $c_y = -x/m_0 n_0^2 < 0$, it is clear that $a_0/q$ is non-positive. Consequently, if $q > 1$, then $a_0$ must be negative, since $a_0$ and $q$ are coprime. Hence, $c_2$ is positive, and so (4.16) holds iff $n \notin I$, where $I = (x_-, x_+)$ if the equation $c_2 n^2 + c_1 n + c_0 = 0$ has real roots $x_- \le x_+$, and $I = \emptyset$ otherwise.
Solving a quadratic equation is not computationally expensive; in practice, the function $n \mapsto \sqrt{n}$ generally takes roughly as much time to compute as a division. Thus it makes sense to count $n \mapsto \sqrt{n}$ as one (word) operation, like the four basic operations $+, -, \cdot, /$. Computing $\sqrt{n}$ takes $O(\log n \log\log n)$ bit operations, just like multiplication and division.
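As an illustration of how such a step can be carried out wholly in integer arithmetic (this sketch is ours and makes simplifying range assumptions; the paper's QuadIneqZ is not reproduced here), note that $g_2 n^2 + g_1 n + g_0 \le 0$ with $g_2 > 0$ holds exactly when $(2 g_2 n + g_1)^2 \le g_1^2 - 4 g_2 g_0$:

#include <cmath>
#include <cstdint>
#include <utility>

static uint64_t isqrt64(uint64_t s) {             // floor(sqrt(s))
    uint64_t r = (uint64_t)std::sqrt((double)s);
    while (r > 0 && r > s / r) r--;               // r^2 > s: step down
    while (r + 1 <= s / (r + 1)) r++;             // (r+1)^2 <= s: step up
    return r;
}

static int64_t fdiv(int64_t a, int64_t b) {       // floor(a/b), b > 0
    return a >= 0 ? a / b : -((-a + b - 1) / b);
}

// Integers n with g2*n^2 + g1*n + g0 <= 0 (g2 > 0). Returns {lo, hi};
// lo > hi means there are none. Assumes the discriminant fits in 64 bits
// (the paper's range requires 128-bit arithmetic).
std::pair<int64_t, int64_t> quad_ineq_z(int64_t g2, int64_t g1, int64_t g0) {
    __int128 disc = (__int128)g1 * g1 - (__int128)4 * g2 * g0;
    if (disc < 0) return {0, -1};                 // positive everywhere
    int64_t s = (int64_t)isqrt64((uint64_t)disc);
    int64_t lo = -fdiv(g1 + s, 2 * g2);           // ceil((-g1 - s)/(2 g2))
    int64_t hi = fdiv(s - g1, 2 * g2);            // floor((s - g1)/(2 g2))
    return {lo, hi};
}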
What we have to do now is keep a table of
$$\rho_{r, \le n'} = \sum_{\substack{n \in I_y,\ n \le n' \\ a_0 (n - n_0) \equiv r \bmod q}} \mu(n).$$
We need only consider values of $n'$ satisfying $a_0 (n' - n_0) \equiv r \bmod q$ (since $\rho_{r, \le n'} = \rho_{r, \le n''}$ for $n''$ the largest number $n \le n'$ with $a_0 (n - n_0) \equiv r \bmod q$). It is then easy to see that we can construct the table in $O(b)$ operations and space $O(b \log b)$, simply letting $n$ traverse $I_y$ from left to right. (In the end, we obtain $\rho_r$ for every $r \in \mathbb{Z}/q\mathbb{Z}$.) In the remaining lemmas, we show how to handle the cases where $a_0 (n - n_0) + r_0 \equiv 0 \pmod q$: for $(m, n) \in I_x \times I_y$ with $a_0 (n - n_0) + r_0 \equiv 0 \pmod q$, the comparison of $L_2$ and $L_1$ again reduces to a quadratic inequality, together with a linear condition.
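Building the table is a single left-to-right pass over $I_y$ (a sketch of ours, with illustrative names; the paper's own pseudocode appears at the end of the paper):

#include <cstdint>
#include <vector>

// rho[r] = sum of mu(n) over n in I_y = [n0 - b, n0 + b) with
// a0*(n - n0) ≡ r (mod q). mu[i] holds mu(n0 - b + i). O(b) operations;
// the prefix versions rho_{r,<=n'} can be kept in the same pass.
std::vector<int64_t> build_rho(const std::vector<int>& mu,
                               int64_t b, int64_t a0, int64_t q) {
    std::vector<int64_t> rho(q, 0);
    const int64_t step = ((a0 % q) + q) % q;          // a0 mod q
    int64_t r = (step * ((q - b % q) % q)) % q;       // a0*(-b) mod q
    for (int64_t i = 0; i < 2 * b; i++) {             // n = n0 - b + i
        rho[r] += mu[i];
        r += step; if (r >= q) r -= q;                // advance (n - n0) by 1
    }
    return rho;
}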

Lemma 4.6
Let $(m, n) \in I_x \times I_y$, and suppose $q = 1$. Then the set of $n$ with $L_2(m, n) > L_1(m, n)$ is the complement of an interval $I_j = (x_{-,j}, x_{+,j})$ determined by the real roots $x_{-,j} \le x_{+,j}$ of a quadratic equation with coefficients $\gamma_0 = xq$, $\gamma_2 = -a_0 m$ and an explicit middle coefficient $\gamma_1$ (with $I_j = \emptyset$ when there are no real roots). If $a_0 = 0$, then the expressions simplify further.

Proof. Just as in the proof of Lemma 4.5, the question is the sign of $\beta + \delta (n - n_0)$. If $\beta + \delta (n - n_0) < 0$, then $L_2(m, n) - L_1(m, n) > 0$ holds iff (4.18) holds; the term $\delta (n - n_0)$ cancels out, and so, by (4.6), we obtain that (4.18) holds iff the quadratic inequality of Lemma 4.5 holds. If $\beta + \delta (n - n_0) \ge 0$, then $L_2(m, n) - L_1(m, n) > 0$ holds iff (4.15) holds; again, the term involving $\delta (n - n_0)$ cancels out fully, and the condition reduces to an inequality of the same shape.

In summary: for a neighborhood $I_x \times I_y$ small enough that $|\mathrm{ET}_{\mathrm{quad}}(m, n)| \le 1/2b$, we need to prepare tables (in $O(b)$ operations and space $O(b \log b)$) and compute a Diophantine approximation (in $O(\log b)$ operations). Then, for each value of $m$, we need to (i) compute $r_0 = r_0(m)$, (ii) look up $\sigma_{r_0}$ in a table, (iii) solve a quadratic equation to account for the case $a_0 (n - n_0) + r_0 \equiv -1 \bmod q$, and (iv) solve a quadratic equation and also a linear equation to account for the case $a_0 (n - n_0) + r_0 \equiv 0 \bmod q$. If $q = 1$, then (iii) and (iv) are replaced by the simple task of computing the expressions in Lemma 4.6. In any event, there are a bounded number of tasks per $m$, each taking a bounded number of (word) operations. Thus, the computation over the neighborhood $I_x \times I_y$ takes in total $O(a + b)$ word operations and space $O(b \log b)$, given the values of $\mu(m)$ and $\mu(n)$.
Bitwise time bottleneck. It should be evident that tasks (i), (iii) and (iv) above each take time on the order of $\log x \log\log x$: they involve multiplications, divisions and square roots of integers $N$ with $\log N \asymp \log x$. Hence, the computation over $I_x \times I_y$ takes on the order of $(a + b) \log x \log\log x$ bit operations.

Parameter choice. Final estimates
What remains now is to choose our neighborhoods $U = I_x \times I_y$ optimally (within a constant factor), and to specify our choice of $v$. Recall that $I_x = [m_0 - a, m_0 + a)$, $I_y = [n_0 - b, n_0 + b)$.

Bounding the quadratic error term. Choosing a and b
We can use the formula for the error term bound in a Taylor expansion to obtain an upper bound on the error term. Since $f : (x, y) \mapsto X/xy$ is twice continuously differentiable for $x, y > 0$, we know that, for $(x, y)$ in any convex neighborhood $U$ of any $(x_0, y_0)$ with $x_0, y_0 > 0$,
$$f(x, y) = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0) + \mathrm{ET}_{\mathrm{quad}}(x, y),$$
where the Lagrange remainder term $\mathrm{ET}_{\mathrm{quad}}(x, y)$ is given by
$$\mathrm{ET}_{\mathrm{quad}}(x, y) = \tfrac{1}{2}\bigl( f_{xx}(\xi, \upsilon)(x - x_0)^2 + 2 f_{xy}(\xi, \upsilon)(x - x_0)(y - y_0) + f_{yy}(\xi, \upsilon)(y - y_0)^2 \bigr)$$
for some $(\xi, \upsilon) = (\xi(x, y), \upsilon(x, y)) \in U$ depending on $(x, y)$. Working with our neighborhood $U = I_x \times I_y$ of $(x_0, y_0) = (m_0, n_0)$, we obtain that, for $m \in I_x$ and $n \in I_y$, $|\mathrm{ET}_{\mathrm{quad}}(m, n)|$ is at most
$$X \left( \frac{(m - m_0)^2}{(m^*)^3 n^*} + \frac{|m - m_0|\,|n - n_0|}{(m^*)^2 (n^*)^2} + \frac{(n - n_0)^2}{m^* (n^*)^3} \right),$$
where $m^* = \min_{(m, n) \in U} m$ and $n^* = \min_{(m, n) \in U} n$. Hence, by Cauchy-Schwarz,
$$|\mathrm{ET}_{\mathrm{quad}}(m, n)| \le \frac{3 X}{2} \left( \frac{(m - m_0)^2}{(m^*)^3 n^*} + \frac{(n - n_0)^2}{m^* (n^*)^3} \right).$$
(From now on, we will write $x$, as we are used to, instead of $X$, since there is no longer any risk of confusion with the variable $x$.) Recall that we need to choose $I_x$ and $I_y$ so that $\mathrm{ET}_{\mathrm{quad}} \le 1/2b$. Since $(m - m_0)^2 \le a^2$ and $(n - n_0)^2 \le b^2$, it is enough to require that
$$\frac{x a^2}{(m^*)^3 n^*} \le \frac{1}{6b} \quad \text{and} \quad \frac{x b^2}{m^* (n^*)^3} \le \frac{1}{6b}.$$
In turn, these conditions hold for
$$b = \left\lfloor \left( \frac{m^* (n^*)^3}{6 x} \right)^{1/3} \right\rfloor, \qquad a = \left\lfloor \left( \frac{(m^*)^3 n^*}{6 x b} \right)^{1/2} \right\rfloor.$$
More generally, if we are given that $m \ge A$, $n \ge B$ for some $A$, $B$, we see that we can set
$$b = \left\lfloor \left( \frac{A B^3}{6 x} \right)^{1/3} \right\rfloor, \qquad a = \left\lfloor \left( \frac{A^3 B}{6 x b} \right)^{1/2} \right\rfloor \asymp \left( \frac{A^4}{x} \right)^{1/3}. \qquad (5.2)$$

At the end of Sect. 4, we showed that it takes $O(a + b)$ operations and space $O(b \log b)$ for our algorithm to run over each neighborhood $I_x \times I_y$. Recall that we are dividing $[1, v] \times [1, v]$ into dyadic boxes (or, at any rate, boxes of the form $B(A, B, \eta) = [A, (1 + \eta) A) \times [B, (1 + \eta) B)$, where $0 < \eta \le 1$ is a constant) and that these boxes are divided into neighborhoods $I_x \times I_y$. We have $\asymp AB/ab$ neighborhoods $I_x \times I_y$ in the box $B(A, B, \eta)$. Thus, assuming that $A \ge B$, it takes
$$O\left( \frac{AB}{ab}(a + b) \right) = O\left( \frac{AB}{b} + \frac{AB}{a} \right) = O\left( A^{2/3} x^{1/3} \right)$$
operations to run over this box, using the values of $a$ and $b$ in (5.2). Now, we will need to sum over all boxes $B(A, B, \eta)$. Each $A$ is of the form $(1 + \eta)^i$ and each $B$ is of the form $(1 + \eta)^j$ for $1 \le (1 + \eta)^i, (1 + \eta)^j \le v$. By symmetry, we may take $j \le i$, that is, $A \ge B$. Summing over all boxes takes $O(v^{2/3} x^{1/3} \log x)$ operations.

We tacitly assumed that $a \ge 1$, $b \ge 1$, and so we need to handle the case of $a < 1$ or $b < 1$ separately, by brute force. It actually makes sense to treat the broader case of $a < C$ or $b < C$ by brute force, where $C$ is a constant of our choice. The cost of brute-force summation for $(m, n)$ with $n \le m \ll (C^3 x)^{1/4}$ (as is the case when $a < C$) is $O(C^{3/2} \sqrt{x})$, whereas the cost of brute-force summation for $(m, n)$ with $m \le v$, $n \ll (6x/m)^{1/3}$ (as is the case when $b < C$) is $O(C x^{1/3} v^{2/3})$. Lastly, we need to take into account the fact that we had to pre-compute a list of values of $\mu$ using a segmented sieve (Algorithm 20), which takes $O(v^{3/2} \log\log x)$ operations and space $O(\sqrt{v} \log\log x)$. Putting everything together, we see that the large free variable case (Sect. 4) takes
$$O\left( v^{2/3} x^{1/3} \log x + v^{3/2} \log\log x \right) \text{ operations and space } O\left( \sqrt{v} \log\log x + (v^4/x)^{1/3} \log x \right),$$
where the space bound comes from substituting $b = \sqrt[3]{m^* (n^*)^3 / 6x}$ into the space estimate $O(b \log b)$ that we had for each neighborhood and adding the result to the space bound from the segmented sieve.
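In code, the choice (5.2) is only a couple of lines (a sketch of ours; floating point is used for illustration, whereas an actual implementation would stay in integer arithmetic, as discussed in Sect. 6):

#include <cmath>
#include <cstdint>
#include <utility>

// Half-widths (a, b) for a box [A,(1+eta)A) x [B,(1+eta)B), following (5.2).
// Returns {0, 0} when the box falls into the brute-force regime (a or b < 1).
std::pair<int64_t, int64_t> choose_ab(double A, double B, double x) {
    int64_t b = (int64_t)std::cbrt(A * B * B * B / (6.0 * x));
    if (b < 1) return {0, 0};
    int64_t a = (int64_t)std::sqrt(A * A * A * B / (6.0 * x * (double)b));
    if (a < 1) return {0, 0};
    return {a, b};
}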

Choice of v. Total time and space estimates
The case of a large non-free variable (Sect. 3) takes $O((x/v) \log\log x)$ operations and space $O(\sqrt{x/v} \log(x/v))$. On the other hand, as we just showed, the case of a large free variable (Algorithm 5) takes $O(v^{2/3} x^{1/3} \log x + v^{3/2} \log\log x)$ operations and space $O(\sqrt{v} \log\log x + (v^4/x)^{1/3} \log x)$. Thus, in order to minimize our number of operations, we set the two time bounds equal to one another and solve for $v$, yielding $v = c \cdot x^{2/5} (\log\log x)^{3/5}/(\log x)^{3/5}$ for a constant $c > 0$. We already explained (at the end of Sects. 3 and 4) that one cannot really hope for a factor better than $O(\log x \log\log x)$ here, given our current algorithm. The constant $c$ can be fine-tuned by the user or programmer. It is actually best to set it so that the times taken by the case of a large free variable and by the case of a large non-free variable are within a constant factor of each other, without necessarily being approximately equal.
If we were to use [5] to factor integers in SArr (Algorithm 4), then LargeNonFree (Algorithm 2) would take $O((x/v) \log x)$ operations and space $O((x/v)^{1/3} (\log(x/v))^{5/3})$. It would then be best to set $v = c \cdot x^{2/5}$ for some constant $c$, leading to $O(x^{3/5} \log x)$ operations in total and total space $O(x^{1/5} (\log x)^{5/3})$.

Implementation details
We wrote our program in C++ (though mainly, in effect, in plain C). We used GMP (the GNU multiple-precision library) for a few operations, but relied mainly on 64-bit and 128-bit integer arithmetic. Some key procedures were parallelized by means of OpenMP pragmas.
Basics on better sieving. Let us first go over two well-known optimization techniques. The first one is useful for sieving in general; the second one is specific to the use of sieves to compute μ(n).
(1) When we sieve (function SegPrimes, SegMu or SegFactor), it is useful to first compute how our sieve affects a segment of length $M = 2^3 \cdot 3^2 \cdot 5 \cdot 7 \cdot 11$, say. (For instance, if we are sieving for primes, we compute which elements of $\mathbb{Z}/M\mathbb{Z}$ lie in $(\mathbb{Z}/M\mathbb{Z})^*$.) We can then copy that segment onto our longer segment repeatedly, and then start sieving by primes and prime powers not dividing $M$.

(2) As is explained in [9] and [6], and for that matter in [4, §4.5.1]: in function SegMu, for $n \le x_0 = n_0 + \ell$, we do not actually need to store the product $j = \prod_{p \le \sqrt{x_0}:\, p \mid n} p$; it is enough to store $S_j = \sum_{p \le \sqrt{x_0}:\, p \mid n} \lfloor \log_4 p \rfloor$. The reason is that (as can be easily checked) $j < \prod_{p \mid n} p$ if and only if $S_j < \lfloor \log_4 n \rfloor$. In this way, we use space $O(\ell \log\log x_0)$ instead of space $O(\ell \log x_0)$. We also replace many multiplications by additions; in exchange, we need to compute $\lfloor \log_4 p \rfloor$ and $\lfloor \log_4 n \rfloor$, but that takes very little time, as it only involves counting the space occupied by $p$ or $n$ in base 2, and that is a task that a processor can usually accomplish extremely quickly. Technique (2) here is not essential in our context, as SegMu is not a bottleneck, whether for time or for space. It is more important to optimize factorization, as we are about to explain.
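For orientation, here is a plain segmented computation of $\mu$ (a sketch of ours: it stores the cofactor array explicitly, rather than the $\lfloor \log_4 \cdot \rfloor$ sums of technique (2), which is the space refinement described above):

#include <cstdint>
#include <vector>

// mu(n) for n in [lo, lo+len), lo >= 2; `primes` must contain all primes
// up to sqrt(lo+len-1).
std::vector<int8_t> seg_mu(uint64_t lo, uint64_t len,
                           const std::vector<uint64_t>& primes) {
    std::vector<int8_t> mu(len, 1);
    std::vector<uint64_t> rem(len);             // n with found primes divided out
    for (uint64_t i = 0; i < len; i++) rem[i] = lo + i;
    for (uint64_t p : primes) {
        for (uint64_t m = (lo + p - 1) / p * p; m < lo + len; m += p) {
            uint64_t i = m - lo;
            mu[i] = -mu[i];
            rem[i] /= p;
            if (rem[i] % p == 0) mu[i] = 0;     // p^2 | n: mu(n) = 0
        }
    }
    for (uint64_t i = 0; i < len; i++)
        if (rem[i] > 1) mu[i] = -mu[i];         // leftover prime factor > sqrt
    return mu;
}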
Factorizing via a sieve in little space. We wish to store the list of prime factors of a positive number $n$ in at most twice as much space as it takes to store $n$. We can do so simply and rapidly as follows. We initialize $a_n$ and $b_n$ to 0. When we find a new prime factor $p$, we reset $a_n$ to $2^k a_n + 2^{k-1}$, where $k = \lfloor \log_2 p \rfloor$, and $b_n$ to $2^k b_n + p - 2^k$. In the end, we obtain, for example, $a_n = 111010_2$ and $b_n = 010111_2$ for $n = 2 \cdot 3 \cdot 5 \cdot 7 = 210$. We can easily read the list of prime factors 2, 3, 5, 7 of $n = 2 \cdot 3 \cdot 5 \cdot 7$ from $a_n$ and $b_n$, whether in ascending or in descending order: we can see $a_n$ as marking where each prime in $b_n$ begins, as well as providing the leading 1: $2 = 10_2$, $3 = 11_2$, $5 = 101_2$, $7 = 111_2$.
The resulting savings in space lead to a significant speed-up in practice, due no doubt in part to better cache usage. The bitwise operations required to decode the factorization of n are very fast, particularly if one is willing to go beyond the C standard; we used instructions available in gcc (__builtin_clzl, __builtin_ctzl).
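A sketch of the encoding and of the decoding loop (ours, for illustration):

#include <cstdint>
#include <cstdio>

// Append prime p (found in increasing order) to the packed pair (a, b).
void add_prime(uint64_t& a, uint64_t& b, uint64_t p) {
    const int k = 63 - __builtin_clzll(p);      // k = floor(log2 p) >= 1
    a = (a << k) | (1ULL << (k - 1));           // field marker: 1, then k-1 zeros
    b = (b << k) | (p - (1ULL << k));           // low k bits of p
}

// Print the primes in descending order (the last one appended comes first).
// E.g. for n = 210: a = 111010_2, b = 010111_2; this prints "7 5 3 2".
void decode(uint64_t a, uint64_t b) {
    while (a != 0) {
        const int k = __builtin_ctzll(a) + 1;   // field length
        const uint64_t p = (1ULL << k) | (b & ((1ULL << k) - 1));
        std::printf("%llu ", (unsigned long long)p);
        a >>= k; b >>= k;
    }
}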
Implementing the algorithm in integer arithmetic. Manipulating rationals is time-consuming in practice, even if we use a specialized library. (Part of the reason is the frequent need to reduce fractions $a/b$ by taking the gcd of $a$ and $b$.) It is thus best to implement the algorithm, in particular procedure SumByLin and its subroutines, using only integer arithmetic. Doing so also makes it easier to verify that the integers used all fit in a certain range ($|n| < 2^{127}$, say), and of course also helps them fit in that range, in that we can simplify fractions before we code: $(a/bc)/(d/bf)$ (say) becomes $af/cd$, represented by the pair of integers $(af, cd)$.
Square roots and divisions. On typical current 64-bit architectures, a division takes as much time as several multiplications, and a square root takes about as much time as a division. (These are obviously crude, general estimates.) Here, by "taking a square root" of $x$ we mean computing the representable number closest to $\sqrt{x}$, or the largest representable number no larger than $\sqrt{x}$, where "representable" means "representable in extended precision", that is, as a number $2^e n$ with $|n| < 2^{128}$ and $e \in [-(2^{14} - 1), 2^{14} - 1]$. Incidentally, one should be extremely wary of using hardware implementations of any floating-point operations other than the four basic operations and the square root; for instance, an implementation of exp can give a result that is not the representable number closest to $\exp(x)$ for given $x$. Fortunately, we do not need to use any floating-point operations other than the square root. The IEEE 754 standard requires that the square root be implemented correctly, that is, that the operation return the representable number closest to $\sqrt{x}$, or the largest representable number $\le \sqrt{x}$, or the smallest such number $\ge \sqrt{x}$, depending on how we set the rounding mode. We actually need to compute $\sqrt{n}$ for $n$ a 128-bit integer. (We can assume that $n < 2^{125}$, say.) We do so by combining a single iteration of the procedure in [21] (essentially Newton's method) with a hardware implementation of a floating-point extended-precision square root in the sense we have just described.
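In outline (a sketch of ours, simplified: a short exact correction loop stands in for the careful rounding analysis of [21]):

#include <cmath>
#include <cstdint>

typedef unsigned __int128 u128;

// floor(sqrt(n)) for n < 2^125: hardware extended-precision sqrt as a seed,
// one Newton step, then an exact correction.
uint64_t isqrt128(u128 n) {
    if (n == 0) return 0;
    u128 r = (u128)std::sqrt((long double)n);   // seed (80-bit long double)
    if (r == 0) r = 1;
    r = (r + n / r) >> 1;                       // one Newton iteration
    while (r > n / r) r--;                      // force r^2 <= n
    while (r + 1 <= n / (r + 1)) r++;           // and (r+1)^2 > n
    return (uint64_t)r;
}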
It is of course in our interest to keep the number of divisions (and square-roots) we perform as low as possible; keeping the number of multiplications small is of course also useful. Some easy modifications help: for instance, we can conflate functions Special1 and Special0B into a single procedure; the value of γ 1 in the two functions differs by exactly m.
Parallelization. We parallelized the algorithm at two crucial places: one is function SArr (Algorithm 4), as we already discussed at the end of Sect. 3; the other one is function DDSum (Algorithm 7), which involves a double loop. The task inside the double loop (that is, DoubleSum or BruteDoubleSum) is given to a processing element to compute on its own. How exactly the double loop is traversed and parcelled out is a matter that involves not just the usual trade-off between time and space but also a possible trade-off between either and efficiency of parallelization.
More specifically: it may be the case that the number of processing elements is greater than the number of iterations of either loop ($(A' - A)/\ell$ and $(B' - B)/\ell$, respectively), but smaller than the number of iterations of the double loop. In that case, parallelizing only the inside loop or the outside loop leads to an under-utilization of processing elements. One alternative is a naive parallelization of the double loop, with each processing element recomputing the arrays $\mu$, $\mu'$ that it needs. That actually turns out to be a workable solution: while recomputing arrays in this way is wasteful, the overall time complexity does not change, and the total space used is $O(\nu \ell \log\log \max(A', B'))$, where $\nu$ is the number of threads; this is slightly less space than $\nu$ instances of SumByLin use anyhow.
The alternative of computing and storing the whole arrays $\mu$, $\mu'$ before entering the double loop would allow us not to recompute them, but it would lead to using (shared) memory on the order of $\max(A' - A, B' - B)$, and, in our tests, to a significantly worse running time, at least for $x = 10^{19}$; we therefore went with the "workable solution" above. In the end, what is best may depend on the parameter range and number of threads one is working with.

Numerical results
We computed $M(x)$ for $x = 10^n$, $n \le 23$, and $x = 2^n$, $n \le 75$, beating the records in [9] and [6]. Our results are the same as theirs, except that we obtain a sign opposite to that in [9, Table 1] for $x = 10^{21}$; presumably [9] contains a transcription mistake. We also ran our code for $x = 2^n$, $68 \le n \le 75$, on a 128-core machine based on two AMD EPYC 7702 (2 GHz) processors. The results were of course the same as on the first computer, but running time scaled more poorly, particularly when passing from $2^{73}$ to $2^{74}$. (For whatever reason, the program gave up on $x = 2^{75}$ on the second computer.) The percentage of total time taken by the case of a large non-free variable was also much larger than on the first computer, and went up from $2^{73}$ to $2^{74}$. The reason for the difference in running times on the two computers presumably lies in the differences between their respective memory architectures. The dominance (on the second computer) of the case of a large non-free variable, whose usage of sieves is the most memory-intensive part of the program, supports this diagnosis. It would then be advisable, for the sake of reducing running times in practice, to improve on the memory usage of that part of the program, either replacing SegFactor by the improved sieve in [5] (sharply reducing memory usage at the cost of increasing the asymptotic running time slightly, as we have discussed) or using a cache-efficient implementation of the traditional segmented sieve as in [15, Algorithm 1.2]. These two strategies could be combined.
Thus, our implementation should give a correct result for $x = 10^{23}$, for the choice $c = 9/8$. One can obviously go farther by using wider (or arbitrary-precision) integer types.
There is another integer that might seem to be possibly larger, namely, the discriminant $\gamma_1^2 - 4 \gamma_0 \gamma_2$ in the quadratic equations solved in QuadIneqZ, which is called by functions Special1 and Special0B. However, that discriminant is smaller than it looks at first. The coefficient $\gamma_1$ in Special0B is a sum of four terms, of which the second is negligible compared to the first, and the third is negligible compared to the fourth.
Estimating each term, we see that the dominant one is $2 (c/6)^{1/3} x^{4/5} ((\log\log x)/\log x)^{1/5}$. The coefficient $\gamma_1$ in Special1 is equal to the one we just considered, minus $m$, and thus has the same dominant term.
As for the term $-4ac$ (or $-4\gamma_0\gamma_2$, so as not to conflict with the other meanings of $a$ and $c$ here), it equals 4 times a product that we can bound term by term. Since $mx \le vx \le c x^{7/5} (\log\log x)^{3/5}/(\log x)^{3/5}$, we see that the main term here is at most a bounded multiple of $x^{8/5} ((\log\log x)/\log x)^{2/5}$. Since the two expressions we have just considered have opposite sign, we conclude that the main term in the discriminant $\gamma_1^2 - 4 \gamma_0 \gamma_2$ is at most $(16 c^{2/3}/6^{2/3}) x^{8/5} (\log\log x)^{2/5}/(\log x)^{2/5}$, that is, considerably smaller than the term in (7.1), at least for $x$ larger than a constant. For $c = 3/2$ and $x = 2^{75}$, the quantities involved remain well below $2^{127}$, and thus we are out of danger of overflow for those parameters as well.
Part of the improvement over that older algorithm resides in a better (yet simple) procedure for computing sums of the form $\sum_{d \mid n:\, d \le a} \mu(d)$ (see Algorithm 23); we analyzed it in Sect. 3. Other than that, the difference lies mainly in the computation of the sum of $\mu(m)\mu(n) \lfloor x/mn \rfloor$ for $(m, n)$ in a neighborhood $U = I_x \times I_y$ (see Sect. 4.2 and Algorithm 11). Let us use the notation of Sect. 4.2. In particular, write $I_x = [m_0 - a, m_0 + a)$, $I_y = [n_0 - b, n_0 + b)$. We have sums $S_0$, $S_1$, $S_2$, where $S_0$ is easy to compute and $S_2$ is the sum that we actually want to determine.
In the algorithm given in the current version of the paper, we compute the differences $S_1 - S_0$ and $S_2 - S_1$ exactly. The crux is how to compute $S_2 - S_1$. In the current version, we analyze this difference with great care, after having determined the (at most) two arithmetic progressions in which the non-zero terms of $S_2 - S_1$ must be contained. In the older version, we determined those arithmetic progressions in the same way as here (namely, by finding a Diophantine approximation $a_0/q$ to $c_y$). Within those progressions, however, we did not establish precisely what the non-zero terms were, but simply showed that they had to be contained in an interval $I \subset I_y$. We also showed that, for $q$ small, the interval $I$ had to be small as well, at least on average. (The number of elements of an arithmetic progression modulo $q$ within $I_y$ is $O(b/q)$, and so the case of $q$ large is not the main worry.) It is here that the argument in [20, Ch. III, exer. 3-6] came in handy: as we move from neighborhood to neighborhood, the quantity $c_y$ keeps changing at a certain moderate speed, monotonically; thus, $c_y \bmod \mathbb{Z}$ cannot spend too much time in major arcs on the circle $\mathbb{R}/\mathbb{Z}$. Only when $c_y \bmod \mathbb{Z}$ lies in the major arcs can $q$ be small and the interval $I$ be large. Thus, just as claimed, the case of $q$ small and $I$ large occurs for few neighborhoods.
We can thus simply determine $I$, compute the terms that lie in the intersection of either of those two arithmetic progressions and their corresponding intervals $I$, and sum those terms. The number of operations will be about $O(ab/q)$, unless $q$ is small, in which case one can do better, viz., $O(a|I|/q)$ or so. (Compare with the corresponding bound for the newer algorithm, namely, $O(a + b)$.) On average, we obtained savings of a factor of $O((\log b)/b)$, rather than $O(1/b)$, as we do now.
Whether or not we use [5] to factor integers $n \le x/v$, we set $v = c x^{2/5}/(\log x)^{3/5}$, for $c$ a constant of our choice.