Computing Generalized Convolutions Faster Than Brute Force

In this paper, we consider a general notion of convolution. Let D be a finite domain and let D^n be the set of n-length vectors (tuples) over D. Let f : D × D → D be a function and let ⊕_f be the coordinate-wise application of f. The f-Convolution of two functions g, h : D^n → {−M, …, M} is

    (g ⊛_f h)(v) := Σ_{v_g, v_h ∈ D^n : v = v_g ⊕_f v_h} g(v_g) · h(v_h)

for every v ∈ D^n. This problem generalizes many fundamental convolutions, such as Subset Convolution, XOR Product, Covering Product, and Packing Product. For an arbitrary function f and domain D we can compute the f-Convolution via brute-force enumeration in Õ(|D|^{2n} · polylog(M)) time. Our main result is an improvement over this naive algorithm: we show that the f-Convolution can be computed exactly in Õ((c · |D|^2)^n · polylog(M)) time for the constant c := 3/4 when D has even cardinality. Our main observation is that a cyclic partition of a function f : D × D → D can be used to speed up the computation of the f-Convolution, and we show that an appropriate cyclic partition exists for every f.

Furthermore, we demonstrate that a single entry of the f-Convolution can be computed more efficiently. In this variant, we are given two functions g, h : D^n → {−M, …, M} alongside a vector v ∈ D^n, and the task of the f-Query problem is to compute the integer (g ⊛_f h)(v). This is a generalization of the well-known Orthogonal Vectors problem. We show that f-Query can be solved in Õ(|D|^{(ω/2)n} · polylog(M)) time, where ω ∈ [2, 2.372) is the exponent of the currently fastest matrix multiplication algorithm.


Introduction
Convolutions occur naturally in many algorithmic applications, especially in exact and parameterized algorithms. The most prominent example is the subset convolution procedure [22,36]: an efficient Õ(2^n · polylog(M))-time algorithm for subset convolution dates back to Yates [39], but in the context of exact algorithms it was first used by Björklund et al. [6].¹ Researchers have considered a plethora of other variants of convolution, such as Cover Product, XOR Product, Packing Product, Generalized Subset Convolution, and Discriminantal Subset Convolution [6,7,8,10,11,20,34]. These subroutines are crucial ingredients in the design of efficient algorithms for many exact and parameterized problems, such as Hamiltonian Cycle, Feedback Vertex Set, Steiner Tree, Connected Vertex Cover, Chromatic Number, Max k-Cut, and Bin Packing [5,10,19,27,38,40]. These convolutions are especially useful for dynamic programming algorithms on tree decompositions, where they occur naturally during join operations (e.g., [19,33,34]). Usually, in the process of algorithm design, the researcher needs to design a different type of convolution from scratch for each of these problems. Often this is a highly technical and laborious task. Ideally, we would like to have a single tool that can be used as a black box in all of these cases. This motivates the following ambitious goal of this paper:

Goal: Unify convolution procedures under one general umbrella.
Towards this goal, we consider the problem of computing the f-Generalized Convolution (f-Convolution) introduced by van Rooij [33]. Let D be a finite domain and let D^n be the set of n-length vectors (tuples) over D. Let f : D × D → D be an arbitrary function and let ⊕_f be the coordinate-wise application of f.² For two functions g, h : D^n → Z the f-Convolution, denoted (g ⊛_f h) : D^n → Z, is defined for all v ∈ D^n as

    (g ⊛_f h)(v) := Σ_{v_g, v_h ∈ D^n : v = v_g ⊕_f v_h} g(v_g) · h(v_h).

Here we work in the standard ring Z(+, ·). Throughout the paper we assume that M is the maximum absolute value of the integers given in the input.
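For concreteness, the brute-force baseline implied by this definition can be sketched as follows (the Python function name and the dictionary-based encoding of g and h are ours, purely for illustration):

```python
from itertools import product

def f_convolution_bruteforce(f, D, n, g, h):
    """Compute (g ⊛_f h)(v) for all v in D^n by direct enumeration.

    Runs in O(|D|^{2n}) ring operations, matching the naive bound
    discussed in the text.  g and h map tuples in D^n to integers.
    """
    result = {v: 0 for v in product(D, repeat=n)}
    for u in product(D, repeat=n):          # u plays the role of v_g
        for w in product(D, repeat=n):      # w plays the role of v_h
            v = tuple(f(ui, wi) for ui, wi in zip(u, w))  # v = u ⊕_f w
            result[v] += g[u] * h[w]
    return result
```

For example, with D = {0, 1} and f(x, y) = x OR y this computes the Cover Product, and with f(x, y) = x XOR y the XOR Product.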
In the f-Convolution problem the functions g, h : D^n → {−M, …, M} are given as input and the output is the function (g ⊛_f h). Note that the input and output of the f-Convolution problem together consist of 3 · |D|^n integers. Hence it is conceivable that f-Convolution could be solved in Õ(|D|^n · polylog(M)) time. Such a result for arbitrary f would be a real breakthrough in how we design parameterized algorithms. So far, however, researchers have focused on characterizing functions f for which f-Convolution can be solved in Õ(|D|^n · polylog(M)) time. In [33] van Rooij considered specific instances of this setting, where for some constant r ∈ N the function f is defined as either (i) standard addition: f(x, y) := x + y, (ii) addition with a maximum: f(x, y) := min(x + y, r − 1), (iii) addition modulo r, or (iv) maximum: f(x, y) := max(x, y). Van Rooij [33] showed that for these special cases the f-Convolution can be solved in Õ(|D|^n · polylog(M)) time. His results allow the function f to differ between coordinates. A recent result on generalized Discrete Fourier Transforms [31] can be used in conjunction with Yates's algorithm [39] to compute f-Convolution in Õ(|D|^{ωn/2} · polylog(M)) time when f is a finite-group operation, where ω is the exponent of the currently fastest matrix-multiplication algorithm.³ To the best of our knowledge these are the most general settings in which convolution has been considered so far.
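Case (iii), addition modulo r, illustrates why these special cases are fast: the f-Convolution becomes a cyclic convolution over Z_r^n, computable with an n-dimensional FFT in Õ(r^n) time. The following is a minimal sketch of this idea (our names; a floating-point FFT is used only for illustration, whereas an exact algorithm would work modulo primes, cf. the discussion around Theorem 2.6):

```python
import numpy as np
from itertools import product

def modular_convolution(g, h, r, n):
    """f-Convolution for f(x, y) = x + y (mod r), i.e. a cyclic
    convolution over Z_r^n, via the n-dimensional FFT.

    g, h: dicts mapping tuples in {0, ..., r-1}^n to small integers.
    Floating-point FFT with rounding is for illustration only; an
    exact algorithm would use number-theoretic transforms instead.
    """
    shape = (r,) * n
    G = np.zeros(shape)
    H = np.zeros(shape)
    for u, val in g.items():
        G[u] = val
    for w, val in h.items():
        H[w] = val
    # Pointwise product in Fourier space = cyclic convolution in Z_r^n.
    C = np.fft.ifftn(np.fft.fftn(G) * np.fft.fftn(H))
    return {v: int(round(C[v].real)) for v in product(range(r), repeat=n)}
```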
Nevertheless, for an arbitrary function f, to the best of our knowledge the state of the art for f-Convolution is a straightforward quadratic-time enumeration. This raises the following question.

Question 1: Can f-Convolution be computed significantly faster than by brute force for every function f : D × D → D?

Similar questions have been studied from the point of view of Fine-Grained Complexity. In that setting the focus is on convolutions with sparse representations, where the input size is only the size of the support of the functions g and h. It is conjectured that even subquadratic algorithms are highly unlikely for these representations [18,24]. However, these lower bounds do not answer Question 1, because they are highly dependent on the sparsity of the input.

Our Results
We provide a positive answer to Question 1 and show an exponential improvement (in n) over the naive Õ(|D|^{2n} · polylog(M)) algorithm for every function f; in particular, when |D| is even, Theorem 1.1 gives an Õ((3/4 · |D|^2)^n · polylog(M))-time algorithm. Observe that the running time obtained by Theorem 1.1 improves upon brute force for every |D| ≥ 2. Our technique works in a more general setting where g : L^n → Z, h : R^n → Z and f : L × R → T for arbitrary domains L, R and T (see Section 2 for the exact running time dependence).
Our Technique: Cyclic Partition. We now briefly sketch the idea behind the proof of Theorem 1.1. We say that a function is k-cyclic if it can be represented as addition modulo k (after relabeling the entries of its domain and image). These functions are in a sense simple because, as observed in [32,33], f-Convolution can be computed in Õ(k^n · polylog(M)) time if f is k-cyclic. In a nutshell, our idea is to partition the function f : D × D → D into cyclic functions and compute the convolution on these parts independently. More formally, a cyclic minor of f is a restriction of f (to a subset of its domain) that is k-cyclic for some k; a cyclic partition of f is a collection of cyclic minors that together cover the domain of f, and its cost is the sum of the corresponding moduli k. See Figure 1.1 for an example of a cyclic partition.
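The notion of a k-cyclic function can be made concrete with a small checker (names and dictionary encoding of the relabelings are ours, for illustration only):

```python
def is_k_cyclic(f, A, B, k, sigma_L, sigma_R, sigma_T):
    """Verify that f restricted to A x B is k-cyclic, i.e. that it acts
    as addition modulo k after relabeling:
        f(a, b) == sigma_T[(sigma_L[a] + sigma_R[b]) % k].
    """
    return all(f(a, b) == sigma_T[(sigma_L[a] + sigma_R[b]) % k]
               for a in A for b in B)
```

For instance, a function on {a, b, c} × {a, b, c} that is addition modulo 3 after relabeling a → 0, b → 1, c → 2 (as for the red minor in Figure 1.1) passes this check with k = 3.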
Our first technical contribution is an algorithm to compute f-Convolution when the cost of a cyclic partition is small.

Figure 1.1: An example cyclic partition of f, highlighted in red, green, blue and yellow; each color represents a different minor of f. The right-hand side demonstrates that the red minor can be represented as addition modulo 3 (after relabeling a → 0, b → 1 and c → 2), hence the red minor has cost 3. The reader can further verify that the green and blue minors have cost 2 and the yellow minor has cost 1, hence the cost of this particular partition is 3 + 2 + 2 + 1 = 8.

Lemma 1.2 (Algorithm for f-Convolution). Let D be an arbitrary finite set, let f : D × D → D, and let P be a cyclic partition of f. Then there exists an algorithm which, given g, h : D^n → {−M, …, M}, computes g ⊛_f h in Õ(cost(P)^n · polylog(M)) time.

The idea behind the proof of Lemma 1.2 is as follows. Based on the partition P, for any pair of vectors u, w ∈ D^n we can define a type p ∈ [m]^n, where m is the number of minors in P. Our main idea is to go over each type p and compute the sum in the definition of f-Convolution only for pairs (v_g, v_h) that have type p. In order to do this, we first select the vectors v_g and v_h that are compatible with the type p. For instance, consider the example in Figure 1.1. Whenever p_i refers to, say, the red minor, then we consider v_g only if its i-th coordinate is in {b, c, d} and consider v_h only if its i-th coordinate is in {b, d}. After collecting all these vectors v_g and v_h, we can transform them according to the cyclic minor at each coordinate. Continuing our example, as the red minor is 3-cyclic, we can relabel the i-th coordinates of v_g and v_h into {0, 1, 2}, and then the problem reduces to addition modulo 3 at that coordinate. Therefore, using the algorithm of van Rooij [33] for cyclic convolution, we can handle all pairs of type p in Õ((Π_{i ∈ [n]} k_{p_i}) · polylog(M)) time, where k_j denotes the modulus of the j-th minor. As we go over all m^n types p, the sum of these m^n terms is Σ_p Π_{i ∈ [n]} k_{p_i} = (Σ_{j ∈ [m]} k_j)^n = cost(P)^n. Hence, the overall running time is Õ(cost(P)^n · polylog(M)). This analysis ignores the generation of the vectors given as input to the cyclic convolution algorithm. The efficient computation of these vectors is nontrivial and requires further techniques that we explain in Section 3.
It remains to provide a low-cost cyclic partition of an arbitrary function f.

Lemma 1.3. For any finite set D and any function f : D × D → D there exists a cyclic partition P_f of f with cost(P_f) ≤ (3/4) · |D|^2 when |D| is even.
For the sake of presentation let us assume that |D| is even. In order to prove Lemma 1.3, we partition D into pairs A_1, …, A_k where k := |D|/2 and consider the restrictions of f to A_j × D one by one. Intuitively, we partition the D × D table describing f into pairs of rows and bound the cost of each pair. This restriction allows us to encode f on A_j × D as a directed graph G with |D| edges and |D| vertices. We observe that directed cycles and directed paths can be represented as cyclic minors. Our goal is to decompose the graph G into such subgraphs in a way that the total cost of the resulting cyclic partition is small. Following this argument, the proof of Lemma 1.3 becomes a graph-theoretic analysis. The proof of Lemma 1.3 is included in Section 4. We also give an example which suggests that the constant 3/4 in Lemma 1.3 cannot be improved further while using a partition of D into arbitrary pairs (see Lemma 4.16).
Our method applies to more general functions f : L × R → T, where the domains L, R and T can be different and have arbitrary cardinalities. We note that a weaker variant of Lemma 1.3 with the guarantee cost(P_f) ≤ (7/8) · |D|^2 is easier to attain (see Section 4).
Efficient Algorithm for Convolution Query. Our next contribution is an efficient algorithm to query a single value of the f-Convolution. In the f-Query problem, the input is g, h : D^n → Z and a single vector v ∈ D^n. The task is to compute the value (g ⊛_f h)(v). Observe that this task generalizes⁴ the fundamental Orthogonal Vectors problem. We show that computing f-Query is much faster than computing the full output of f-Convolution.
Theorem 1.4 (Convolution Query). For any finite set D and function f : D × D → D, the f-Query problem can be solved in Õ(|D|^{(ω/2)n} · polylog(M)) time. Here O(m^ω · polylog(M)) is the time needed to multiply two m × m integer matrices with values in {−M, …, M}, and currently ω ∈ [2, 2.372) [2,21]. Note that under the assumption that two matrices can be multiplied in time linear in the input size (i.e., ω = 2), Theorem 1.4 runs in the nearly-optimal Õ(|D|^n · polylog(M)) time. Theorem 1.4 is significantly faster than Theorem 1.1 even if we plug in the naive algorithm for matrix multiplication (i.e., ω = 3). The proof of Theorem 1.4 is inspired by an interpretation of the f-Query problem as counting length-4 cycles in a graph.
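The length-4-cycle viewpoint can be illustrated as follows: splitting the n coordinates into two halves, the query value equals the trace of a product of four |D|^{n/2} × |D|^{n/2} matrices, so fast matrix multiplication yields the stated bound. The sketch below is our illustration of this idea under the assumption that n is even; it is not necessarily the paper's exact construction.

```python
import numpy as np
from itertools import product

def f_query(f, D, n, g, h, v):
    """Evaluate the single entry (g ⊛_f h)(v) via matrix products.

    The n coordinates are split into two halves (n assumed even).
    With N = |D|^{n/2}, the answer is the trace of a product of four
    N x N matrices, so fast matrix multiplication gives the
    O~(|D|^{(ω/2)n}) bound claimed in Theorem 1.4.
    """
    half = n // 2
    idx = list(product(D, repeat=half))          # enumerate D^{n/2}
    v1, v2 = tuple(v[:half]), tuple(v[half:])

    def match(u, w, t):                          # Iverson bracket [u ⊕_f w = t]
        return int(all(f(a, b) == c for a, b, c in zip(u, w, t)))

    G  = np.array([[g[u1 + u2] for u2 in idx] for u1 in idx])         # G[u1, u2]
    H  = np.array([[h[w1 + w2] for w1 in idx] for w2 in idx])         # H[w2, w1]
    M1 = np.array([[match(u1, w1, v1) for u1 in idx] for w1 in idx])  # M1[w1, u1]
    M2 = np.array([[match(u2, w2, v2) for w2 in idx] for u2 in idx])  # M2[u2, w2]
    # tr(G·M2·H·M1) = Σ g(u1 u2)·[u2 ⊕ w2 = v2]·h(w1 w2)·[u1 ⊕ w1 = v1].
    return int(np.trace(G @ M2 @ H @ M1))
```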

Related Work
Arguably, the problem of computing the Discrete Fourier Transform (DFT) is the prime example of a convolution-type problem in computer science. Cooley and Tukey [17] proposed the fast algorithm to compute the DFT. Later, Beth [4] and Clausen [16] initiated the study of generalized DFTs, whose goal is a fast DFT algorithm when the underlying group is arbitrary. After a long line of work (see [30] for a survey), the currently best algorithm for the generalized DFT over a group G runs in O(|G|^{ω/2+ε}) operations for every ε > 0 [31].
A technique similar to ours was introduced by Björklund et al. [9]. That paper gave a characterization of lattices that admit a fast zeta transform and a fast Möbius transform. It used the notion of covering pairs, which is similar to the cyclic partitions used in this paper, but with a completely different goal.
From the lower-bounds perspective, to the best of our knowledge only the trivial Ω(|D|^n) lower bound is known for f-Convolution (as this is the output size). We note that known lower bounds for other convolution-type problems, such as (min, +)-convolution [18,24], (min, max)-convolution [13], min-witness convolution [25], convolution-3SUM [14] or even skew-convolution [12], cannot be easily adapted to f-Convolution, as the hardness of these problems comes primarily from the ring operations.
The Orthogonal Vectors problem is related to the f-Query problem. In the Orthogonal Vectors problem we are given two sets of n vectors A, B ⊆ {0, 1}^d, and the task is to decide if there is a pair a ∈ A, b ∈ B such that a · b = 0. In [37] it was shown that no n^{2−ε} · 2^{o(d)}-time algorithm for Orthogonal Vectors is possible for any ε > 0, assuming SETH [35]. The currently best algorithms for Orthogonal Vectors run in n^{2−1/O(log(d)/log(n))} time [1,15], in O(n · 2^{cd}) time for some constant c < 0.5 [29], or in O(|↓A| + |↓B|) time [7] (where |↓F| is the total number of vectors whose support is a subset of the support of some input vector in F).

Organization
In Section 2 we provide the formal definitions of the problems alongside the general statements of our results. In Section 3 we give an algorithm for f-Convolution that uses a given cyclic partition. In Section 4 we show that for every function f : D × D → D there exists a cyclic partition of low cost. In Section 5 we give an algorithm for f-Query and prove Theorem 1.4. Finally, in Section 6 we conclude the paper and discuss future work.

Preliminaries
Throughout the paper, we use Iverson bracket notation: for a logical expression P, the value of [P] is 1 when P is true and 0 otherwise. For n ∈ N we use [n] to denote {1, …, n}. Throughout the paper we denote vectors in bold; for example, q ∈ Z^k denotes a k-dimensional vector of integers. We use subscripts to denote the entries of vectors, e.g., q := (q_1, …, q_k). Let L, R and T be arbitrary sets and let f : L × R → T be an arbitrary function. We extend such a function f to vectors as follows: for two vectors u ∈ L^n and w ∈ R^n we define u ⊕_f w := (f(u_1, w_1), …, f(u_n, w_n)).
In this paper, we consider the f-Convolution problem with a more general domain and image. We define it formally as follows.

Definition 2.1 (f-Convolution). Let L, R and T be arbitrary sets and let f : L × R → T be an arbitrary function. The f-Convolution of two functions g : L^n → Z and h : R^n → Z, where n ∈ N, is the function (g ⊛_f h) : T^n → Z defined by

    (g ⊛_f h)(v) := Σ_{u ∈ L^n, w ∈ R^n : u ⊕_f w = v} g(u) · h(w)

for every v ∈ T^n. As before, the operations are taken in the standard ring Z(+, ·), and M is the maximum absolute value of the integers given in the input. Now, we formally define the input and output of the f-Convolution problem. Our main result, stated in its most general form, is the following.
We refer to the functions σ_A, σ_B and σ_C as the relabeling functions of f. Our main result follows from the following lemmas.

Lemma 3.1 (Algorithm for Generalized Convolution). Let L, R and T be finite sets, let f : L × R → T be a function, and let P be a cyclic partition of f. Then there is an Õ(cost(P)^n · polylog(M))-time algorithm which, given g : L^n → {−M, …, M} and h : R^n → {−M, …, M}, computes g ⊛_f h.

For any K ⊆ N we define the K-Cyclic Convolution Problem in which we restrict the entries of the vector r in Definition 2.4 to be in K.

Definition 2.5 (K-Cyclic Convolution Problem). For any K ⊆ N, the K-Cyclic Convolution Problem is defined as follows.
Van Rooij [32] claimed that the N-Cyclic Convolution Problem can be solved in Õ((Π_{i=1}^{k} r_i) · polylog(M)) time. However, for his algorithm to work it must be given an appropriately large prime p and several primitive roots of unity in F_p. We are unaware of a method which deterministically finds such a prime and roots while retaining the running time. To overcome this obstacle we present an algorithm for the K-Cyclic Convolution Problem when K ⊆ N is a fixed finite set. Our solution uses multiple smaller primes and the Chinese Remainder Theorem. We include the details in Appendix A.
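The idea of combining several smaller primes can be illustrated as follows; the per-prime convolution is computed naively here as a stand-in for a fast transform modulo each prime, and all names are ours:

```python
def cyclic_convolution_crt(a, b, r, primes):
    """Exact cyclic convolution of integer sequences a, b of length r,
    computed modulo several primes and recombined via the Chinese
    Remainder Theorem, as in the approach described in the text.

    The product of the primes must exceed twice the largest possible
    absolute value of a result entry, so the symmetric-range lift
    recovers the exact integer answer.
    """
    P = 1
    for p in primes:
        P *= p
    residues = []
    for p in primes:
        c = [0] * r
        for i in range(r):
            for j in range(r):
                c[(i + j) % r] = (c[(i + j) % r] + a[i] * b[j]) % p
        residues.append(c)
    # CRT-combine coordinate-wise, then map back to the symmetric range.
    out = []
    for v in range(r):
        x = 0
        for p, c in zip(primes, residues):
            q = P // p
            x = (x + c[v] * q * pow(q, -1, p)) % P
        out.append(x - P if x > P // 2 else x)
    return out
```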
Theorem 2.6 (K-Cyclic Convolution). For any finite set K ⊆ N, there is an Õ((Π_{i=1}^{k} r_i) · polylog(M))-time algorithm for the K-Cyclic Convolution Problem.

Generalized Convolution
In this section we prove Lemma 3.1.
Lemma 3.1 (Algorithm for Generalized Convolution). Let L, R and T be finite sets, let f : L × R → T be a function, and let P be a cyclic partition of f. Then there is an Õ(cost(P)^n · polylog(M))-time algorithm which, given g : L^n → {−M, …, M} and h : R^n → {−M, …, M}, computes g ⊛_f h.

Throughout the section we fix L, R and T, and f : L × R → T, as in the statement of Lemma 3.1. Additionally, we fix a cyclic partition P of f consisting of m cyclic minors, where the j-th minor is k_j-cyclic. We assume the relabeling functions are also fixed throughout this section.
In order to describe our algorithm for Lemma 3.1, we first need to establish several technical definitions.

Definition 3.2 (Type). The type of two vectors u ∈ L^n and w ∈ R^n is the vector p ∈ [m]^n such that, for every i ∈ [n], the pair (u_i, w_i) belongs to the p_i-th minor of P.

Observe that the type of two vectors is well defined since P is a cyclic partition. For any type p ∈ [m]^n we define L_p ⊆ L^n and R_p ⊆ R^n to be the vector domains restricted to type p, and we write Z_p := Z_{k_{p_1}} × ⋯ × Z_{k_{p_n}}. For any type p we introduce relabeling functions on its restricted domains. The relabeling functions of p are the functions σ^L_p : L_p → Z_p, σ^R_p : R_p → Z_p, and σ^T_p : Z_p → T^n obtained by applying the relabeling functions of the p_i-th minor in the i-th coordinate. Our algorithm heavily depends on constructing the following projections.

Definition 3.3 (Projection of a function). The projection of a function g : L^n → Z with respect to a type p ∈ [m]^n is the function g_p : Z_p → Z defined by

    g_p(q) := Σ_{u ∈ L_p : σ^L_p(u) = q} g(u)

for every q ∈ Z_p. Similarly, the projection h_p : Z_p → Z of a function h : R^n → Z with respect to the type p ∈ [m]^n is defined with σ^R_p in place of σ^L_p. The projections are useful due to the following connection with g ⊛_f h.
Lemma 3.4. Let g : L^n → Z and h : R^n → Z. Then for every v ∈ T^n it holds that

    (g ⊛_f h)(v) = Σ_{p ∈ [m]^n} Σ_{q ∈ Z_p : σ^T_p(q) = v} (g_p ⊛ h_p)(q),

where g_p ⊛ h_p is the cyclic convolution of g_p and h_p (with respect to the coordinate-wise addition +_p of Definition 3.7).
We give the proof of Lemma 3.4 in Section 3.1. It should be noted that the naive computation of the projections of g and h with respect to all types p is significantly slower than the running time stated in Lemma 3.1. To achieve the stated running time we compute the projections with a dynamic programming procedure, as stated in the following lemma.

Lemma 3.5. There exists an algorithm which, given a function g : L^n → {−M, …, M}, computes the projection g_p for every type p ∈ [m]^n in Õ(cost(P)^n · polylog(M)) total time.

Remark 3.6. Analogously, we can construct every projection of a function h : R^n → {−M, …, M} within the same running time.

The proof of Lemma 3.5 is given in Section 3.1.
Our algorithm for f-Convolution (see Algorithm 1 for the pseudocode) is a direct implication of Lemma 3.4 and Lemma 3.5. First, the algorithm computes the projections of g and h with respect to every type p. Subsequently, the cyclic convolution of g_p and h_p is computed efficiently as described in Theorem 2.6. Finally, the values of (g ⊛_f h) are reconstructed via the formula in Lemma 3.4.

Algorithm 1 (f-Convolution via a cyclic partition):
1: compute the projections g_p and h_p for every type p ∈ [m]^n (Lemma 3.5)
2: compute c_p := g_p ⊛ h_p for every type p ∈ [m]^n (Theorem 2.6)
3: construct r : T^n → Z by r(v) := Σ_{p ∈ [m]^n} Σ_{q ∈ Z_p : σ^T_p(q) = v} c_p(q)
4: return r
Proof of Lemma 3.1. Observe that Algorithm 1 returns r : T^n → Z where, for every v ∈ T^n,

    r(v) = Σ_{p ∈ [m]^n} Σ_{q ∈ Z_p : σ^T_p(q) = v} (g_p ⊛ h_p)(q) = (g ⊛_f h)(v),

where the last equality is by Lemma 3.4. Thus, the algorithm returns g ⊛_f h as required.

It therefore remains to bound the running time of the algorithm. By Lemma 3.5, Line 1 of Algorithm 1 runs in Õ(cost(P)^n · polylog(M)) time. By Theorem 2.6, Line 2 runs in Õ(Σ_{p ∈ [m]^n} (Π_{i ∈ [n]} k_{p_i}) · polylog(M)) = Õ(cost(P)^n · polylog(M)) time. Finally, observe that the construction of r in Line 3 can be implemented by initializing r to zeros and iteratively adding the value of c_p(q) to r(σ^T_p(q)) for every p ∈ [m]^n and q ∈ Z_p. The required running time is thus Õ(|T|^n · polylog(M)) for the initialization and Õ(cost(P)^n · polylog(M)) for the addition operations, since Σ_{p ∈ [m]^n} |Z_p| = cost(P)^n. Thus, the overall running time of Line 3 is Õ((|T|^n + cost(P)^n) · polylog(M)). This concludes the proof of Lemma 3.1.

Properties of Projections
In this section we provide the proofs of Lemma 3.4 and Lemma 3.5. The proof of Lemma 3.4 uses the following definition of coordinate-wise addition with respect to a type p.
Definition 3.7 (Coordinate-wise addition modulo for a type). For any p ∈ [m]^n we define a coordinate-wise addition modulo the type's moduli as q +_p r := ((q_1 + r_1) mod k_{p_1}, …, (q_n + r_n) mod k_{p_n}) for every q, r ∈ Z_p.
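In code, the operation +_p is a one-liner (our names; the list k plays the role of (k_{p_1}, …, k_{p_n})):

```python
def add_mod_type(q, r, k):
    """Coordinate-wise addition modulo the type's moduli:
    (q +_p r)_i = (q_i + r_i) mod k_{p_i}, with k[i] = k_{p_i}."""
    return tuple((qi + ri) % ki for qi, ri, ki in zip(q, r, k))
```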
Proof of Lemma 3.4. By Definition 2.1 it holds that

    (g ⊛_f h)(v) = Σ_{u ∈ L^n, w ∈ R^n : u ⊕_f w = v} g(u) · h(w).    (3.1)

Recall that the type of every pair of vectors (u, w) ∈ L^n × R^n is unique and [m]^n contains all possible types; hence, we can rewrite (3.1) as

    (g ⊛_f h)(v) = Σ_{p ∈ [m]^n} Σ_{u ∈ L_p, w ∈ R_p : u ⊕_f w = v} g(u) · h(w).    (3.2)

By the properties of the relabeling functions, for u ∈ L_p and w ∈ R_p we have u ⊕_f w = v if and only if σ^T_p(σ^L_p(u) +_p σ^R_p(w)) = v. Observe that we can partition L_p (respectively R_p) by considering the inverse images of r ∈ Z_p under σ^L_p (respectively σ^R_p), i.e., L_p = ⋃_{r ∈ Z_p} {u ∈ L_p | σ^L_p(u) = r}. Hence, for every p ∈ [m]^n and q ∈ Z_p it holds that

    Σ_{u ∈ L_p, w ∈ R_p : σ^L_p(u) +_p σ^R_p(w) = q} g(u) · h(w) = (g_p ⊛ h_p)(q).    (3.3)

By plugging (3.3) into (3.2) we get the desired equality.

Existence of Low-Cost Cyclic Partition
In this section we prove Lemma 4.1.
Lemma 4.1. Let f : L × R → T where L, R and T are finite sets. Then there is a cyclic partition P of f with cost(P) ≤ (3/4) · |L| · |R| when |L| is even; a slightly weaker bound holds when |L| is odd.

We first consider the special case when |L| = 2. Later we reduce the general case to this scenario and use the result as a black box.
As a warm-up we construct a cyclic partition of cost at most (7/8) · |D|^2, assuming that L = R = T = D and that |D| is even. For this, we first partition D into pairs {d_{2i−1}, d_{2i}} where i ∈ [|D|/2], and show that for each such pair the restriction of f to {d_{2i−1}, d_{2i}} × D admits a cyclic partition of cost at most (7/4) · |D|. Note that the restriction of f to {d_{2i−1}, d_{2i}} × D can trivially be decomposed into at most 2|D| minors of cost 1 each; the bound above improves on this trivial decomposition, and summing over the |D|/2 pairs of rows yields total cost at most (7/8) · |D|^2.

Special Case: |L| = 2
In this section, we prove the following lemma, which is a special case of Lemma 4.1.
To construct the cyclic partition we proceed as follows. First, we define, for a function f, its representation graph G_f. Next, we show that if this graph has a special structure, which we call nice, then we can easily find a cyclic partition of f. Afterwards we decompose (the edges of) an arbitrary representation graph G_f into nice structures and combine the cyclic partitions arising from these parts into a cyclic partition of the original function f. In the representation graph we put a directed edge from u to v if there is an r_i ∈ R with u = f(ℓ_0, r_i) and v = f(ℓ_1, r_i).

Figure: An example decomposition of the edges of a representation graph into a cycle with 4 vertices (highlighted red) and three paths with 5, 2 and 4 vertices (highlighted blue, yellow and green, respectively). The cost of this cyclic partition is 4 + 5 + 2 + 4 = 15.

Definition 4.3 (Graph Representation). Let f : L × R → T with L = {ℓ_0, ℓ_1}. The representation graph G_f of f is the directed (multi)graph with vertex set T that contains, for each r ∈ R, the edge λ_f(r) := (f(ℓ_0, r), f(ℓ_1, r)).
We say that the representation graph G_f is nice if G_f is a directed cycle or a directed path (possibly a single edge). Let E′ ⊆ E(G_f) be a subset of edges inducing the subgraph G′ of G_f. With T′ := V(G′) and R′ := {r ∈ R | λ_f(r) ∈ E′}, we define f′ : L × R′ → T′ as the restriction of f whose representation graph is G′. Formally, f′(ℓ, r) = f(ℓ, r) for all ℓ ∈ L and r ∈ R′. We say that f′ is the function represented by G′ or E′, respectively.
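Building the representation graph is straightforward; the following sketch (our names) constructs the vertex set and the edge multiset λ_f(r) = (f(ℓ_0, r), f(ℓ_1, r)) for a function f with L = {0, 1}:

```python
def representation_graph(f, R, l0=0, l1=1):
    """Build the representation graph G_f of f : {l0, l1} x R -> T.

    Vertices are the values taken by f; for each r in R there is a
    directed edge lambda_f(r) = (f(l0, r), f(l1, r)).
    """
    edges = [(f(l0, r), f(l1, r)) for r in R]
    vertices = {t for e in edges for t in e}
    return vertices, edges
```

For instance, f(ℓ, r) = (ℓ + r) mod 3 with R = {0, 1, 2} yields a directed 3-cycle, a nice graph, matching the fact that this f is 3-cyclic.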
A decomposition of a directed graph G is a family F of edge-disjoint subgraphs of G such that each edge belongs to exactly one subgraph in F. The following observation follows directly from the previous definition.

Observation 4.5. Let {G_1, …, G_k} be a decomposition of the graph G_f into k subgraphs, let f_i be the function represented by G_i, and let P_i be a cyclic partition of f_i for every i ∈ [k]. Then P_1 ∪ ⋯ ∪ P_k is a cyclic partition of f of cost Σ_{i ∈ [k]} cost(P_i).

Cyclic Partitions Using Nice Representation Graphs. As a next step, we show that f admits a cyclic partition if its representation graph is nice; we then extend this to functions with arbitrary representation graphs by decomposing these graphs into nice subgraphs and combining the resulting partitions into a cyclic partition of the original function f. Specifically, we show that if G_f is nice, then f is |T|-cyclic.

Proof. By definition, a nice graph is either a cycle or a path. We handle each case separately. Let L = {ℓ_0, ℓ_1}.

Case 1: G_f is a cycle. We first define the relabeling functions of f to show that f is |T|-cyclic.
For the elements of L, let σ_L : L → Z_2 with σ_L(ℓ_i) = i. To define σ_R and σ_T, fix an arbitrary t_0 ∈ T. Let t_1, …, t_{|T|} be the elements of T with t_{|T|} = t_0, ordered such that for all j ∈ Z_{|T|} there is some r_j ∈ R with λ_f(r_j) = (t_j, t_{j+1}).⁶ Note that such r_j exist since G_f is a cycle. Using this notation, we define σ_T : Z_{|T|} → T with σ_T(j) = t_j for all j ∈ Z_{|T|}. For the elements of R we define σ_R : R → Z_{|T|} with σ_R(r) = j whenever λ_f(r) = (t_j, t_{j+1}).
It is easy to check that f can be seen as addition modulo |T|. Indeed, let i ∈ {0, 1} and r ∈ R with λ_f(r) = (t_j, t_{j+1}). Then we get f(ℓ_i, r) = t_{j+i} = σ_T(σ_L(ℓ_i) + σ_R(r) mod |T|). Thus, f is |T|-cyclic and {(L, R, |T|)} is a cyclic partition of f.

Case 2: G_f is a path. Similarly to the previous case, f can be represented as addition modulo |T|. As the proof is essentially identical to the cyclic case, we omit the details here.
In the next step, we decompose arbitrary graphs into nice subgraphs.To present our decomposition we need to introduce the following notation related to the degree of vertices.Definition 4.7 (Sources, Sinks and Middle Vertices).Let G = (V, E) be a directed graph.We denote by indeg(v) the in-degree of v, i.e., the number of edges terminating at v, and by outdeg(v) the out-degree of v, i.e., the number of edges starting at v.
We partition V into the three sets V_src(G), V_mid(G), and V_snk(G) defined as follows. The set V_src(G) contains all source vertices of G, that is, vertices with no incoming edges (i.e., indeg(v) = 0); this includes all isolated vertices. The set V_mid(G) contains all middle vertices of G, that is, vertices with both incoming and outgoing edges (i.e., indeg(v), outdeg(v) ≥ 1). The set V_snk(G) contains the remaining sink vertices of G, that is, vertices with incoming but no outgoing edges (i.e., indeg(v) ≥ 1 and outdeg(v) = 0). We additionally introduce the notion of deficiency, which we use in the following proofs.
For a vertex v, its deficiency is defi(v) := max(outdeg(v) − indeg(v), 0). We define Defi(G) := Σ_{v∈V} defi(v) as the total deficiency of the graph G.
We omit the graph G from the notation when it is clear from context. We use the deficiency to decompose acyclic graphs into paths.
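These degree-based quantities can be sketched directly; the sketch below assumes the deficiency formula defi(v) = max(outdeg(v) − indeg(v), 0), as reconstructed from the surrounding proofs.

```python
from collections import Counter

def deficiencies(edges):
    # defi(v) = max(outdeg(v) - indeg(v), 0) for every vertex of the multigraph
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return {v: max(outdeg[v] - indeg[v], 0)
            for v in set(indeg) | set(outdeg)}

# a path a -> b -> c plus an extra edge a -> d: Defi(G) = 2, matching the two
# edge-disjoint paths (a -> b -> c and a -> d) needed to cover all edges
defi = deficiencies([("a", "b"), ("b", "c"), ("a", "d")])
assert sum(defi.values()) == 2
```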

Lemma 4.9. Every directed graph G can be decomposed into Defi(G) paths and an arbitrary number of cycles.
Proof. We construct the decomposition F of G as follows. In the first phase, we exhaustively find a directed cycle C in G, add C to the decomposition F, and remove the edges of C from G. We continue this procedure until the graph G becomes acyclic. In the second phase, we exhaustively find a directed path P of maximum length (note that P may be a single edge), add P to the decomposition F, and remove the edges of P from G. We repeat the second phase until the graph G becomes edgeless. This concludes the construction of the decomposition F. For correctness, observe that the above procedure always terminates because each step decreases the number of edges of G. Moreover, at the end of the procedure, F is a decomposition of G that consists only of paths and cycles.
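The two-phase construction can be sketched in code; this is a plain-Python sketch for small inputs (the helper names are ours), with cycle detection by depth-first search and maximum-length paths by dynamic programming on the acyclic remainder.

```python
def find_cycle(edges):
    # return the edge indices of some directed cycle, or None if acyclic
    adj, color, parent = {}, {}, {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
    def dfs(u):
        color[u] = 1                      # on the current DFS stack
        for v, idx in adj.get(u, []):
            if color.get(v, 0) == 0:
                parent[v] = (u, idx)
                cyc = dfs(v)
                if cyc:
                    return cyc
            elif color[v] == 1:           # back edge closes a cycle v -> ... -> u -> v
                cyc, w = [idx], u
                while w != v:
                    w, pidx = parent[w][0], parent[w][1]
                    cyc.append(pidx)
                return cyc
        color[u] = 2
        return None
    for u in list(adj):
        if color.get(u, 0) == 0:
            cyc = dfs(u)
            if cyc:
                return cyc
    return None

def longest_path(edges):
    # maximum-length directed path in a DAG, as a list of edge indices
    adj, nodes, best = {}, set(), {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        nodes.update((u, v))
    def go(u):
        if u not in best:
            cand = [(0, [])]
            for v, idx in adj.get(u, []):
                ln, p = go(v)
                cand.append((ln + 1, [idx] + p))
            best[u] = max(cand)
        return best[u]
    return max((go(u) for u in nodes), key=lambda t: t[0])[1]

def decompose(edge_list):
    # phase 1: peel off cycles; phase 2: peel off maximum-length paths
    edges, cycles, paths = list(edge_list), [], []
    while True:
        c = find_cycle(edges)
        if c is None:
            break
        cycles.append([edges[i] for i in c])
        edges = [e for i, e in enumerate(edges) if i not in set(c)]
    while edges:
        p = longest_path(edges)
        paths.append([edges[i] for i in p])
        edges = [e for i, e in enumerate(edges) if i not in set(p)]
    return cycles, paths
```

On the example a → b → a plus a → c, the procedure returns one cycle and one path, and the number of paths equals Defi(G) as the proof below shows.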
We are left to show that the number of paths in F is exactly Defi(G). Note that deleting a cycle in G does not change the value of Defi(G); hence the first phase of the procedure does not influence Defi(G), and we can assume that G is acyclic.
Next, we show that deleting a maximum-length path from an acyclic graph decrements its deficiency by exactly 1. This then concludes the proof, because in the second phase of the procedure the deficiency of G decreases from Defi(G) down to 0, which means that exactly Defi(G) maximum-length paths were added to F.
Let P be a maximum-length directed path in the acyclic graph G, and let s, t ∈ V(G) be the starting and terminating vertices of P. Path P must start at a vertex with positive deficiency, because otherwise P could have been extended at the start, which would contradict the fact that P has maximum length. Similarly, since P has maximum length, it must terminate in a sink vertex. Hence defi(s) > 0 and defi(t) = 0. Moreover, every vertex v ∈ P \ {s, t} has exactly one incoming and one outgoing edge in P. Therefore, in the graph G \ P the contribution to the total deficiency decreases only at the vertex s, and only by 1. This means that Defi(G) = Defi(G \ P) + 1, which concludes the proof.

Now we combine Lemmas 4.6 and 4.9 to show Lemma 4.10.

Proof. First, use Lemma 4.9 to decompose the graph into cycles and Defi(G_f) paths. Then, for each of these paths and cycles, use Lemma 4.6 to obtain a cyclic minor. By Observation 4.5, these minors form a cyclic partition for the function represented by G_f. Let P be the resulting cyclic partition.
It remains to analyze the cost of the cyclic partition P. By construction, each cyclic minor in P corresponds to a path or a cycle (possibly of length 1). By Lemma 4.6, the cost of a path or a cycle is the number of vertices it contains. Thus, for a path the cost equals the number of edges plus one, and for a cycle the cost equals the number of edges. Hence, the cost of P is bounded by the number of edges of G_f plus the number of paths in the decomposition. The latter is precisely Defi(G_f) by Lemma 4.9.
Cyclic Partitions Using a Direct Construction. In the following, we use a different method to construct a cyclic partition of the function f. Instead of decomposing the graph into nice subgraphs, we directly construct a partition and bound its cost.

Proof. For each ℓ ∈ L, we use a single cyclic minor. Let L = {ℓ_0, ℓ_1} and, for i ∈ {0, 1}, define the cyclic minor associated with ℓ_i.

Bounding the Cost of Cyclic Partitions. Now we combine the results of Lemmas 4.10 and 4.11. We first show how the number of edges relates to the total deficiency of a graph and the number of its middle vertices.

Proof. Let m be the number of edges of G and let e_1, …, e_m ∈ E(G) be an arbitrarily fixed order of its edges. For every i ∈ {0, …, m}, let G_i be the graph with vertex set V(G) and edge set E(G_i) = {e_1, …, e_i}. Hence G_0 is an independent set on V(G) and G_m = G.
For every i ∈ {0, …, m}, let LHS(G_i) := |V_mid(G_i)| + Defi(G_i) be the quantity we need to bound. We show that

LHS(G_i) ≤ LHS(G_{i−1}) + 1 for every i ∈ [m], (4.1)
which then concludes the proof because LHS(G_0) = 0 and therefore LHS(G_m) ≤ m = |E(G)|. From now on, we focus on the proof of Equation (4.1). For every v ∈ V(G) and i ∈ {0, …, m}, let defi_i(v) be the deficiency of the vertex v in the graph G_i. Next, for every v ∈ V(G) and i ∈ [m], we define

∆_i(v) := [v ∈ V_mid(G_i)] − [v ∈ V_mid(G_{i−1})] + defi_i(v) − defi_{i−1}(v),

where [·] denotes the indicator of an event. Let e_i = (s, t) be the i-th edge, starting at a vertex s and terminating at a vertex t.

Claim 4.13. It holds that ∆_i(s) ≤ 1.

Proof. We consider two cases depending on whether s became a middle vertex.
If s ∈ V_mid(G_i) \ V_mid(G_{i−1}), then in G_{i−1} the vertex s has incoming but no outgoing edges, which means that s has more incoming than outgoing edges in G_{i−1}. Hence defi_{i−1}(s) = defi_i(s) = 0 and we conclude that ∆_i(s) = 1. Otherwise, s ∉ V_mid(G_i) \ V_mid(G_{i−1}). Because the edge e_i starts at s, the deficiency of s can increase by at most 1. Hence ∆_i(s) = defi_i(s) − defi_{i−1}(s) ≤ 1.

Finally, we consider the end vertex t of the edge e_i.

Claim 4.14. It holds that ∆_i(t) ≤ 0.
Proof. We again distinguish two cases depending on whether t became a middle vertex. If t ∈ V_mid(G_i) \ V_mid(G_{i−1}), then t has no incoming edges and a positive number of outgoing edges in G_{i−1}. Therefore defi_i(t) = defi_{i−1}(t) − 1, which means that ∆_i(t) ≤ 0.
It remains to analyze the case where t ∉ V_mid(G_i) \ V_mid(G_{i−1}). Since the edge e_i ends at t, the deficiency of t cannot increase, so defi_i(t) ≤ defi_{i−1}(t). This means that ∆_i(t) ≤ 0. By Claims 4.13 and 4.14, it follows that ∆_i(s) + ∆_i(t) ≤ 1. This establishes Equation (4.1) and concludes the proof. Now we are ready to combine Lemmas 4.10 and 4.11 and prove Lemma 4.2.
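The resulting inequality, |V_mid(G)| + Defi(G) ≤ |E(G)|, can be checked empirically on random directed multigraphs; the sketch assumes defi(v) = max(outdeg(v) − indeg(v), 0).

```python
import random
from collections import Counter

def lemma_holds(edges):
    # check |V_mid(G)| + Defi(G) <= |E(G)| for a directed multigraph
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    verts = set(indeg) | set(outdeg)
    mid = sum(1 for v in verts if indeg[v] >= 1 and outdeg[v] >= 1)
    defi = sum(max(outdeg[v] - indeg[v], 0) for v in verts)
    return mid + defi <= len(edges)

random.seed(0)
for _ in range(500):
    m = random.randint(0, 12)
    edges = [(random.randrange(6), random.randrange(6)) for _ in range(m)]
    assert lemma_holds(edges)
```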
Proof of Lemma 4.2.As before, we denote by G f the representation graph of f .Let V and E be the set of vertices and edges of graph G f .
Let P_1 be the cyclic partition of f from Lemma 4.10 with cost at most |E| + Defi(G_f), and let P_2 be the cyclic partition of f from Lemma 4.11 with cost at most |V| + |V_mid(G_f)|. We define P as the minimum-cost partition among P_1 and P_2. This implies that

cost(P) ≤ min{cost(P_1), cost(P_2)} ≤ (cost(P_1) + cost(P_2)) / 2 ≤ (|E| + Defi(G_f) + |V| + |V_mid(G_f)|) / 2.

Next, we use the inequality |V_mid(G_f)| + Defi(G_f) ≤ |E| to conclude that cost(P) ≤ (2|E| + |V|) / 2 = |E| + |V|/2. Since |E(G_f)| ≤ |R| and |V(G_f)| = |T|, this yields cost(P) ≤ |R| + |T|/2.

General case: Proof of Lemma 4.1
Now we have everything ready to prove the main result of this section.
Proof of Lemma 4.1. We first handle the case when |L| is even. We partition L into λ = |L|/2 sets L_1, …, L_λ consisting of exactly two elements each. We use Lemma 4.2 to find a cyclic partition P_i for each f_i : L_i × R → T. By the definition of a cyclic partition, P = ∪_{i∈[λ]} P_i is a cyclic partition for f, hence it remains to analyze the cost of P.
Observe that, for each f_i, Lemma 4.2 yields cost(P_i) ≤ |R| + |T|/2. By the definition of the cost of a cyclic partition, we immediately get that cost(P) = Σ_{i∈[λ]} cost(P_i) ≤ (|L|/2) · (|R| + |T|/2). If |L| is odd, then we remove one element ℓ from L and let L_0 = {ℓ}. There is a trivial cyclic partition P_0 for f_0 : L_0 × R → T of cost at most |R|. Then we use the above procedure to find a cyclic partition P′ for the restriction of f to L \ {ℓ} and R. Hence, setting P = P_0 ∪ P′ gives a cyclic partition for f with cost(P) ≤ |R| + ((|L| − 1)/2) · (|R| + |T|/2). The analogous bounds in terms of |R| follow by swapping the roles of L and R and considering the function f′ : R × L → T with f′(r, ℓ) = f(ℓ, r) for all ℓ ∈ L and r ∈ R.

Tight Example: Lower bound on Lemma 4.2
To complement the previous results, we show that Lemma 4.2 is tight. That is, there is a function f : L × R → T with |L| = 2 such that no cyclic partition P of f has cost smaller than |R| + |T|/2. In particular, this demonstrates that new ideas are needed to improve the constant c := 3/4 in Theorem 1.1.

Proof. Define L = {ℓ_0, ℓ_1}, R = {r_1, r_2, r_3, r_4}, and T = {a, b, c, d}. Let f be the function defined in Figure 4.2. Since |R| + |T|/2 = 4 + 2 = 6, we need to show that every cyclic partition of f has cost at least 6.
Let P be a cyclic partition of f. We first claim that P contains a single cyclic minor, i.e., P = {(L, R, k)} for some integer k. For contradiction's sake, we analyze every other possible structure of P and argue that in each case cost(P) ≥ 6. Every cyclic minor in P is of the form ({ℓ_i}, B, k) (i.e., uses only values from a single row): then cost(P) ≥ 6, as each row contains 3 distinct values. There is a cyclic minor ({ℓ_0, ℓ_1}, {r_j}, k) in P: since each column contains two distinct elements, it must hold that k ≥ 2; furthermore, the cyclic minors covering the remainder of the graph must have a total cost of at least 4, as all values of T appear in the remainder of the graph, hence cost(P) ≥ 6. There is a cyclic minor ({ℓ_0, ℓ_1}, {r_j, r_{j′}}, k) in P: since each pair of columns contains at least three values, it must hold that k ≥ 3; there are at least 3 distinct values in the remainder of the graph, hence the cost of the remaining minors is at least 3, and thus cost(P) ≥ 6. There is a cyclic minor ({ℓ_0, ℓ_1}, R \ {r_j}, k) in P: it holds that k ≥ 4, as every three columns include all values of T; in each case, there are two distinct values in the remaining column, hence the cost of the remaining minors is at least 2, and therefore cost(P) ≥ 6.
With this, we know that P contains only the single cyclic minor (L, R, k). Let σ_L, σ_R and σ_T be the relabeling functions of (L, R, k). From the definition of the relabeling functions, we get that F := {(σ_L(ℓ_i) + σ_R(r_j)) mod k | i ∈ {0, 1} and j ∈ {1, 2, 3}} contains at least four elements.
We claim that (σ_L(ℓ_0) + σ_R(r_4)) mod k ∉ F. For the sake of contradiction, assume otherwise. Then, by the definition of σ_T, the value f(ℓ_0, r_4) would coincide with one of the values realized on the first three columns, which is a contradiction. Similarly, we get that (σ_L(ℓ_1) + σ_R(r_4)) mod k ∉ F: assuming otherwise, we have that σ_R(r_3) = σ_R(r_4), which then implies a coincidence of two distinct function values, again a contradiction. Since the set F ∪ {(σ_L(ℓ_0) + σ_R(r_4)) mod k, (σ_L(ℓ_1) + σ_R(r_4)) mod k} ⊆ Z_k contains at least six distinct elements, we get k ≥ 6 and therefore cost(P) ≥ 6.

Querying a Generalized Convolution
In this section, we prove Theorem 1.4. We assume throughout that n is even; then, for a vector v ∈ D^n, let v^(high), v^(low) ∈ D^{n/2} be the unique vectors such that v^(high) ∘ v^(low) = v. This assumption is without loss of generality: if n is odd, fix an arbitrary d ∈ D and extend g, h and v by one coordinate, letting the extended functions agree with g and h when the new coordinate equals d (and vanish otherwise) and appending f(d, d) to the query vector. Solving this extended f-Query instance yields the correct result.
We first provide the intuition behind the algorithm and then give the formal proof.
Intuition. We define a directed multigraph G whose vertices are partitioned into four layers L^(high), L^(low), R^(low), and R^(high). Each of these sets consists of |D|^{n/2} vertices, one for every vector in D^{n/2}. For ease of notation, we use the vectors to denote the associated vertices; furthermore, for the intuition we assume that g and h are non-negative. The multigraph G contains the following edges: g(w ∘ x) parallel edges from w ∈ D^{n/2} in L^(high) to x ∈ D^{n/2} in L^(low); one edge from x ∈ D^{n/2} in L^(low) to y ∈ D^{n/2} in R^(low) if and only if x ⊕_f y = v^(low); h(z ∘ y) parallel edges from y ∈ D^{n/2} in R^(low) to z ∈ D^{n/2} in R^(high); and one edge from z ∈ D^{n/2} in R^(high) to w ∈ D^{n/2} in L^(high) if and only if w ⊕_f z = v^(high). In the formal proof, we denote the adjacency matrix between L^(high) and L^(low) by W, between L^(low) and R^(low) by X, between R^(low) and R^(high) by Y, and between R^(high) and L^(high) by Z. See Figure 5.1 for an example of this construction.

Let w, x, y, z ∈ D^{n/2} be vertices in L^(high), L^(low), R^(low), and R^(high), respectively. If (w ∘ x) ⊕_f (z ∘ y) ≠ v, then G does not contain any cycle of the form w → x → y → z → w, as one of the edges (x, y) or (z, w) is not present in the graph. Conversely, if (w ∘ x) ⊕_f (z ∘ y) = v, then one can verify that there are g(w ∘ x) · h(z ∘ y) cycles of the form w → x → y → z → w. We therefore expect that (g ⊛_f h)(v) is the number of cycles in G that start at some w ∈ D^{n/2} in L^(high), have length four, and end at the same vertex w again.
Formal Proof. We use the notation Mat_Z(D^{n/2} × D^{n/2}) to refer to a |D|^{n/2} × |D|^{n/2} integer matrix whose rows and columns are indexed by the vectors in D^{n/2}. The transition matrices of g, h and v are the matrices W, X, Y, Z ∈ Mat_Z(D^{n/2} × D^{n/2}) defined by

W[w, x] := g(w ∘ x),  X[x, y] := [x ⊕_f y = v^(low)],  Y[y, z] := h(z ∘ y),  Z[z, w] := [w ⊕_f z = v^(high)].

Recall that the trace tr(A) of a matrix A ∈ Mat_Z(m × m) is defined as tr(A) := Σ_{i=1}^m A[i, i]. The next lemma formalizes the correctness of this construction.

Lemma 5.1. Let n ∈ N be even, let g, h : D^n → {−M, …, M}, let v ∈ D^n, and let W, X, Y, Z be the transition matrices of g, h and v. Then (g ⊛_f h)(v) = tr(W · X · Y · Z).

Proof. For any w, y ∈ D^{n/2} it holds that

(W · X)[w, y] = Σ_{x ∈ D^{n/2} : x ⊕_f y = v^(low)} g(w ∘ x). (5.1)

Similarly, for any y, w ∈ D^{n/2} it holds that

(Y · Z)[y, w] = Σ_{z ∈ D^{n/2} : w ⊕_f z = v^(high)} h(z ∘ y). (5.2)

Therefore, for any w ∈ D^{n/2},

(W · X · Y · Z)[w, w] = Σ_{y ∈ D^{n/2}} (W · X)[w, y] · (Y · Z)[y, w] = Σ_{x, y, z : x ⊕_f y = v^(low), w ⊕_f z = v^(high)} g(w ∘ x) · h(z ∘ y),

where the second equality follows by (5.1) and (5.2). Summing over all w ∈ D^{n/2} and using (w ∘ x) ⊕_f (z ∘ y) = (w ⊕_f z) ∘ (x ⊕_f y) yields tr(W · X · Y · Z) = (g ⊛_f h)(v).
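The trace identity of Lemma 5.1 can be sanity-checked on a toy instance; the domain D, the function f, and the parameters below are illustrative choices of ours, not from the paper.

```python
import itertools
import random

D = [0, 1, 2]

def f(a, b):
    return (a + 2 * b) % 3          # an arbitrary illustrative choice of f

n, half = 4, 2
vecs = list(itertools.product(D, repeat=half))
full = list(itertools.product(D, repeat=n))
random.seed(1)
g = {u: random.randint(-3, 3) for u in full}
h = {u: random.randint(-3, 3) for u in full}
v = (1, 2, 0, 1)                    # query vector; v_high = (1, 2), v_low = (0, 1)

def oplus(x, y):
    return tuple(f(a, b) for a, b in zip(x, y))

# brute-force evaluation of (g *_f h)(v)
brute = sum(g[x] * h[y] for x in full for y in full if oplus(x, y) == v)

# transition matrices, rows and columns indexed by half-vectors
m = len(vecs)
W = [[g[w + x] for x in vecs] for w in vecs]                      # W[w][x] = g(w . x)
X = [[1 if oplus(x, y) == v[half:] else 0 for y in vecs] for x in vecs]
Y = [[h[z + y] for z in vecs] for y in vecs]                      # Y[y][z] = h(z . y)
Z = [[1 if oplus(w, z) == v[:half] else 0 for w in vecs] for z in vecs]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

trace = sum(row[i] for i, row in enumerate(matmul(matmul(matmul(W, X), Y), Z)))
assert trace == brute
```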

Conclusion and Future Work
In this paper, we studied the f-Convolution problem and demonstrated that the naive brute-force algorithm can be improved for every f : D × D → D. We achieve this by introducing the cyclic partition of a function and showing that there always exists a cyclic partition of bounded cost. We give an O((c|D|²)^n · polylog(M))-time algorithm that computes f-Convolution for c := 3/4 when |D| is even. The cyclic partition is a very general tool, and potentially it can be used to achieve greater improvements for certain functions f. For example, in multiple applications (e.g., [19, 23, 28, 33]) the function f has a cyclic partition with a single cyclic minor. Nevertheless, in our proof we only use cyclic minors in which one domain has size at most 2. We suspect that larger minors have to be considered to obtain better results. Indeed, the lower bound from Lemma 4.16 implies that our technique of considering two arbitrary rows together cannot give a faster algorithm than O((3/4 · |D|²)^n · polylog(M)) in general. An improved algorithm would have to select these rows very carefully or consider three or more rows at the same time.

A Proof of Theorem 2.6
In this section we prove Theorem 2.6. We crucially rely on the following result by van Rooij [32].

Theorem A.1 ([32, Lemma 3]).
There is an algorithm which, given k ∈ N, a vector r ∈ N^k, a prime p, an r_j-th primitive root of unity ω_j for every j ∈ [k], and two functions g, h : Z → Z_p with Z = Z_{r_1} × … × Z_{r_k}, computes the Cyclic Convolution g ⊛ h modulo p in time O(R · polylog(p)), where R := Π_{j=1}^k r_j.

Ideally, we would like to use the algorithm from Theorem A.1 with a sufficiently large prime p such that the values of g ⊛ h could be recovered from the values of g ⊛ h modulo p. Finding such a prime p along with the required roots of unity is, however, a non-trivial task which we do not know how to perform deterministically while retaining the running time O(R · polylog(M)). The basic idea behind our approach is instead to compute g ⊛ h modulo p_i for a sufficiently large number of distinct small primes p_i using Theorem A.1. If Π_i p_i is sufficiently large, then the values of g ⊛ h can be uniquely recovered using the Chinese Remainder Theorem.
Theorem A.2 (Chinese Remainder Theorem). Let p_1, …, p_m be a sequence of pairwise coprime integers, define P := Π_{i=1}^m p_i, and let a_1, …, a_m be integers. Then there is a unique number 0 ≤ s < P such that s ≡ a_i mod p_i for all i ∈ [m]. Moreover, there is an algorithm that, given p_1, …, p_m and a_1, …, a_m, computes the number s in time O((log P)²).
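The reconstruction step of Theorem A.2 can be sketched with the standard incremental CRT; this is a textbook routine, not the paper's implementation.

```python
def crt(residues, moduli):
    # incremental CRT: maintain the solution modulo the product so far;
    # requires pairwise coprime moduli (pow(-1, ...) needs Python 3.8+)
    s, P = 0, 1
    for a, p in zip(residues, moduli):
        t = ((a - s) * pow(P, -1, p)) % p
        s, P = s + t * P, P * p
    return s

assert crt([2, 3, 2], [3, 5, 7]) == 23   # 23 = 2 (mod 3) = 3 (mod 5) = 2 (mod 7)
```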
To find the small primes for the application of the Chinese Remainder Theorem, we additionally use density properties of primes in arithmetic progressions. Given q ∈ N, we say p ∈ N is a q-prime if p is a prime number and p ≡ 1 mod q. We use prime_q(i) to denote the i-th q-prime; that is, prime_q(i) is a q-prime such that the number of q-primes smaller than prime_q(i) is exactly i − 1. Also, for any B, q ∈ N, we define prime_bound_q(B) := min{m ∈ N | Π_{i=1}^m prime_q(i) ≥ B} to be the minimal number m such that the product of the first m q-primes is at least B. We use the following upper bound on prime_bound.
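The quantities prime_q(i) and prime_bound_q(B) can be computed directly for small q; a brute-force sketch with trial-division primality testing (the function names are ours).

```python
def is_prime(n):
    # trial division; sufficient for the small primes considered here
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def q_primes(q, count):
    # first `count` primes p with p = 1 (mod q), found by scanning q*a + 1
    found, a = [], 1
    while len(found) < count:
        p = q * a + 1
        if is_prime(p):
            found.append(p)
        a += 1
    return found

def prime_bound(q, B):
    # minimal m such that the product of the first m q-primes is >= B
    prod, m, a = 1, 0, 1
    while prod < B:
        p = q * a + 1
        if is_prime(p):
            prod *= p
            m += 1
        a += 1
    return m

assert q_primes(6, 4) == [7, 13, 19, 31]
assert prime_bound(6, 91) == 2    # 7 * 13 = 91 >= 91
assert prime_bound(6, 92) == 3    # need 7 * 13 * 19
```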
In the proof of Lemma A.3 we use a known result on the density of primes in arithmetic progressions taken from [3]. For any x, q ∈ N, define θ(x, q) to be the sum of ln(p) over all q-primes p with p ≤ x; formally, θ(x, q) := Σ_{p ≤ x : p is a q-prime} ln(p). With this definition, we can now state the result about the density of primes in arithmetic progressions.

Now we have everything ready to prove Lemma A.3.
Proof of Lemma A.3. We first prove the bound on m. By the definition m = prime_bound_q(B), we get Π_{i=1}^{m−1} prime_q(i) < B. As ln(prime_q(i)) > 1 for every i, we have m − 1 < Σ_{i=1}^{m−1} ln(prime_q(i)) = ln(Π_{i=1}^{m−1} prime_q(i)) < ln(B), which implies m < ln(B) + 1. Now we prove the bound on prime_q(m). For this we set x := max{exp(8 · √q · ln³(q)), exp(q), 2q · ln(B)}.
In the remainder we give the O((Π_{i=1}^k r_i) · polylog(M))-time algorithm. For every t ∈ [ℓ], let D_t be the set of prime factors of c_t. We define R := Π_{j=1}^k r_j and observe that for any v ∈ Z it holds that |(g ⊛ h)(v)| ≤ R · M². Further define B := 3 · R · M² and q := Π_{c∈K} c = Π_{t=1}^ℓ c_t. Assume without loss of generality that q ≥ 3 and note that q depends only on the fixed finite set K and can therefore be viewed as a constant.
With this notation we can formally state the algorithm.

1. Iterate over the numbers of the form q · a + 1 for a ∈ {1, 2, …} and test each one for primality. The process continues until the product of the q-primes found exceeds B. Denote these primes by p_1, …, p_m.

2. For every i ∈ [m] and t ∈ [ℓ], iterate over all elements x ∈ F_{p_i} and test whether x^{c_t} ≡ 1 mod p_i and x^{c_t/d} ≢ 1 mod p_i for every d ∈ D_t. If so, then set x as the c_t-th root of unity in F_{p_i}.

3. For all i ∈ [m], use Theorem A.1 with the prime p_i and the appropriate roots of unity to compute the function f^(i) : Z → Z_{p_i} defined by f^(i)(v) := (g ⊛ h)(v) mod p_i for all v ∈ Z.

4. Define P := Π_{i=1}^m p_i and a function f_P : Z → Z_P as follows. For each v ∈ Z, use the Chinese Remainder Theorem (cf. Theorem A.2) to compute the value 0 ≤ f_P(v) < P such that f_P(v) ≡ f^(i)(v) mod p_i for all i ∈ [m].

5. Finally, compute the function f : Z → Z by setting f(v) := f_P(v) if f_P(v) < P/2 and f(v) := f_P(v) − P otherwise, for all v ∈ Z, and return f.

Before we move to proving the correctness, we first argue that the algorithm is well-defined. By definition, the first step computes the first m = prime_bound_q(B) q-primes, so that p_1 = prime_q(1), …, p_m = prime_q(m). It remains to show that, for every i ∈ [m] and t ∈ [ℓ], a c_t-th primitive root of unity in F_{p_i} exists. Indeed, since c_t divides p_i − 1 (which in turn holds as p_i ≡ 1 mod q and c_t divides q), such a root of unity exists. Moreover, as D_t contains all prime factors of c_t, one can easily show that it suffices to consider only the values x^{c_t/d} for d ∈ D_t to correctly decide whether x is a primitive c_t-th root of unity in F_{p_i}. The application of Theorem A.1 in the third step is possible as r_j ∈ K = {c_1, …, c_ℓ} for every j ∈ [k] and the roots of unity are computed by the second step.
Now we argue about the correctness of the algorithm.
Proof. As the algorithm is well-defined, the third step computes the convolution of g and h modulo p_i for every i ∈ [m]. Now fix some v ∈ Z. We define b(v) := (g ⊛ h)(v) mod P and observe 0 ≤ b(v) < P. Moreover, for every i ∈ [m] it holds that b(v) mod p_i = ((g ⊛ h)(v) mod P) mod p_i = (g ⊛ h)(v) mod p_i = f^(i)(v).
Since Theorem A.2 also guarantees that the reconstructed number is unique, it follows that f_P(v) = b(v), which implies f_P(v) = (g ⊛ h)(v) mod P. Now we focus on the last step. By the definition of m = prime_bound_q(B), it holds that P = Π_{i=1}^m p_i ≥ B = 3 · R · M². Consider the following cases. In case (g ⊛ h)(v) ≥ 0 we have 0 ≤ (g ⊛ h)(v) ≤ R · M² < P/2. This implies that f_P(v) = (g ⊛ h)(v) mod P = (g ⊛ h)(v) < P/2, and thus f(v) = f_P(v) = (g ⊛ h)(v). In case (g ⊛ h)(v) < 0 it holds that (g ⊛ h)(v) ≥ −R · M² > −P, so f_P(v) = (g ⊛ h)(v) + P ≥ P − R · M² > P/2, and hence f(v) = f_P(v) − P = (g ⊛ h)(v).
From Claim A.5 we know that the algorithm is correct and that the function f returned by the algorithm is indeed g ⊛ h. It only remains to analyze the running time of the procedure.
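The five steps can be sketched end to end on a toy instance; all parameters below are illustrative, and the naive modular convolution stands in for the FFT-based routine of Theorem A.1.

```python
import itertools
import random

# toy instance over Z = Z_3 x Z_3 with values in {-M, ..., M}
r = (3, 3)
Z = list(itertools.product(range(r[0]), range(r[1])))
M, R = 5, r[0] * r[1]
random.seed(7)
g = {z: random.randint(-M, M) for z in Z}
h = {z: random.randint(-M, M) for z in Z}

def conv_mod(p):
    # naive cyclic convolution modulo p (stand-in for step 3 / Theorem A.1)
    out = {}
    for v in Z:
        s = 0
        for x in Z:
            y = tuple((vi - xi) % ri for vi, xi, ri in zip(v, x, r))
            s += g[x] * h[y]
        out[v] = s % p
    return out

primes = [101, 103, 107]            # pairwise coprime; product exceeds B = 3*R*M^2
P = primes[0] * primes[1] * primes[2]
per_prime = {p: conv_mod(p) for p in primes}

def crt(residues, moduli):
    # step 4: combine residues with the Chinese Remainder Theorem
    s, Q = 0, 1
    for a, p in zip(residues, moduli):
        t = ((a - s) * pow(Q, -1, p)) % p
        s, Q = s + t * Q, Q * p
    return s

for v in Z:
    s = crt([per_prime[p][v] for p in primes], primes)
    val = s if s < P // 2 else s - P        # step 5: undo the wrap-around
    exact = sum(g[x] * h[tuple((vi - xi) % ri for vi, xi, ri in zip(v, x, r))]
                for x in Z)
    assert val == exact
```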
Proof. We consider each step on its own. 1. Since primality testing can be done in time polynomial in the representation size of the number, we can find the sequence p_1, …, p_m in time O(p_m · polylog(p_m)). By Lemma A.4, and since q is a constant, it follows that p_m ≤ max{exp(8 · √q · ln³(q)), exp(q), 2q · ln(B)}

Figure 1.1
The left figure illustrates an exemplary function f : D × D → D over the domain D := {a, b, c, d}. We highlight a cyclic partition in red, green, blue and yellow; each color represents a different minor of f. In the right figure we demonstrate that the red-highlighted minor can be represented as addition modulo 3 (after relabeling a → 0, b → 1 and c → 2). Hence the red minor has cost 3. The reader can further verify that the green and blue minors have cost 2 and the yellow minor has cost 1, hence the cost of this particular partition is 3 + 2 + 2 + 1 = 8.

Definition 2.2
(f-Convolution Problem (f-Convolution)). Let L, R and T be arbitrary finite sets and let f : L × R → T be an arbitrary function. The f-Convolution Problem is the following. Input: Two functions g : L^n → {−M, …, M} and h : R^n → {−M, …, M}. Task: Compute g ⊛_f h.
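The definition can be implemented directly as the brute-force baseline that the paper improves upon; an illustrative sketch.

```python
import itertools

def f_convolution(f, g, h, L, R, n):
    # direct evaluation of the definition; O(|L|^n * |R|^n) products
    conv = {}
    for vg in itertools.product(L, repeat=n):
        for vh in itertools.product(R, repeat=n):
            v = tuple(f(a, b) for a, b in zip(vg, vh))
            conv[v] = conv.get(v, 0) + g[vg] * h[vh]
    return conv

# tiny example: L = R = T = {0, 1} and f = XOR, i.e., the XOR Product
L = R = [0, 1]
g = {u: 1 for u in itertools.product(L, repeat=2)}   # all-ones functions
h = {u: 1 for u in itertools.product(R, repeat=2)}
conv = f_convolution(lambda a, b: a ^ b, g, h, L, R, 2)
assert conv[(0, 0)] == 4 and sum(conv.values()) == 16
```

For XOR, every target vector v has exactly |L|^n = 4 preimage pairs, so the all-ones inputs yield the constant-4 convolution.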

Theorem 2.3.
Let f : L × R → T such that L, R and T are finite. There is an algorithm for the f-Convolution problem running in O(c^n · polylog(M)) time, where

Theorem 1.1 is a corollary of Theorem 2.3 by setting L = R = T = D. The proof of Theorem 2.3 utilizes the notion of a cyclic partition. For any k ∈ N, let

Lemma 4.1.
Let f : L × R → T where L, R and T are finite sets. Then there is a cyclic partition P of f such that cost(P) ≤ (|L|/2) · (|R| + |T|/2) when |L| is even, and cost(P) ≤ |R| + ((|L| − 1)/2) · (|R| + |T|/2) when |L| is odd. The proof of Lemma 3.1 is included in Section 3 and the proof of Lemma 4.1 is included in Section 4. The proof of Lemma 3.1 uses an algorithm for Cyclic Convolution.

Definition 2.4 (Cyclic Convolution). Let k ∈ N and r ∈ N^k. Also, let g, h : Z → N be two functions where Z = Z_{r_1} × … × Z_{r_k}. The Cyclic Convolution of g and h is the function (g ⊛ h) : Z → N defined by

Algorithm 1
Cyclic Partition Algorithm for the f-Convolution problem.
Setting: Finite sets L, R and T, a function f : L × R → T, and a cyclic partition P of f of size m.
Input: g : L^n → {−M, …, M}, h : R^n → {−M, …, M}
1. Construct the projections of g and h with respect to p, for all p ∈ [m]^n (Lemma 3.5).
2. For every p ∈ [m]^n compute c_p = g_p ⊛ h_p (Cyclic Convolution, Definition 2.4).
3. Define r : T^n → Z by

For each i ∈ [|D|/2], the restriction of f to {d_1^(i), d_2^(i)} and D has a cyclic partition of cost at most (7/4)|D|. The union of these cyclic partitions forms a cyclic partition of f with cost at most (|D|/2) · (7/4)|D| = (7/8)|D|². To construct the cyclic partition for a fixed i ∈ [|D|/2], we find a maximal number r of pairwise disjoint pairs e^(j) = {e_1^(j), e_2^(j)} ⊆ D such that |{f(d_a^(i), e_b^(j)) | a, b ∈ {1, 2}}| ≤ 3 for each j ∈ [r], i.e., for each j at least one of the four values f(d_a^(i), e_b^(j)) coincides with another. Each such restriction is either a cyclic minor of cost at most 3 or can be decomposed into 3 trivial cyclic minors of total cost at most 3. We claim that r ≥ |D|/4. Indeed, assume that there are fewer than |D|/4 such pairs, i.e., r < |D|/4. Let D′ denote the set of the |D| − 2r > |D|/2 remaining values in D. As the set {f(d_a^(i), d) | d ∈ D′, a ∈ {1, 2}} can only contain at most |D| values, we can find another pair e

Figure 4.1
Example of the construction of a representation graph from the function f to obtain a cyclic partition. We put an edge between vertices u and v if there is an r_i with u = f(ℓ_0, r_i) and v = f(ℓ_1, r_i). We highlight an example decomposition of the edges into a cycle with 4 vertices (highlighted red) and three paths with 5, 2 and 4 vertices (highlighted blue, yellow and green, respectively). The cost of this cyclic partition is 4 + 5 + 2 + 4 = 15.

Definition 4.4
(Restriction of f). Let f : L × R → T be a function such that |L| = 2 and let G_f be the representation graph of f.

Lemma 4.6.
Let f : L × R → T be a function such that G_f is nice. Then f has a cyclic partition of cost at most |T| = |V(G_f)|.

Lemma 4.10.
Let f : L × R → T be a function with |L| = 2 and let G_f be the representation graph of f. Then there exists a cyclic partition P for f with cost(P) ≤ |E(G_f)| + Defi(G_f).

Lemma 4.11.
Let f : L × R → T be a function with |L| = 2 and let G_f be the representation graph of f. Then there is a cyclic partition P of f with cost(P) ≤ |V(G_f)| + |V_mid(G_f)|.

Figure 4.2
The function f from Lemma 4.16 which shows that the bound from Lemma 4.2 is tight. The representation graph of f is depicted on the right. We highlight the cyclic partition produced by Lemma 4.10: the red path contains 4 vertices and the blue path contains 2 vertices. Hence, the cost of this cyclic partition is 6. Lemma 4.16 shows that this is best possible.

Remark 4.15.
If |L| and |R| are both even, one can easily achieve a cost of min

Lemma 4.16.
There exist sets L, R, and T with |L| = 2 and a function f : L × R → T such that every cyclic partition P of f has cost(P) ≥ |R| + |T|/2.

The main idea is to represent the f-Query problem as a matrix multiplication problem, inspired by a graph interpretation of f-Query. Let D be an arbitrary set and f : D × D → D; we assume D and f are fixed throughout this section. Let g, h : D^n → {−M, …, M} and v ∈ D^n be an f-Query instance. We use a ∘ b to denote the concatenation of a ∈ D^m and b ∈ D^k, that is, (a_1, …, a_m) ∘ (b_1, …, b_k) = (a_1, …, a_m, b_1, …, b_k).

Figure 5.1
Construction of the directed multigraph G. Each vertex in a layer corresponds to a vector in D^{n/2}. We highlight 4 vectors w, x, y, z ∈ D^{n/2}, each in a different layer. Note that the number of 4-cycles that pass through all four of w, x, y, z equals g(w ∘ x) · h(z ∘ y). The total number of directed 4-cycles in this graph corresponds to the value (g ⊛_f h)(v) and to tr(W · X · Y · Z).

Proof of Theorem 1.4.
Now we have everything ready to give the algorithm for f-Query. The algorithm works in two steps: 1. Compute the transition matrices W, X, Y, and Z of g, h and v as described above. 2. Compute and return tr(W · X · Y · Z). By Lemma 5.1 this algorithm returns (g ⊛_f h)(v). Computing the transition matrices in Step 1 requires O(|D|^n · polylog(M)) time; observe that the maximal absolute value of an entry in the transition matrices is M. The computation of W · X · Y · Z in Step 2 requires three matrix multiplications of |D|^{n/2} × |D|^{n/2} matrices, which can be done in O((|D|^{n/2})^ω · polylog(M)) time. Thus, the overall running time of the algorithm is O(|D|^{ω·n/2} · polylog(M)).

Figure 6.1
Three concrete examples of functions f for which we expect that the running times for f-Convolution should be O(3^n · polylog(M)), O(3^n · polylog(M)) and O(4^n · polylog(M)), respectively. However, the best cyclic partitions for these functions have costs 4, 4 and 5 (the partitions are highlighted appropriately). This implies that the best running times that may be attained using our techniques are O(4^n · polylog(M)), O(4^n · polylog(M)) and O(5^n · polylog(M)).

There is an O((Π_{i=1}^k r_i) · polylog(M))-time algorithm for the K-Cyclic Convolution Problem.

Proof of Theorem 2.6. Fix a finite set K = {c_1, …, c_ℓ} ⊆ N, which is considered a constant throughout this proof. Let the integers k, M ∈ N, the vector r ∈ K^k and the functions g, h : Z → {−M, …, M} with Z = Z_{r_1} × … × Z_{r_k} be an input for the K-Cyclic Convolution Problem.
Let k_1, …, k_m be the different costs of the cyclic minors in P. By Theorem 2.6, for any type p ∈ [m]^n the computation of g_p ⊛ h_p in Line 2 is an instance of the K-Cyclic Convolution Problem, which can be solved in time O((
