1 We Are in the Era of Big Data

The twenty-first century can be called the era of Big Data. The number of webpages on the Internet was estimated to be more than 1 trillion (=\(10^{12}\)) in 2008 [22], and the number of websites has grown roughly tenfold over the past 10 years [21]. Thus the number of webpages is now estimated to be more than 10 trillion (=\(10^{13}\)). If we assume that \(10^6~\mathrm{bytes}~(\approx 10^7~\mathrm{bits})\) of data is contained in a single webpage on average,Footnote 1 then the total amount of data stored on the Internet would be more than 100 exabits (=\(10^{20}\) bits)! The various actions that each of us performs are collected by our smartphones and stored on storage devices around the world. The remarkable development of computer memory has made it possible to store this information.
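
Spelled out, this estimate is simply the product of the number of webpages and the assumed average size of a page:

\[
10^{13}~\text{webpages} \times 10^{7}~\text{bits/webpage} = 10^{20}~\text{bits} = 100~\text{exabits}.
\]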

However, the ability to store data and the ability to make good use of it are different problems. The data transfer speed of IEEE 802.11ac is 6.9 Gbps. At this rate, it would take 1.7 days to read 1 petabit (\(10^{15}\) bits) of data. To read 1 exabit (\(10^{18}\) bits) of data, we would need over 4 years! Although the speed of data transfer is expected to continue to increase, the amount of available data is expected to grow even faster.
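
For concreteness, these reading times follow from dividing the data size by the transfer rate:

\[
\frac{10^{15}~\text{bits}}{6.9 \times 10^{9}~\text{bits/s}} \approx 1.4 \times 10^{5}~\text{s} \approx 1.7~\text{days},
\qquad
\frac{10^{18}~\text{bits}}{6.9 \times 10^{9}~\text{bits/s}} \approx 1.4 \times 10^{8}~\text{s} \approx 4.6~\text{years}.
\]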

This situation creates problems that did not arise in past centuries, such as requiring a huge amount of time just to read an entire dataset. We are thus faced with new challenges in terms of computation.

2 Theory of Computational Complexity and Polynomial-Time Algorithms

In the theory of computational complexity, the term “polynomial-time algorithm” is often used as a synonym for “efficient algorithm.” A polynomial-time algorithm is an algorithm that runs in time bounded by a polynomial in the size of the instance (i.e., the input). For example, consider the sorting problem, which takes a set of positive integers \(a_1, \ldots, a_n\) as input and outputs a permutation \(\pi : \{ 1, \ldots , n \} \rightarrow \{ 1, \ldots , n \}\) such that \(a_{\pi (i)} \le a_{\pi (i+1)}\) for every \(i \in \{ 1, \ldots , n-1 \}\). In this problem, the input is expressed by n integers and thus the input size is n.Footnote 2
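
As a toy illustration (our own Python sketch, not part of the original text; the function name and the use of 0-based indices are our choices), such a permutation can be computed as follows:

```python
def sorting_permutation(a):
    """Return a permutation pi of {0, ..., n-1} such that
    a[pi[i]] <= a[pi[i+1]] for every i, i.e., a solution to the sorting
    problem described above (with 0-based instead of 1-based indices)."""
    return sorted(range(len(a)), key=lambda i: a[i])

a = [31, 4, 15, 9, 26]
pi = sorting_permutation(a)
print(pi)                    # [1, 3, 2, 4, 0]
print([a[i] for i in pi])    # [4, 9, 15, 26, 31]
```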

We now briefly introduce the theory of computational complexity. Theoretically, the computation time of an algorithm is expressed in terms of the number of basic units of calculation (i.e., the basic arithmetic operations, reading or writing a value in a cell of memory, and comparisons of two valuesFootnote 3). The complexity is then expressed as a function of n, say T(n), where n is the (data) size of the input. If there exists a fixed integer k such that \(T(n)=O(n^k)\), then we say that the algorithm runs in polynomial time.

For example, the sorting problem can be solved in \(O(n \log n)\) time, which is polynomial, and it has been proven that this is the minimum in the big-O sense for comparison-based algorithms, meaning that no comparison-based sorting algorithm runs in \(o(n \log n)\) time. In contrast, for the partitioning problem, which is the problem of finding a subset B of a given set A consisting of n integers \(a_1, \ldots, a_n\) such that \(\sum_{a_i \in B} a_i = \frac{1}{2} \sum_{a_i \in A} a_i\), no polynomial-time algorithm has been found, and the majority of researchers believe that no such algorithm exists.Footnote 4

For many problems, constructing an exponential-time algorithm is easy. For the partitioning problem, for example, an algorithm that tests all subsets of A clearly solves the problem, and this requires \(2^n \cdot O(n)\) time, which is exponential. The existence of an exponential-time algorithm is therefore trivial in many cases, whereas constructing a polynomial-time algorithm often requires additional ideas.
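
As a minimal sketch (our own Python illustration, not from the original text; the function name and example inputs are chosen only for illustration), the brute-force algorithm just described can be written as follows:

```python
from itertools import combinations

def partition_exists(a):
    """Brute-force algorithm for the partitioning problem: is there a subset B
    of a whose sum equals half the total sum?  All 2^n subsets are tested and
    each subset sum costs O(n), so the running time is O(2^n * n)."""
    total = sum(a)
    if total % 2 != 0:
        return False  # an odd total can never be split into two equal halves
    target = total // 2
    n = len(a)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(a[i] for i in subset) == target:
                return True
    return False

print(partition_exists([3, 1, 4, 2]))  # True: {3, 2} and {1, 4} both sum to 5
print(partition_exists([1, 2, 4]))     # False: the total 7 cannot be halved
```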

3 Polynomial-Time Algorithms and Sublinear-Time Algorithms

3.1 A Brief History of Polynomial-Time Algorithms

The idea that “polynomial-time algorithms are efficient” is sometimes called Cobham’s Thesis or the Cobham–Edmonds Thesis, named after Alan Cobham and Jack Edmonds [4]. Cobham [3] identified the tractable problems with the complexity class P, the class of problems solvable in polynomial time with respect to the input size. Edmonds stated essentially the same idea in [7].

Although these papers were published in 1965, the idea behind this thesis seems to have been a commonly held belief among researchers already in the late 1950s. For example, Kruskal’s algorithm and Prim’s algorithm, both almost linear-time algorithms for the minimum spanning tree problem, were presented in 1956 [16] and 1957 [17], respectively. Dijkstra’s algorithm, an almost linear-time algorithm for the shortest path problem with positive edge lengths, was presented in 1959 [6]. Ford and Fulkerson presented the augmenting path algorithm for the maximum flow problem in 1956 [8]. The blossom algorithm for the maximum matching problem on general (i.e., not necessarily bipartite) graphs was proposed by Jack Edmonds in 1961 [7].

In 1971, Cook proposed the idea of NP-completeness and proved that the satisfiability problem (SAT) is NP-complete [5]. Intuitively, NP-complete problems are the most difficult problems in the class NP, the set of problems that can be solved in polynomial time by nondeterministic Turing machines. Although we do not yet have a proof, many researchers believe that no polynomial-time algorithm exists for any NP-complete problem.Footnote 5 Cook’s study created a new field of research through which countless combinatorial problems have been found to be NP-complete [10].

By definition, it is trivial that every problem in NP can be solved in exponential time (by a Turing machine). The theory of NP-completeness explicitly and firmly fixed the idea that “polynomial-time algorithms are efficient” in the minds of researchers. We would like to call this idea the polynomial computation paradigm.

Many important polynomial-time algorithms are now known, including the two basic polynomial-time algorithms for the linear programming problem (LP), namely the ellipsoid method proposed by Khachiyan in 1979 [15] and the interior-point method proposed by Karmarkar in 1984 [13]; the strongly polynomial-time algorithm for the minimum cost flow problem proposed by Éva Tardos in 1985 [19]; the linear-time shortest path algorithm for positive integer edge lengths proposed by Mikkel Thorup in 1997 [20]; and the deterministic polynomial-time algorithm for primality testing proposed by Agrawal, Kayal, and Saxena in 2002 [1]. These algorithms pioneered new perspectives in the field of algorithm research. They are gems that were found under the polynomial computation paradigm.

3.2 Emergence of Sublinear-Time Algorithms

Although linear-time algorithms have naturally been considered the fastest possible, since intuitively we have to read all the data to solve a problem, the new idea of “sublinear-time algorithms” emerged at the end of the twentieth century. Sublinear-time algorithms read only a sublinear (i.e., o(n)) amount of data from the input.

The most popular framework for sublinear-time algorithms is “property testing.” This idea was first presented by Rubinfeld and Sudan [18] in 1996 (a conference version appeared even earlier, in 1992) in the context of program checking. In this paper, they introduced the notions of the “distance” between an instance (e.g., a function) and a property (e.g., linearity) and of “\(\epsilon\)-farness,” and they gave constant-time testers for some properties of functions. The constant-time testability of combinatorial (mainly graph) structures was first studied by Goldreich, Goldwasser, and Ron [11], whose work was presented at a conference in 1995 (STOC’95). After the turn of the century, many studies following this idea of testability have appeared, and the importance of this field continues to grow [2, 9].

3.3 Property Testing and Parameter Testing

A property \(\mathcal{P}\) is defined as a (generally infinite) set of instances. The distance between an instance I and \(\mathcal{P}\) is defined as the minimum Hamming distance between I and any \(I' \in \mathcal{P}\), normalized to lie in [0, 1] (i.e., \(\mathrm{dist}(I,\mathcal{P}) \in [0,1]\)). If an instance has the property, the distance is zero (i.e., \(\mathrm{dist}(I,\mathcal{P})=0\) if \(I \in \mathcal{P}\)). For an \(\epsilon \in [0,1]\), we say that I is \(\epsilon\)-far from \(\mathcal{P}\) if \(\mathrm{dist}(I,\mathcal{P}) \ge \epsilon\), and \(\epsilon\)-close otherwise. A testing algorithm (or tester for short) for \(\mathcal{P}\) accepts a given instance I with probability at least 2/3 if I has \(\mathcal{P}\), and rejects it with probability at least 2/3 if I is \(\epsilon\)-far from \(\mathcal{P}\).

If a tester exists for a property whose running timeFootnote 6 is bounded by a constant independent of the size of the input, then we say that the property is testable.Footnote 7 This framework is called property testing.
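
As a concrete toy example (our own Python sketch under simplifying assumptions, not taken from the original text), consider testing whether a length-n binary string is the all-zeros string. A string is \(\epsilon\)-far from this property if at least \(\epsilon n\) of its bits are 1, so sampling \(O(1/\epsilon)\) random positions suffices, and the number of queries is independent of n:

```python
import math
import random

def test_all_zeros(query, n, eps):
    """Toy tester for the property "the length-n binary string is all zeros".
    The string is accessed only through query(i), which returns the i-th bit.
    The tester reads ceil(2/eps) positions, a number independent of n.
    It always accepts an all-zeros string, and it rejects any string that is
    eps-far (i.e., has at least eps*n ones) with probability greater than 2/3."""
    for _ in range(math.ceil(2 / eps)):
        if query(random.randrange(n)) == 1:
            return False  # reject: a violating position was found
    return True           # accept

# Usage: a string in which 10% of the bits are 1 is 0.1-far from all zeros.
n = 10**6
ones = set(random.sample(range(n), n // 10))
print(test_all_zeros(lambda i: int(i in ones), n, eps=0.1))  # rejects w.p. > 2/3
```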

Property testing is a relaxation of the framework of decision problems. The corresponding relaxation of the framework of optimization problems is parameter testing. In parameter testing, we try to approximate the optimum value of the objective function with an additive error of at most \(\epsilon N\), where N is the maximum possible value of the objective function.
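
In the same toy spirit (again our own Python sketch, not from the original text), the number of ones in a length-n binary string can be estimated within additive error \(\epsilon n\) from a number of samples that is independent of n:

```python
import math
import random

def estimate_num_ones(query, n, eps, delta=1/3):
    """Toy parameter-testing-style estimator: approximate the number of ones in
    a length-n binary string (accessed via query(i)) within additive error
    eps*n, except with failure probability at most delta.
    By the Hoeffding bound, s = ceil(ln(2/delta) / (2*eps^2)) samples suffice;
    s does not depend on n."""
    s = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    hits = sum(query(random.randrange(n)) for _ in range(s))
    return round(n * hits / s)

# Usage: a string with exactly 300,000 ones out of 10^6 positions.
n = 10**6
ones = set(random.sample(range(n), 300_000))
print(estimate_num_ones(lambda i: int(i in ones), n, eps=0.01))
# within 10,000 of 300,000 with probability at least 2/3
```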

This idea appeared at the end of the twentieth century, and was further developed in this century. See Chaps. 2 and 3 for these themes.

4 Ways to Decrease Computational Resources

In addition to property and parameter testing, there are various methods for decreasing the amount of computational resources needed for handling big data. Although some methods may require linear computation, each of them has strong merits. We briefly introduce these methods in this section.

4.1 Streaming Algorithms

Property testing generally relies on the assumption that an algorithm can read any position (cell) of the input. However, this may be difficult in some situations, for example if the data arrives as a stream (sequence) and the algorithm is required to read the values one by one in the order of arrival. The key assumption of this framework is that the algorithm does not have enough memory to store the entire input. For example, to find the maximum value in a sequence of integers \(a_1, \ldots, a_n\), it is enough to use O(1) cells of memory.Footnote 8
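
A minimal Python sketch of this example (our own illustration, not from the original text) keeps only a single value in memory while the stream is read once:

```python
def streaming_max(stream):
    """Streaming sketch: find the maximum of a (possibly huge) stream of
    integers using O(1) cells of memory -- only the current maximum is kept."""
    current_max = None
    for value in stream:          # values are read once, in order of arrival
        if current_max is None or value > current_max:
            current_max = value
    return current_max

# Usage: works on a generator, so the whole sequence is never stored in memory.
print(streaming_max(x * 7 % 101 for x in range(10**6)))  # 100
```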

Although this approach requires linear computation time, since all of the data must be read, the amount of memory required is constant in many cases. If we assume that the order of data arrival in the stream is random, then the setting becomes close to that of (nonadaptiveFootnote 9) property testing. In this book, streaming algorithms are covered in Chap. 16.

4.2 Compression

Compression is a traditional and typical method for handling digital data. Basically, there are two types of compression. One type compresses the data without losing any information; for this type, there is an information-theoretic lower bound on the compressed data size. This method is used when the original data needs to be reconstructed perfectly from the compressed data, and it is thus called lossless compression or reversible compression. The other type allows some of the information to be discarded, so that the decompressed data is only an approximation of the original. Although such algorithms can compress data drastically, it is not possible to reconstruct the original data perfectly from the compressed data, and they are thus called lossy compression or irreversible compression. This method works remarkably well in the fields of music and image compression. See Chaps. 6, 7, 10, and 16 in this book for results from this area.
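
As a brief illustration of lossless compression (our own sketch using Python's standard zlib module; the input data is arbitrary), the original data is recovered exactly from its compressed form:

```python
import zlib

# Lossless compression round trip: compress, decompress, and verify that the
# original data is reconstructed perfectly.
original = b"abracadabra " * 1000            # highly repetitive, so it compresses well
compressed = zlib.compress(original, 9)       # level 9 = maximum compression
restored = zlib.decompress(compressed)

print(len(original), len(compressed))         # the compressed form is far smaller
assert restored == original                   # lossless: reconstruction is exact
```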

4.3 Succinct Data Structures

When compressed data is used, it generally needs to be decompressed first. However, decompression requires extra computation. It is therefore useful to be able to use compressed data as-is, without decompression. Succinct data structures are a framework that realizes this idea. Specifically, succinct data structures use an amount of space that is close to the information-theoretic lower bound while still allowing efficient (fast) query operations. These structures involve a tradeoff between space and time. See Chaps. 8 and 9 for details.
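
To give a flavor of such query operations, the following toy Python sketch (our own illustration, not from the chapters cited above) answers rank queries on a bit vector using precomputed block counts; a real succinct data structure would additionally keep the bits packed and the auxiliary counts within o(n) extra bits:

```python
class RankBitVector:
    """Toy rank structure over a bit vector: per-block counts are precomputed
    so that rank(i) (= number of 1s among the first i bits) is answered by one
    table lookup plus a scan of at most one block, instead of scanning the
    whole vector.  This only illustrates the query mechanics."""

    def __init__(self, bits, block_size=64):
        self.bits = bits
        self.block_size = block_size
        # block_rank[j] = number of 1s among the first j*block_size bits
        self.block_rank = [0]
        for j in range(0, len(bits), block_size):
            self.block_rank.append(self.block_rank[-1] + sum(bits[j:j + block_size]))

    def rank(self, i):
        """Number of 1s in bits[0:i]."""
        j = i // self.block_size
        return self.block_rank[j] + sum(self.bits[j * self.block_size:i])

# Usage
bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0] * 1000)
print(bv.rank(5))   # 3 ones among the first 5 bits
print(bv.rank(16))  # 8 ones among the first 16 bits
```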

5 Need for the Sublinear Computation Paradigm

5.1 Sublinear and Polynomial Computation Are Both Important

The fact that the sublinear computation paradigm has become necessary does not mean that the polynomial computation paradigm is obsolete. Polynomial computation is still important for ordinary computations. Sublinear computation is typically needed when we have to handle big data; in such cases, traditional polynomial computation is sometimes too slow.

This relationship between the polynomial computation paradigm and the sublinear computation paradigm is analogous to the relationship between Newtonian mechanics and the theory of relativity in physics. While Newtonian mechanics is used for ordinary physical calculations, the theory of relativity is needed to calculate the motion of very fast objects such as rockets, satellites, or electrons. We entered the era of the theory of relativity in the twentieth century and the era of sublinear computation in the twenty-first century.

5.2 Research Project ABD

A research project named “Foundations on Innovative Algorithms for Big Data (ABD),”Footnote 10 in which the sublinear computation paradigm is the central concept, was started by JST CREST, Japan, in October 2014 and concluded in September 2021. The total budget was more than 300 million yen. Although the project had 24 members at its inception, many more researchers later joined, and the final number of regular members exceeded 40. The leader of the project was Prof. Naoki Katoh of the University of Hyogo.Footnote 11 The project consisted of three groups: the Sublinear Algorithm Group (Team A), led by Prof. Katoh; the Sublinear Data Structure Group (Team D), led by Prof. Tetsuo Shibuya of the University of Tokyo; and the Sublinear Modeling Group (Team M), led by Prof. Kazuyuki Tanaka of Tohoku University. In this project, we worked on problems in big data computation. The main purpose of this book is to introduce the results of this project. A special issue of The Review of Socionetwork Strategies [14] on this project is also available. While some of the methods adopted in this project are not sublinear, we are confident that every piece of research conducted under the project is useful and will form the foundations of innovative algorithms for big data!

5.3 The Organization of This Book

This part of the book, Part I, has provided an introduction. Parts II, III, and IV present the theoretical results of Teams A, D, and M, respectively. Application results leading to scientific and technological innovation are compiled in Part V.