1 Introduction

Computer systems pervade all parts of human activity: transportation systems, energy supply, medicine, the whole financial sector, and modern science have become unthinkable without hardware and software support. As these systems continuously acquire, process, exchange, and store data, we live in a big-data world where information is accumulated at an exponential rate.

The pressing problem has shifted from collecting enough data to coping with its rapid growth and sheer abundance. In particular, data volumes often grow faster than the transistor budget of computers as predicted by Moore’s law (i.e., doubling every 18 months). On top of this, we can no longer rely on transistor budgets to automatically translate into application performance: the speed improvement of single processing cores has essentially stalled, and the requirements on algorithms that use the full memory hierarchy become more and more complicated. As a result, algorithms have to be massively parallel and use memory access patterns with high locality. Furthermore, an x-times improvement in machine performance only translates into x-times larger manageable data volumes if we have algorithms that scale nearly linearly with the input size. All of these challenges call for new algorithmic ideas. Last but not least, to have maximum impact, one should not only strive for theoretical results but also follow the whole algorithm engineering development cycle, in which theoretical work is complemented by implementation and experimental evaluation.

The “curse” of big data in combination with increasingly complicated hardware has reached all kinds of application areas: genomics research, information retrieval (web search engines, ...), traffic planning, geographical information systems, or communication networks. Unfortunately, most of these communities do not interact in a structured way even though they are often dealing with similar aspects of big-data problems. Frequently, they face poor scale-up behavior from algorithms that have been designed based on models of computation that are no longer realistic for big data.

In 2013, the German Research Foundation (DFG) established the priority programme SPP 1736: Algorithms for Big Data (https://www.big-data-spp.de), in which researchers from theoretical computer science work together with application experts in order to tackle some of the problems discussed above. A nationwide call for individual projects attracted over 40 proposals, out of which an international reviewer panel selected 15 funded research projects plus a coordination project (totalling about 20 full PhD student positions) by the end of 2013. Additionally, a few more projects with their own funding have been associated with the programme in order to benefit from the collaboration and joint events (workshops, PhD meetings, summer schools, etc.) organised by the SPP.

In the following, we give a short overview of the research topics and groups represented in the programme and highlight a few results obtained within the first funding period (2014–2017). Two project leaders also contributed separate articles to this special issue: H. Bast on a quality evaluation of combined search on a knowledge base and text, and M. Mnich on big data algorithms beyond machine learning.

2 Funded Research Projects

Most of the funded projects concentrate on big-data algorithms beyond machine learning and frequently tackle more than one of the following areas: (1) technological challenges, (2) fundamental algorithmic techniques, and (3) applications. There are various ways in which the respective research topics could be clustered; here is one attempt:

2.1 Technological Challenges

Several projects are concerned with algorithmically mastering constraints in the way data can be efficiently accessed, compactly maintained, and processed in parallel. Besides mere execution time and solution quality, further metrics such as energy consumption and limited data lifetime come into play:

  (P1)

    Energy-Efficient Scheduling S. Albers (TU München)

    This project explores methods to reduce the total energy consumption via scheduling on speed-scalable processors, which typically consume much less energy when run more slowly. Since jobs come with deadlines, slowing down is not always possible, but preemption and migration of jobs open up additional optimization opportunities. The scheduling objective is to minimize the total energy consumption while respecting all constraints. Albers et al. considered non-homogeneous settings where the trade-offs between speed and energy may differ among the processors used [2]. The authors improve the state of the art by providing several new approaches that are conceptually simpler and hence more practical than previous solutions. (A minimal sketch of the underlying speed-scaling trade-off follows after this list.)

  (P2)

    Dynamic, Approximate, and Online Methods for Big Data U. Meyer (U Frankfurt/M)

    One line of research in the project deals with dynamic, approximate, and online methods in the context of parallelism and memory hierarchies. An important application area is graph algorithms (see [22] and Sect. 3 for a more detailed treatment of joint results on graph generation in parallel external memory). Another line of research in (P2) aims to use methods from game theory (truthful mechanisms) in order to reasonably solve memory assignment problems for concurrently running programs in shared-memory environments. This is particularly challenging if the users do not have to pay money for the RAM their programs occupy: in the absence of money, selfish programmers may claim to need unreasonably large chunks of central memory for a “fast” execution of their programs. First results in a static setting with fixed RAM chunk sizes appeared in [18], where forced waiting times are used as a currency in order to obtain a truthful mechanism that returns solutions minimizing the makespan, i.e. the maximum completion time.

  (P3)

    Distributed Data Streams in Dynamic Environments F. Meyer auf der Heide (U Paderborn)

    The research topic of this project is the design and analysis of distributed algorithms that continuously compute functions of data streams arising from many devices of potentially different types. Due to the huge volume and velocity of the data, the streams can neither be stored completely, nor sent to a central server via a network, nor fully processed in real time. Initial results concern, among others, the communication complexity of so-called distributed aggregation problems; Mäcker et al. [20] considered the expected message complexity of the top-k Position Monitoring problem, where the task is to compute the IDs of the devices that observe the k largest items at every time step (a naive centralized baseline is sketched after this list). They also gave an approximation variant [21].

  (P4)

    3D+T Terabyte Image Analysis R. Mikut and P. Sanders (KIT Karlsruhe)

    The data dealt with in this project stems from light-sheet fluorescence microscopy, which is frequently applied in developmental biology in order to perform long-term observations of embryonic development. An exemplary application is the tracking of objects (e.g., cell nuclei, cytoplasm, nanoparticles) in microscopic images, where different object classes are labelled with particular fluorescent dyes. In the project, time series of high-resolution 3D images of developing zebrafish embryos yield more than 10 terabytes per embryo, which is significantly more than state-of-the-art software tools can typically handle. While striving for improved algorithms on modern hardware, it is also important to carefully test the result quality of these new approaches. To this end, the project has successfully investigated methods to create large, realistic, simulated inputs with ground truth that can be used to quantitatively assess result quality [30].

  (P5)

    Kernelization for Big Data M. Mnich (U Bonn)

    The project is concerned with kernelization in big-data contexts. Given a concrete optimization question q, a kernelization algorithm compresses a data set A to \(A'\) such that q can still be answered from \(A'\). Ideally, the size of \(A'\) is much smaller than that of A and depends only polynomially on some structures capturing particular aspects of the optimization question (and not on the size of A). As an example, Etscheid and Mnich discussed kernelization techniques for the Max-Cut problem [14]; more details are provided in Mnich’s article on big data algorithms beyond machine learning in this special issue. (A toy illustration of the kernelize-then-solve pattern, using the unrelated Vertex Cover problem, follows after this list.)
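
To make the speed-scaling trade-off exploited in (P1) concrete, the following Python sketch evaluates the energy of a single job under the power model \(P(s)=s^{\alpha }\) with \(\alpha > 1\) that is standard in the speed-scaling literature. It is a toy illustration only, not an algorithm from the project, and the exponent and numbers are arbitrary example values.

    def energy(work, duration, alpha=3.0):
        """Energy for processing `work` units at the constant speed work/duration,
        assuming the common speed-scaling power model P(s) = s**alpha."""
        speed = work / duration
        return speed ** alpha * duration      # equals work * speed**(alpha - 1)

    # The same job, once finished as fast as possible and once just in time:
    print(energy(work=10, duration=1))        # 1000.0 (speed 10)
    print(energy(work=10, duration=5))        # 40.0   (speed 2, i.e. 25x less energy)

Running a job as slowly as its deadline allows is therefore energetically preferable, which is exactly why deadlines, preemption, and migration interact in non-trivial ways.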
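
The top-k Position Monitoring problem studied in (P3) is equally easy to state. The sketch below is the naive centralized baseline in which every device ships its current observation to the coordinator in every time step; the device names and values are made up, and the point of [20, 21] is precisely to avoid most of this communication.

    import heapq

    def top_k_ids(observations, k):
        """Naive baseline: the coordinator receives every (device, value) pair
        and reports the IDs of the devices holding the k largest values."""
        largest = heapq.nlargest(k, observations.items(), key=lambda kv: kv[1])
        return [device for device, _ in largest]

    print(top_k_ids({"d1": 17, "d2": 4, "d3": 23, "d4": 11}, k=2))   # ['d3', 'd1']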
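
The kernelize-then-solve pattern of (P5) can be illustrated with a classical textbook example that is deliberately not taken from the project: the Buss rule for Vertex Cover. The sketch shrinks an instance to a kernel with at most \(k^{2}\) edges, so that any downstream solver, however expensive, only has to deal with the small kernel; the Max-Cut kernels of [14] are considerably more involved.

    from collections import defaultdict

    def buss_kernel(edges, k):
        """Classical Vertex Cover kernelization (illustrative stand-in); expects a
        simple graph and returns (kernel_edges, remaining_k, forced_vertices),
        or None if no vertex cover of size at most k can exist."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v); adj[v].add(u)
        forced, progress = [], True
        while progress and k >= 0:
            progress = False
            for v in list(adj):
                if len(adj[v]) > k:            # v belongs to every cover of size <= k
                    forced.append(v)
                    for w in adj[v]:
                        adj[w].discard(v)
                    del adj[v]
                    k -= 1
                    progress = True
                    break
        kernel = {frozenset((u, v)) for u in adj for v in adj[u]}
        if k < 0 or len(kernel) > k * k:       # max degree <= k, so > k^2 edges is infeasible
            return None
        return [tuple(e) for e in kernel], k, forced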

2.2 Graphs

Another cluster of projects is mainly concerned with various kinds of graph problems which become very challenging once the input data is really big:

  (P6)

    Skeleton-based Clustering in Big and Streaming Social Networks U. Brandes (U Konstanz) and D. Wagner (KIT Karlsruhe)

    The scientific goal of the project is to devise novel methods to cluster large-scale static and dynamic online social networks. The approach is based on skeleton structures, i.e. sparse (sub-)graphs that represent the essential structural properties of the input graphs. Besides supporting efficient clustering approaches, these skeletons are used to find patterns in online social relationships and interactions. An example concerns components of quasi-threshold graphs (graphs without induced paths or cycles on four vertices), since they share features frequently found in social network communities. Communities are then detected by finding a quasi-threshold graph that is close to a given graph in terms of edge edit distance. The problem is \({{\mathcal {N}}}{{\mathcal {P}}}\)-hard, and existing FPT approaches also fail to scale to real-world data. Hence, the project introduced Quasi-Threshold Mover (QTM), the first scalable quasi-threshold editing heuristic [9]. QTM constructs an initial skeleton forest and then refines it by moving vertices to reduce the number of edits required. (P6) is also active in graph visualization and graph generation (cf. Sect. 3).

  (P7)

    Engineering Algorithms for Partitioning Large Graphs P. Sanders, Ch. Schulz, and D. Wagner (KIT Karlsruhe)

    (Hyper-)graph partitioning is crucial in many big-data graph applications as it subdivides the problem instance into smaller (and thus more manageable) pieces with little interaction. Unfortunately, hypergraph partitioning is \({\mathcal {N}}{\mathcal {P}}\)-hard, and it is even \({\mathcal {N}}{\mathcal {P}}\)-hard to obtain good approximations. Therefore, multi-level heuristics are applied in practice. Project (P7) has significantly contributed to the large body of work in this area; see [10] for an overview. Their recent k-way partitioning result [1] represents the state of the art in high-quality hypergraph partitioning: it consistently computes better solutions than its competitors while even being faster than some of them.

  (P8)

    Competitive Exploration of Large Networks Y. Disser (TU Darmstadt) and M. Klimm (HU Berlin)

    This project looks into algorithms that operate on very large networks and into the dynamics that arise from the competition or cooperation between such algorithms. An initial result concerned the exploration of an unknown undirected graph with n vertices by an agent possessing very little memory [12]. While matching upper and lower memory bounds of \(\Theta (\log n)\) had been shown before for this setting, the project reduced the memory requirement of the agent to \(O(\log \log n)\) for bounded-degree graphs, provided the agent additionally has access to \(O(\log \log n)\) indistinguishable markers, called pebbles. A pebble can be dropped or collected whenever the agent visits a vertex, leaving or removing a mark. (P8) also showed that for sub-linear agent memory, \(\Omega (\log \log n)\) pebbles are required.

  (P9)

    Algorithms for Solving Time-Dependent Routing Problems with Exponential Output Size M. Skutella (TU Berlin)

    Methods for solving static routing problems have been successfully optimized over many decades. Unfortunately, real-life applications such as evacuation planning, logistics planning, or navigation systems for road networks crucially depend on dynamic edge costs that change over time (and may even depend on the solution). The standard approach of building a huge time-expanded network, whose size can be exponential in the input size, becomes infeasible for big-data graphs due to memory limitations (the construction and its blow-up are sketched after this list). Hence, project (P9) investigates alternative methods that try to avoid this data explosion. For example, Schlöter and Skutella presented memory-efficient solutions for evacuation problems [27].

  (P10)

    Local Identification of Central Nodes, Clusters, and Network Motifs in Very Large Complex Networks K. Zweig (TU Kaiserslautern)

    This project focuses on the development of local methods to compute classic network-analytic measures such as centrality indices, network motifs (subgraphs), and clusterings. Commonly, these measures are based on global properties of the graph, such as the distances between all pairs of vertices or a global ranking of similar pairs of vertices or edges, thus resulting in at least quadratic time complexity. Hence, it is difficult to scale those fundamental approaches directly to big-data graphs. Recent work in this direction concerns the identification of network motifs in the so-called fixed degree sequence model (FDSM), which refers to the set of all graphs with the same degree sequence excluding multi-edges and self-loops (a naive FDSM sampler is sketched after this list). Schlauch and Zweig proposed a set of equations, based on the degree sequence and a simple independence assumption, to estimate the occurrence of a set of subgraphs in the FDSM and empirically supported their findings [26]. Other parts of the research in this project have also been included in a newly published textbook [33].
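
The memory explosion that (P9) tries to avoid is easy to demonstrate. The sketch below builds the textbook time-expanded network for integral time steps up to a given horizon; all node names, arcs, and travel-time functions are invented example data, and even this tiny instance multiplies every node into one copy per time step.

    import itertools

    def time_expanded_network(nodes, arcs, horizon):
        """Textbook time expansion for integral time steps 0..horizon.
        `arcs` maps (u, v) to a time-dependent travel time function t -> tau(t)."""
        expanded = []
        for v, t in itertools.product(nodes, range(horizon)):
            expanded.append(((v, t), (v, t + 1)))             # waiting arcs
        for (u, v), tau in arcs.items():
            for t in range(horizon + 1):
                arrival = t + tau(t)
                if arrival <= horizon:
                    expanded.append(((u, t), (v, arrival)))   # travel arcs
        return expanded

    nodes = ["s", "a", "b", "t"]
    arcs = {("s", "a"): lambda t: 1, ("a", "b"): lambda t: 1 + (t % 2),
            ("b", "t"): lambda t: 2, ("s", "b"): lambda t: 3}
    E = time_expanded_network(nodes, arcs, horizon=10)
    print(len({x for e in E for x in e}), "time-node copies,", len(E), "arcs")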
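
The fixed degree sequence model used in (P10) also admits a compact illustration. The following sketch is a naive reference sampler, not the estimation technique of [26]: it randomizes a simple graph by double edge swaps that preserve the degree sequence while rejecting self-loops and multi-edges, so that an observed motif count (here: triangles) can be compared against the randomized null model.

    import random

    def triangles(edges):
        """Count triangles in a simple undirected graph given as an edge list."""
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v); adj.setdefault(v, set()).add(u)
        return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

    def fdsm_sample(edges, swaps=100000, seed=0):
        """Randomize a simple graph by double edge swaps; the degree sequence is
        preserved, and swaps creating self-loops or multi-edges are rejected."""
        rng = random.Random(seed)
        edge_list = [tuple(e) for e in edges]
        edge_set = {frozenset(e) for e in edge_list}
        for _ in range(swaps):
            i, j = rng.sample(range(len(edge_list)), 2)
            a, b = edge_list[i]
            c, d = edge_list[j]
            if rng.random() < 0.5:
                c, d = d, c
            if a == d or c == b:                                   # self-loop
                continue
            if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
                continue                                           # multi-edge
            edge_set -= {frozenset((a, b)), frozenset((c, d))}
            edge_set |= {frozenset((a, d)), frozenset((c, b))}
            edge_list[i], edge_list[j] = (a, d), (c, b)
        return [tuple(e) for e in edge_set]

    # Comparing triangles(g) with triangles(fdsm_sample(g)) indicates how surprising
    # the observed motif count is under the fixed degree sequence model.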

2.3 Optimization

Two projects concentrate on generic optimization methods that can be applied in many different scenarios:

  (P11)

    Scaling Up Generic Optimization J. Giesen and S. Laue (U Jena)

    Dealing with large-scale convex optimization problems, the project developed a generic optimization code generator (GENO) that is capable of producing generic, parallel, and distributed convex optimization software. Discrete and combinatorial big-data optimization problems can greatly benefit from GENO, as can machine learning, data analytics, and other fields of research such as network analysis. The GENO approach to generic optimization is based on an extension of the alternating direction method of multipliers by Giesen and Laue [16] (a toy instance of this method is sketched after this list) and is defined by a tight coupling of a modeling language and a generic solver. The modeling language allows the user to specify a class of (convex) optimization problems, and the generic solver is instantiated for the specified problem class. Comparing the code produced by GENO with state-of-the-art, hand-tuned, problem-specific implementations shows that GENO is faster and delivers better results (in terms of accuracy, or of objective function value for non-convex problems).

  (P12)

    Fast Inexact Combinatorial and Algebraic Solvers for Massive Networks H. Meyerhenke (U Köln)

    This project focuses on network analysis by means of three combinatorial optimization tasks with numerous applications: graph clustering, graph drawing, and network flow. Some of those applications are in the biological sciences, where most data sets are massive and contain inaccuracies. Hence, an inexact yet faster solution process based on approximation algorithms and heuristics is often useful. As an example, Bergamini and Meyerhenke [5] proposed the first betweenness centrality approximation algorithms with a provable bound on the approximation error for fully dynamic networks. Another important topic dealt with in (P12) concerns algebraic solvers. In 2016, Bergamini et al. [6] developed two algorithms that significantly accelerate current-flow computations for a single vertex or a reasonably small subset of vertices. The work also provides a reimplementation of the lean algebraic multigrid solver by Livne and Brandt [19] and is integrated into the open-source network analysis software NetworKit [28], which is freely available to the public (a short usage example follows after this list).
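
To give a flavour of the alternating direction method of multipliers on which GENO builds, the following NumPy sketch solves a small lasso problem with the textbook ADMM splitting. It is a hand-written toy with made-up data; GENO itself generates such solver code automatically from a problem stated in its modeling language.

    import numpy as np

    def lasso_admm(A, b, lam, rho=1.0, iters=200):
        """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by ADMM with splitting x/z."""
        n = A.shape[1]
        z, u = np.zeros(n), np.zeros(n)
        L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))   # reused in every x-update
        Atb = A.T @ b
        for _ in range(iters):
            x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
            z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # soft threshold
            u = u + x - z
        return z

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
    b = A @ x_true + 0.01 * rng.standard_normal(100)
    print(np.round(lasso_admm(A, b, lam=1.0), 2))   # recovers the planted sparse coefficients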
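
Since the algorithms developed in (P12) are shipped with NetworKit, a short usage example may be helpful. It assumes that the NetworKit Python package is installed and uses the static sampling-based betweenness approximation; class and parameter names are as in recent NetworKit releases, and the fully dynamic algorithms of [5] are provided by separate classes.

    import networkit as nk   # assumes: pip install networkit

    # Random test graph; betweenness is approximated with additive error at most
    # epsilon, with probability at least 1 - delta.
    G = nk.generators.ErdosRenyiGenerator(10000, 0.001).generate()
    approx = nk.centrality.ApproxBetweenness(G, epsilon=0.05, delta=0.1)
    approx.run()
    print(approx.ranking()[:5])   # top-5 (node, approximate score) pairs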

2.4 Security

Further projects investigate practical cryptographic schemes that do not degrade in big-data contexts, for example when the number of users and ciphertexts grows tremendously:

  (P13)

    Security-Preserving Operations on Big Data M. Fischlin (TU Darmstadt) and A. May (U Bochum)

    Protecting outsourced data in cloud storage and cloud computing scenarios, and more generally when big data is handled by third parties, is rather complicated, since standard cryptographic means such as encryption in general do not work here. This is caused by the very nature of encryption: by scrambling all meaningful information, encryption hides the semantics of the data, so third parties cannot use them to perform operations, while decrypting the data for these operations would violate the idea of protecting them from the service provider. Thus, project (P13) works on efficient operations on secured data, among others through the deployment of functional encryption and indistinguishability obfuscation, the certification of cryptographic primitives, and new algorithmic techniques for big cryptographic data. In 2017, Esser et al. [13] proposed new algorithms with small memory consumption for the Learning Parity with Noise (LPN) problem, both classically and quantumly (the problem itself is sketched after this list). By combining different advanced techniques they obtained a hybrid algorithm that achieves the best currently known running time for any fixed amount of memory.

  (P14)

    Scalable Cryptography D. Hofheinz (KIT Karlsruhe) and E. Kiltz (U Bochum)

    As mentioned before, our modern digital society relies on encryption and signature schemes for security. However, today’s cryptographic schemes do not scale well and are thus not suited for the increasingly large data sets they are used on. For instance, the security guarantees currently known for RSA encryption, an important type of encryption scheme, degrade linearly in the number of users and ciphertexts. Therefore, project (P14) aims to construct cryptographic schemes that scale well to large scenarios. The project has already developed several practical cryptographic schemes suitable for truly large settings, such as the first authenticated key exchange protocol whose security does not degrade with an increasing number of users or sessions [3], the first identity-based encryption scheme whose security properties do not degrade in the number of ciphertexts, and the first public-key encryption scheme for large scenarios that does not require a mathematical pairing [15] (which received the Best Paper Award at the EUROCRYPT 2016 conference). The last-mentioned scheme is based solely on a very standard computational assumption, namely the Decisional Diffie-Hellman assumption, and is nevertheless efficient.
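
The Learning Parity with Noise problem attacked in [13] is simple to state, which the following sketch illustrates: it generates noisy parity samples of a secret over GF(2) and recovers the secret by exhaustive search. The brute-force step is only feasible for tiny dimensions, and all parameters below are arbitrary toy values; the contribution of [13] is to do substantially better in running time and, in particular, in memory.

    import itertools, random

    def lpn_samples(secret, m, tau, rng):
        """m samples (a, <a, secret> + e mod 2) with Bernoulli(tau) noise e."""
        n = len(secret)
        samples = []
        for _ in range(m):
            a = [rng.randrange(2) for _ in range(n)]
            noise = 1 if rng.random() < tau else 0
            samples.append((a, (sum(ai * si for ai, si in zip(a, secret)) + noise) % 2))
        return samples

    def brute_force_lpn(samples, n):
        """Try all 2^n candidate secrets, keep the one agreeing with most samples."""
        best, best_score = None, -1
        for cand in itertools.product((0, 1), repeat=n):
            score = sum((sum(ai * ci for ai, ci in zip(a, cand)) % 2) == b
                        for a, b in samples)
            if score > best_score:
                best, best_score = cand, score
        return list(best)

    rng = random.Random(1)
    secret = [rng.randrange(2) for _ in range(12)]
    samples = lpn_samples(secret, m=200, tau=0.125, rng=rng)
    print(brute_force_lpn(samples, 12) == secret)   # True with high probability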

2.5 Text Applications

In spite of huge improvements over the last decade, efficiently mining big text data remains an important topic:

  (P15)

    Efficient Semantic Search on Big Data H. Bast (U Freiburg)

    Within the predecessor priority programme on algorithm engineering (2007–2013), H. Bast and her group developed semantic full-text search, a deep integration of full-text and ontology search. Their search engine Broccoli [4] is able to handle queries like “Astronauts who walked on the moon and who were born in 1925–1930” or “German researchers who work on algorithms”, where part of the required information is contained in ontologies, whereas other parts occur only in text documents. In (P15), they aim to scale semantic search to text collections and ontologies about 100 times larger than those handled by the original Broccoli engine, while increasing query quality at the same time. More details can be found in their article on a quality evaluation of combined search on a knowledge base and text included in this special issue.

  (P16)

    Massive Text Indices J. Fischer (TU Dortmund) and P. Sanders (KIT Karlsruhe)

    The world wide web, digital libraries, and biological sequences such as DNA or proteins all constitute large textual data that need to be stored, structured, searched, and compressed efficiently. The amount of such data has grown much faster than the storage and computation capacities of common desktop computers, by several orders of magnitude. Data structures for texts satisfying those needs, for instance suffix arrays or inverted indexes, are called text indexes and are the basic building block of all text-based applications, including well-known services like internet search engines (a toy suffix-array example is given after this list). Since algorithms and data structures for texts are fundamentally different from those for other kinds of data, project (P16) aims to develop its own algorithmic toolbox for large texts using both shared-memory and distributed-memory parallelism, focusing on general-purpose text indexes related to suffix arrays due to their applicability to any text type and their extended functionality. One step towards that goal was basic research on building blocks, yielding results such as an extensive journal paper [8] that studies practical parallel string sorting algorithms based on the most important classical sorting algorithms. Another important step was to build a prototype of a tool for implementing algorithms that process large data sets on distributed-memory machines. The result, Thrill [7], is based on C++ and offers a rich set of operations on distributed arrays such as map, reduce, sort, merge, and prefix-sum. It can fuse pipelines of local operations into tight loops optimized at compile time, considerably outperforming established tools such as Spark or Flink.
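
As a reminder of what a text index provides, the sketch below builds a suffix array naively and answers pattern queries by binary search. This is a toy for small strings; the point of (P16) is to construct such indexes for massive inputs with shared- and distributed-memory parallelism, which this code deliberately does not attempt.

    def suffix_array(text):
        """Naive construction by sorting all suffixes (fine for toy inputs only)."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def occurrences(text, sa, pattern):
        """All starting positions of `pattern`, via binary search on the suffix array."""
        lo, hi = 0, len(sa)
        while lo < hi:                                   # find the left boundary of matches
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        out = []
        while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
            out.append(sa[lo]); lo += 1
        return sorted(out)

    t = "mississippi"
    sa = suffix_array(t)
    print(sa)                         # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
    print(occurrences(t, sa, "iss"))  # [1, 4]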

2.6 Bio Applications

Similarly, new methods in bioinformatics are required, as the reduced cost of obtaining raw data results in a data flood that becomes increasingly hard to process:

  (P17)

    Graph-Based Methods for Rational Drug Design O. Koch and P. Mutzel (TU Dortmund)

    The development of a new drug is a complex and costly process. The identification and optimization of bioactive molecules is to a large extent supported by computer-based methods, e.g. the semi-automated classification of molecules and their functional relationships. This is a big-data problem, since the theoretical chemical space is estimated to contain around \(10^{62}\) molecules. Many approaches within rational drug design are based on the hypothesis that structurally similar molecules also show a similar biological effect. Common similarity measures use fast but inexact chemical fingerprints, yielding a high number of false positives. By proposing the Maximum Similar Subgraph (MSS) paradigm, an extension of the \({\mathcal {N}}{\mathcal {P}}\)-complete Maximum Common Subgraph problem that allows deviations with respect to similar bioactivity, project (P17) introduced an exact comparison method based on searching and clustering graph representations of molecules, where atoms are the vertices and bonds between the atoms are represented by edges. In 2017, Schäfer and Mutzel presented StruClus [25], a structural clustering algorithm for large-scale datasets of small labeled graphs based on the MSS paradigm. This algorithm achieves high-quality, human-interpretable clusterings, has a runtime linear in the number of graphs, and outperforms competing clustering algorithms. The project also continuously develops Scaffold Hunter [24], a flexible visual analytics framework for the analysis of chemical compound data. One application of this tool is to identify whole scaffolds that are exchangeable due to their similar shape; this leads to a reduced graph which allows for a more efficient MSS computation. Scaffold Hunter was initiated as a collaboration with the group of H. Waldmann (Max Planck Institute of Molecular Physiology, Dortmund).

  (P18)

    Algorithmic Foundations for Genome Assembly A. Srivastav (U Kiel), Th. Reusch (GEOMAR Kiel), and Ph. Rosenstiel (Uniklinikum Schleswig-Holstein)

    This is a joint project between A. Srivastav from the Department of Computer Science of Kiel University, T. Reusch from the GEOMAR Helmholtz Centre for Ocean Research Kiel, and Ph. Rosenstiel from the Institute of Clinical and Molecular Biology, University Medical Center Schleswig-Holstein. It deals with the genome assembly problem: given a large number of sequences of an unknown genome, called reads, which may contain errors, and perhaps some extra information, the task is to reconstruct the genome. The objectives of (P18) are the development of a comprehensive mathematical model for genome assembly as an optimization problem, the engineering and theoretical analysis of distributed and streaming assemblers as well as of distributed probabilistic data structures to hold intermediate information, the engineering of an assembler based on the maximum-likelihood method, and applications to marine species investigated in the group of Th. Reusch and to variant calling problems in the group of Ph. Rosenstiel. In 2017, Wedemeyer et al. [32] presented their read filtering algorithm Bignorm. They show how probabilistic data structures and biological parameters can be used to drastically reduce the amount of data prior to the assembly process (a toy filter in this spirit is sketched after this list) and demonstrate the effectiveness of their approach on the assembly of genomes of single-celled species.
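
To give an impression of how probabilistic data structures help before assembly, the following sketch, which is inspired by digital normalization rather than copied from Bignorm, tracks approximate k-mer abundances in a count-min sketch and drops reads whose k-mers are already well covered. The parameters are arbitrary example values, and Bignorm's actual filter additionally uses quality scores and further biological parameters.

    import hashlib
    from statistics import median

    class CountMinSketch:
        """Minimal count-min sketch: fixed memory, counts are never underestimated."""
        def __init__(self, width=1 << 16, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            for row in range(self.depth):
                digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
                yield row, int.from_bytes(digest, "big") % self.width

        def add(self, item):
            for row, col in self._cells(item):
                self.table[row][col] += 1

        def count(self, item):
            return min(self.table[row][col] for row, col in self._cells(item))

    def filter_reads(reads, k=21, coverage_cutoff=20):
        """Keep a read only if the median abundance of its k-mers seen so far is
        below the cutoff, i.e. the read still contributes new coverage."""
        cms, kept = CountMinSketch(), []
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if kmers and median(cms.count(km) for km in kmers) < coverage_cutoff:
                kept.append(read)
                for km in kmers:
                    cms.add(km)
        return kept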

2.7 Coordination Project

In addition to the research projects mentioned above, there is also a coordination project headed by U. Meyer. This project provides financial and organizational support for yearly colloquia of the whole priority programme, summer schools and smaller dedicated workshops and trainings, a guest programme, and gender equality measures. It also maintains the webpage of the priority programme https://www.big-data-spp.de/.

3 Scientific Output and SPP Collaborations

During the first funding period, the SPP not only published more than 150 peer-reviewed papers but also developed, extended, and maintained a number of software libraries, e.g.: Broccoli [4] for semantic search, GENO for generic optimization code generation, NetworKit [28] for network analysis, STXXL [11] for external-memory computing, and Thrill [7] for distributed batch data processing. The priority programme also creates visibility through its national and international events (e.g., summer/winter schools in Chennai 2016 and Tel Aviv 2017).

A particular feature of a priority programme is the intended collaboration between its participating researchers. The efficient generation of huge artificial input graphs for benchmarking turned out to be a highly active field of joint research: more than ten papers in this area have already been published by SPP members (see [22] for a recent overview), three of which are co-authored by members of different SPP projects: (P11) and (P12) consider faster generation of random hyperbolic graphs [31] (a naive reference generator for this model is sketched below), (P6) and (P12) propose how to generate scaled replicas of real-world complex networks [29], and (P6) and (P2) give improved generation algorithms for random graphs according to the FDSM.
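
For readers unfamiliar with the random hyperbolic graph model behind [31], the following sketch is a deliberately naive quadratic-time reference generator: it samples points in a hyperbolic disk and connects two points whenever their hyperbolic distance is at most the disk radius. The parameters are example values; the SPP work cited above is precisely about generating such graphs, and FDSM instances, at far larger scales and far higher speeds.

    import math, random

    def random_hyperbolic_graph(n, R, alpha=1.0, seed=0):
        """Threshold random hyperbolic graph: n points in a hyperbolic disk of
        radius R, with an edge for every pair at hyperbolic distance <= R."""
        rng = random.Random(seed)
        pts = []
        for _ in range(n):
            theta = rng.uniform(0, 2 * math.pi)
            # inverse of the radial CDF proportional to cosh(alpha*r) - 1
            r = math.acosh(1 + (math.cosh(alpha * R) - 1) * rng.random()) / alpha
            pts.append((r, theta))
        def dist(p, q):
            (r1, t1), (r2, t2) = p, q
            dt = math.pi - abs(math.pi - abs(t1 - t2))      # angular difference
            x = (math.cosh(r1) * math.cosh(r2)
                 - math.sinh(r1) * math.sinh(r2) * math.cos(dt))
            return math.acosh(max(1.0, x))
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if dist(pts[i], pts[j]) <= R]

    print(len(random_hyperbolic_graph(n=500, R=6.0)), "edges")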

Examples for other joint publications include sparsification methods for social networks [17] by (P6) and (P12), and improved parallel graph partitioning for complex networks [23] by (P7) and (P12).

The second funding period of the Big Data priority programme has just started, and most of the projects reviewed above also belong to the consortium of the second phase. Hence, we expect a number of further scientific results building on these established collaborations.