Distributed arrays: an algebra for generic distributed query processing

We propose a simple model for distributed query processing based on the concept of a distributed array. Such an array has fields of some data type whose values can be stored on different machines. It offers operations to manipulate all fields in parallel within the distributed algebra. The arrays considered are one-dimensional and just serve to model a partitioned and distributed data set. Distributed arrays rest on a given set of data types and operations called the basic algebra implemented by some piece of software called the basic engine. It provides a complete environment for query processing on a single machine. We assume this environment is extensible by types and operations. Operations on distributed arrays are implemented by one basic engine called the master which controls a set of basic engines called the workers. It maps operations on distributed arrays to the respective operations on their fields executed by workers. The distributed algebra is completely generic: any type or operation added in the extensible basic engine will be immediately available for distributed query processing. To demonstrate the use of the distributed algebra as a language for distributed query processing, we describe a fairly complex algorithm for distributed density-based similarity clustering. The algorithm is a novel contribution by itself. Its complete implementation is shown in terms of the distributed algebra and the basic algebra. As a basic engine the Secondo system is used, a rich environment for extensible query processing, providing useful tools such as main memory M-trees, graphs, or a DBScan implementation.


Introduction
Big data management has been a core topic in research and development for the last fifteen years. Its popularity was probably started by the introduction of the MapReduce paradigm [10] which allowed a simple formulation of data processing tasks by a programmer which are then executed in a highly scalable and fault tolerant way on a large set of machines. Massive data sets arise through the global scale of the internet with applications and global businesses such as Google, Amazon, Facebook. Other factors are the ubiquity of personal devices collecting and creating all kinds of data, but also the ever growing detail of scientific experiments and data collection, for example, in physics or astronomy, or the study of the human brain or the genome.
Dealing with massive data sets requires matching the size of the problem with a scalable amount of resources; therefore distributed and parallel processing is essential. Following MapReduce and its open source version Hadoop, many frameworks have been developed, for example, Hadoop-based approaches such as HadoopDB, Hive, and Pig; Apache Spark and Flink; and graph processing frameworks such as Pregel or GraphX.
All of these systems provide some model of the data that can be manipulated and a language for describing distributed processing. For example, MapReduce/Hadoop processes key-value pairs; Apache Spark offers resilient distributed data sets in main memory; Pregel manipulates nodes and edges of a graph in a node-centric view. Processing is described in terms of map and reduce functions in Hadoop; in an SQL-like style in Hive; by a set of operations on tables in Pig; by a set of operations embedded in a programming language environment in Spark; or by functions processing messages between nodes in Pregel.
In this paper, we consider the problem of transforming an extensible query processing system on a single machine (called the basic engine) into a scalable parallel query processing system on a cluster of computers. All the capabilities of the basic engine should automatically be available for parallel and distributed query processing, including extensions to the local system added in the future.
We assume the basic engine implements an algebra for query processing called the basic algebra. The basic algebra offers some data types and operations. The basic engine allows one to create and delete databases and within databases to create and delete objects of data types of the basic algebra. It allows one to evaluate terms (expressions, queries) of the basic algebra over database objects and constants and to return the resulting values to the user or store them in a database object.
The idea to turn this into a scalable distributed system is to introduce an additional algebra for distributed query processing into the basic engine, the distributed algebra.
The distributed system will then consist of one basic engine called the master controlling many basic engines called the workers. The master will execute commands provided by a user or application. These commands will use data types and operations of the distributed algebra. The types will represent data distributed over workers and the operations be implemented by commands and queries sent to the workers. The fundamental conceptual model and data structure to represent distributed data is a distributed array. A distributed array has fields of some data type of the basic algebra; these fields are stored on different computers and assigned to workers on these computers. Queries are described as mappings from distributed arrays to distributed arrays. The mapping of fields is described by terms of the basic algebra that can be executed by the basic engines of the workers. Further, the distributed algebra allows one to distribute data from the master to the workers, creating distributed arrays, as well as collect distributed array fields from the workers to the master.
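The conceptual model can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not the Secondo implementation: a distributed array is modeled as a plain list of fields, placement of fields on workers is left out, and the function names `distribute` and `collect` as well as the simplified signature of `dmap` are chosen here for illustration only.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def distribute(values: List[T], n_fields: int) -> List[List[T]]:
    """Master -> workers: spread values round-robin over n_fields fields."""
    fields: List[List[T]] = [[] for _ in range(n_fields)]
    for pos, v in enumerate(values):
        fields[pos % n_fields].append(v)
    return fields


def dmap(array: List[T], term: Callable[[T], U]) -> List[U]:
    """Map every field with a term of the (here: Python) basic algebra.

    The result is again an array with the same number of fields.
    """
    return [term(field) for field in array]


def collect(array: List[List[T]]) -> List[T]:
    """Workers -> master: concatenate all fields in field order."""
    return [v for field in array for v in field]


numbers = distribute(list(range(10)), n_fields=4)
squares = dmap(numbers, lambda field: [v * v for v in field])
counts = dmap(numbers, len)  # field type changes from list to int
```

Note that `dmap` does not care what a field is; the second call turns an array of lists into an array of integers, which mirrors the genericity of field types described above.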
These ideas have been implemented in the extensible DBMS Secondo which takes the role of the basic engine. The Secondo kernel is structured into algebra modules each providing some data types and operations; all algebras together form the basic algebra. Secondo provides query processing over the implemented algebras as described above for the basic engine. Currently there is a large set of algebras providing basic data types (e.g., integer, string, bool, ...), relations and tuples, spatial data types, spatio-temporal types, various index structures including B-trees, R-trees, M-trees; data structures in main memory for relations, indexes, graphs; and many others. The distributed algebra described in this paper has been implemented in Secondo.
In the main part of this paper we design the data types and operations of the distributed algebra and formally define their semantics.
To illustrate distributed query processing based on this model, we describe an algorithm for distributed density-based similarity clustering. That is, we show the "source code" to implement the algorithm in terms of the distributed algebra and the basic algebra.
The contributions of the paper are as follows:
- A generic algebra for distributed query processing is presented.
- Data types and operations of the algebra are designed and their semantics are formally defined.
- The implementation of the distributed algebra is explained.
- A novel algorithm for distributed density-based similarity clustering is presented and its complete implementation in terms of the distributed algebra is shown.
- An experimental evaluation of the framework shows excellent load balancing and good speedup.
The rest of the paper is structured as follows. Related work is described in Sect. 2. In Sect. 3, Secondo as a generic extensible DBMS is introduced, providing a basic engine and algebra. In Sect. 4, the distributed algebra is defined. Sect. 5 describes the implementation of this algebra in Secondo. In Sect. 6 we show the algorithm for distributed clustering and its implementation. A brief experimental evaluation of the framework is given in Sect. 7. Finally, Sect. 8 concludes the paper.

Related work
Our algebra for generic distributed query processing in Secondo has related work in the areas of distributed systems, distributed databases, and data analytics. In the application section of this paper, we present an algorithm for the density-based similarity clustering (see Sect. 6). The most related work in these areas is discussed in this section.

Distributed system coordination
Developing a distributed software system is a complex task. Distributed algorithms have to be coordinated on several nodes of a cluster. Apache ZooKeeper [29], HashiCorp Consul [8] and etcd [16] are software components used to coordinate distributed systems. These systems cover topics such as service discovery and configuration management. Although these components are used in many software projects, some distributed computing engines have implemented their own specialized resource management components (such as YARN, Yet Another Resource Negotiator [54], which is part of Hadoop).
In our distributed array implementation, we send the information to coordinate the system directly from the master node to the worker nodes. The worker nodes are manually managed in the current version of our implementation. Topics such as high availability or replication will be part of a further version.

General remarks
In this section we discuss systems for distributed analytical data processing.
A major distinction between those systems and the distributed algebra of this paper is genericity. The systems to be discussed all have some data model that is manipulated by operations, for example, tables or key-value pairs. In contrast, distributed algebra does not have any fixed data model. What is predetermined is the model of a distributed array which is just a simple abstraction of a partitioned (and distributed) data set. Furthermore, it is fixed that sets of tuples are used for data exchange. The field types of distributed arrays are absolutely generic.
This makes it possible to plug in a basic engine providing types and operations, if you want, a "legacy system", with all its data structures and capabilities. The clear separation between the algebra describing distributed query processing and the basic algebra (or engine) defining local data and query processing by workers is unique for our approach.
Our implementation of the distributed algebra so far uses Secondo as a basic engine; we are currently working on embedding other systems, PostgreSQL in particular.
Many systems provide some level of extensibility such as user defined data types and functions. However, embedding a complete basic engine is a quite different matter: we simply inherit everything there is, without doing extra work for every part, as in the case of extensibility. The basic engine Secondo is a rich environment developed over many years. Beyond standard relational query processing it has specialized "algebra modules" for spatial and spatio-temporal data, index structures such as R-trees, TBtrees, M-trees, symbolic trajectories, image and raster data, map matching algorithms, DBScan, and so forth. There are persistent as well as main memory data structures, allowing distributed in-memory processing.
Moreover, the scope of extensibility within the basic engine Secondo is much higher than in other systems. Whereas in most systems user defined types can be added at the level of attribute types in tables, the architecture of Secondo is designed around extensibility. A DBMS data model is implemented completely in terms of algebra modules. Hence one can add not only atomic data types but also any kind of representation structure such as an index, a graph, or a column-oriented relation representation, for example.
In the following, we refer to Distributed Algebra as DA and to Distributed Algebra with Secondo as a basic engine as DA/Secondo, respectively.

MapReduce and distributed file systems
In 2004, the publication of the MapReduce paper [10] proposed a new technique for the distributed handling of computational tasks. Using MapReduce, calculations are performed in two phases: (1) a map phase and (2) a reduce phase. These tasks are executed on a cluster of nodes in a distributed and fault-tolerant manner. Map and reduce steps are formulated directly in a programming language.
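The two phases can be illustrated with a small single-process sketch in Python. It only shows the programming model (map, group by key, reduce); distribution over nodes and fault tolerance are omitted, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(
    records: Iterable[str],
    map_fn: Callable[[str], List[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run a map phase, group the emitted pairs by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            groups[key].append(value)       # grouping ("shuffle")
    # reduce phase: one call per distinct key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every word of a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reduce(word: str, ones: List[int]) -> int:
    """Sum up the occurrences of one word."""
    return sum(ones)


word_counts = map_reduce(["a rose is a rose", "is it"], word_map, count_reduce)
```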
The Google File System (GFS) [21] and its open-source counterpart Hadoop File System (HDFS) [45] are distributed file systems. These file systems represent the backbone of the MapReduce frameworks; they are used for the input and output of large datasets. Stored files are split up into fixed-sized chunks and distributed across a cluster of nodes. To deal with failing nodes, the chunks can be replicated. Due to the architecture of the file systems, data are stored in an append-only manner.
To exploit node level data locality, the MapReduce Master Node tries to schedule jobs in a way that the chunks of the input data are stored on the node that processes the data [21, p. 5]. If the chunks are stored on another node, the data need to be transferred over a network, which is slow and time-consuming. HDFS addresses data locality only on chunks and not based on the value of the stored data, which can lead to performance problems [14].
In our distributed array implementation, data are directly assigned to the nodes in a way that data locality is exploited, and the amount of transferred data is minimized. The output of a query can be directly partitioned and transferred to the worker nodes that are responsible for the next processing step (see the discussion of the dfmatrix data type in Sect. 4). In addition, Secondo uses a type system. Before a query is executed, the query is checked for type errors. Therefore, in our distributed array implementation, the data are stored typed. On a distributed file system, data are stored as raw bytes. Type and structure information need to be implemented in the application that reads and writes the data.

Distributed databases and frameworks for analytic processing
HBase [27] (the open-source counterpart of BigTable [6]) is a distributed column-oriented database built on top of HDFS. HBase is optimized to handle large tables of data. Tables consist of rows of multiple column families (a set of key-value pairs). Internally, the data are stored in Sorted String Tables (SSTables) [6,41], which are handled by HDFS. HBase provides simple operations to access the data (e.g., get and scan) and does not support complex operations such as joins; MapReduce jobs can be used to process the data. The DA can of course operate on data that is stored across a cluster of systems in a distributed manner. In addition, DA/Secondo offers a wide range of operators (such as joins), which can be used to process the data.
Key-Value Stores such as Amazon Dynamo [11] or RocksDB [49] provide a simple data model consisting of a key and a value. Systems such as HBase, BigTable, or Apache Cassandra [32] provide a slightly more complex data model. A key can have multiple values; internally, the data are stored as key-value pairs on disk. The values in these implementations are limited to scalar data types such as int, byte, or string.
Apache Hive [50] is a warehousing solution built on top of Apache Hadoop, which provides an SQL-like query language to process data stored in HDFS. Hive contains only a limited set of operations.
Apache Pig [18] provides a query language (called Pig Latin [40]) for processing large amounts of data. With Pig Latin, users no longer need to write their own MapReduce programs; they write queries which are directly translated into MapReduce jobs. Pig Latin focuses primarily on analytical workloads.
Pig Latin provides an interesting data model built from atomic types and tuples, bags and maps. Tuples may have flexible schemas and may be nested. A program is expressed as a sequence of assignments to variables, applying one operation in each step. Operations such as FILTER, FOREACH ... GENERATE, or COGROUP, JOIN, ORDER, can be applied to distributed data sets.
Compared to DA, we find a fixed, not a generic data model. Operations on distributed data sets are tied to this model and perform implicit redistribution. In DA, we can nest operations (i.e., write sequences of operations) and we have a strict separation between distributed and local computation. Compared to DA/Secondo, of course, the set of available data structures and operations is much more limited. For example, Pig does not provide any indexes or index-based join algorithms.
Pigeon [13] is a spatial extension to Pig which supports spatial data types and operations. Pig was not extensible by atomic data types; any other type than number or string needed to be represented as bytearray. Hence the Pigeon extension represents spatial data types as Well-Known Text or Well-Known Binary exchange formats within Pig. Spatial functions need to convert from and to this format when working on the data.
Remarkable for a spatial extension is that there are no facilities for spatial indexing or spatial join. In the examples in [13], spatial join is expressed as cross product and filtering (CROSS and FILTER operators of PigLatin), a very inefficient evaluation. This is simply due to the fact that extensions by index structures or spatial join operators are not possible in the Pig framework.
In contrast, spatial data types, spatial indexing and spatial join are supported in DA/Secondo as demonstrated later in this paper.
The publication of the MapReduce paper created the foundation for many new applications and ideas to process distributed data. A widely used framework to process data in a scalable and distributed manner is Apache Spark [57]. In Spark, data are stored in resilient distributed datasets (RDDs) [56] in a distributed way across a cluster of nodes. In contrast to earlier work, RDDs can reside in memory in a fault-tolerant way. Hence Spark supports in particular distributed in-memory processing.
Spark defines an interesting set of operations on RDDs that may be compared to those of DA. For example, there is an operation (called transformation) map(f: T => U), parameterized by a function mapping values of type T into those of type U. Applied to an RDD[T], i.e., an RDD with partitions of type T, it returns an RDD[U].
One can see that RDDs are generic and parameterized by types, hence they are fairly similar to the distributed arrays of this paper. Also the map transformation corresponds to our dmap operator, introduced later.
Differences are that RDDs and operations on them are not formalized, especially the way fields of RDDs are mapped to workers and are remapped by operations is determined only by the implementation. In DA, field indices do play a role, are controlled by a programmer and are part of the formalization. It is unclear whether in Spark the number of fields of an RDD can be chosen independently from the number of workers, as in DA. This is relevant for load balancing as shown later in the paper.
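Why the number of fields matters for load balancing can be seen in a small worked example in Python. It is a sketch under assumptions that are not taken from the text above: field i is handled by worker i mod n (a rule assumed here only for illustration), and the cost figures are contrived. Finer partitioning does not guarantee balance; the example only shows that it can even out skew.

```python
from typing import List


def worker_loads(field_costs: List[int], n_workers: int) -> List[int]:
    """Total cost per worker, assuming field i is handled by worker i mod n_workers."""
    loads = [0] * n_workers
    for i, cost in enumerate(field_costs):
        loads[i % n_workers] += cost
    return loads


# The same skewed data set, partitioned in two ways for 2 workers.
fine = [6, 5, 1, 1, 1, 1, 1, 1]          # 8 fields
coarse = [sum(fine[:4]), sum(fine[4:])]  # 2 fields: [13, 4], same total work

coarse_loads = worker_loads(coarse, n_workers=2)
fine_loads = worker_loads(fine, n_workers=2)
```

With one field per worker the slower worker carries 13 of 17 cost units; with eight fields the maximum drops to 9.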
Another difference is that only some of the operations are generic; others assume types for key-value pairs or sequences. For example, groupByKey, join or cogroup transformations assume key-value pairs. In contrast, DA has only generic operations and all the transformation operations of RDDs can be expressed in DA/Secondo by combining DA operations with basic engine operations. How data are repartitioned is precisely defined in the DA.
Dryad [30] is a distributed execution engine developed at Microsoft. DryadLINQ [55] provides an interface for Dryad which can be consumed by Microsoft programming languages such as C#. In contrast to Secondo and our algebra implementation, the goal of Dryad is to provide a distributed environment for the parallel execution of user-provided programs, as Hadoop does. The goal of our implementation is to provide an extensible environment with a broad range of predefined operators that can be used to process data and which can also be enhanced with new operators by the user.
Another popular framework to process large amounts of data these days is Apache Flink [5]. This software system, originating from the Stratosphere Platform [1], is designed to handle batch and stream processing jobs. Processing batches (historical data or static data sets) is treated as a special form of stream processing. Data batches are processed in a time-agnostic fashion and handled as a bounded data stream. Like our system, Flink performs type checking and can be extended by user-defined operators and data types. However, Secondo ships with a larger number of operators and data types. For example, it can handle spatial and spatio-temporal data out of the box.
Parallel Secondo [33] and Distributed Secondo [38] are two existing approaches to execute queries in Secondo [24] in a distributed and parallel manner. Both approaches integrate an existing software component into Secondo to achieve distributed query execution. Parallel Secondo uses Apache Hadoop (the open source counterpart of the MapReduce framework) to distribute tasks over several Secondo installations on a cluster of nodes. Distributed Secondo uses Apache Cassandra as a distributed key-value store for the distributed storage of data, service discovery, and job scheduling. Both implementations use an additional component (Hadoop or Cassandra) to parallelize Secondo. The algebra for distributed arrays works without further components and provides the parallelization directly in Secondo.

Array databases and data frames
Array databases such as Rasdaman (raster data manager) [3], SciDB [47], or SciQL [58] focus on the processing of data cubes (multi-dimensional arrays). In addition to specialized databases, there are raster data extensions for relational database management systems such as PostGIS Raster [44] or Oracle GeoRaster [42]. Array databases are used to process data like maps (two dimensional) or satellite image time series (three dimensional).
Our distributed array implementation works with one-dimensional arrays. The array is just used to structure the data for the workers, representing a partitioned distributed data set. Array databases use the dimensions of the array to represent the location of the data in the n-dimensional space, which is a different concept. Secondo works with index structures (such as the R-Tree [26]) for efficient data access. In addition, in array databases, the values of the array cells are restricted to primitive or common SQL types like integers or strings. In our model and implementation, the data types of the fields can be any type provided by the basic engine, hence an arbitrary type available in Secondo.
Libraries for processing array structured data (also called data frames), such as Pandas [36] or NumPy [39], are widely used in scientific computing these days. Such libraries are used to apply operations such as filters, calculations, or mutations on array structured data. SciHadoop [4] uses Hadoop to process data arrays in a distributed and parallel way. SciHive [19] is a system that uses Hive to process array structured data. AFrame [46] is another implementation of a data frame library which is built on top of Apache AsterixDB [2]. The goal of the implementation is to process the data frames in a distributed manner and hide the complexity of the distributed system from the user. These libraries and systems are intended for direct integration into the source code. They simplify the handling of arrays, bring along data types and functions, and some also allow the distributed and parallel processing of arrays. Our system instead works with a query language to describe operator trees. Further, Secondo is an extensible database system that can be extended with new operators and data types by a user.
In [17] a Query Processing Framework for Large-Scale Scientific Data Analysis is proposed. Using the described framework, large amounts of data can be processed by using an SQL-like query language. This framework enhances the Apache MRQL [37] language in such a way that array data can be efficiently processed. MRQL uses components, such as Hadoop, Flink or Spark, for the execution of the queries. In contrast to our distributed array implementation, the paper focuses on the implementation of matrix operations to speed up algorithms to process the data arrays.

Clustering
In Sect. 6, we present an algorithm for distributed density-based similarity clustering. The main purpose of the section in the context of this paper is to serve as an illustration of distributed algebra as a language for formulating and implementing distributed algorithms. Nevertheless, the algorithm is a novel contribution by itself.
Density-based clustering, a problem introduced in [15], is a well established technology that has numerous applications in data mining and many other fields. The basic idea is to group together objects that have enough similar objects in their neighborhood. For an efficient implementation, a method is needed to retrieve objects close to a given object. The DBScan algorithm [15] was originally formulated for Euclidean spaces and supported by an R-tree index. But it can also be used with any metric distance function (see for example [31]) and then be supported by an M-tree [7].
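The basic idea can be sketched in Python for an arbitrary metric. This is a minimal illustration, not DBScan itself: a linear scan stands in for the R-tree or M-tree index, the Hamming distance on equal-length strings serves as an example metric, and the function names are hypothetical. As in the usual formulation, an object counts as a member of its own neighborhood.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def eps_neighbors(
    objects: Sequence[T], i: int, dist: Callable[[T, T], float], eps: float
) -> List[int]:
    """Indices of all objects within distance eps of objects[i] (linear scan)."""
    return [j for j, other in enumerate(objects) if dist(objects[i], other) <= eps]


def core_objects(
    objects: Sequence[T], dist: Callable[[T, T], float], eps: float, min_pts: int
) -> List[int]:
    """Indices of objects that have at least min_pts objects in their eps-neighborhood."""
    return [
        i
        for i in range(len(objects))
        if len(eps_neighbors(objects, i, dist, eps)) >= min_pts
    ]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; a metric on strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


words = ["cat", "bat", "hat", "dog", "cot"]
cores = core_objects(words, hamming, eps=1, min_pts=3)
```

Here "cat", "bat" and "hat" are mutually within distance 1 and therefore dense enough, whereas "dog" and "cot" are not.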
Here we only discuss algorithms for distributed density-based clustering. There are two main classes of approaches. The first can be characterized as (non-recursive) divide-and-conquer, consisting of the three steps: 1. Partition the data set. It is obvious that a spatial or similarity (distance-based) partitioning is needed for the problem at hand. Algorithms falling in this category are [9,28,43,53]. They differ in the partitioning strategy, the way neighbors from adjacent partitions are retrieved, and how local clusters are merged into global clusters. In [53] a global R-tree is introduced that can retrieve nodes across partition (computer) boundaries. The other algorithms [9,28,43] include in the partitioning overlap areas at the boundaries so that neighbors from adjacent partitions can be retrieved locally. [28] improves on [53] by determining cluster merge candidate pairs in a distributed manner rather than on the master. [9] strives to improve partitioning by placing partition boundaries in sparse areas of the data set. [43] introduces a very efficient merge technique based on a union-find structure.
These algorithms are all restricted to handle objects in vector spaces. Except for [53] they all have a problem with higher-dimensional vector spaces because in d dimensions 2 d boundary areas need to be considered.
A second approach is developed in [34]. This is based on the idea of creating a k-nearest-neighbor graph by a randomized algorithm [12]. This is modified to create edges between nodes if their distance is less than Eps, the distance parameter of density-based clustering. On the resulting graph, finding clusters corresponds to computing connected components.
This algorithm is formulated for a node-centric distributed framework for graph algorithms as given by Pregel [35] or GraphX [52]. In contrast to all algorithms of the first class, it can handle arbitrary symmetric distance (similarity) functions. However, the randomized construction of the kNN graph does not yield an exact result; therefore the result of clustering is also an approximation.
The algorithm of this paper, called SDC (Secondo Distributed Clustering), follows the first strategy but implements all steps in a purely distance-based manner. That is, we introduce a novel technique for balanced distance-based partitioning that does not rely on Euclidean space. The computation of overlap with adjacent partitions is based on a new distance-based criterion (Theorem 1). All search operations in partitioning or local DBScan use M-trees.
Another novel aspect is that merging clusters globally is viewed and efficiently implemented as computing connected components on a graph of merge tasks. Repeated binary merging of components is avoided.
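The view of global merging as a connected components problem can be illustrated in Python. This sketch is not the implementation of SDC, which is formulated in the distributed and basic algebras; it uses a union-find structure as a stand-in technique, and the representation of local clusters as (partition, local cluster id) pairs is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Tuple

ClusterId = Tuple[int, int]  # (partition number, local cluster number); hypothetical


def global_clusters(
    local_clusters: Iterable[ClusterId],
    merge_tasks: Iterable[Tuple[ClusterId, ClusterId]],
) -> Dict[ClusterId, ClusterId]:
    """Map each local cluster to the representative of its connected component."""
    parent: Dict[ClusterId, ClusterId] = {c: c for c in local_clusters}

    def find(x: ClusterId) -> ClusterId:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each merge task is an edge; all edges are processed in a single pass.
    for a, b in merge_tasks:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes representative
    return {c: find(c) for c in parent}


nodes: List[ClusterId] = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 1)]
tasks = [((0, 1), (1, 1)), ((1, 1), (2, 1))]
assignment = global_clusters(nodes, tasks)
```

The three local clusters linked by the two merge tasks end up in one global cluster, while the two unlinked ones remain on their own.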
Compared to algorithms of the first class, SDC is the only algorithm working with arbitrary metric similarity functions. Compared to [34] it provides an exact instead of an approximate solution.

A basic engine: Secondo
As described in the introduction, the concept of the Distributed Algebra rests on the availability of a basic engine, providing data types and operations for query processing. In principle, any local database system should be suitable. If it is extensible, the distributed system will profit from its extensibility.
The basic engine can be used in two ways: (i) it can provide query processing, and (ii) it can serve as an environment for implementing the Distributed Algebra. In our implementation, Secondo is used for both purposes.

Requirements for basic engines
The capabilities required from a basic engine to provide query processing are the following:
1. create and delete, open and close a database (where a database is a set of objects given by name, type, and value);
2. create an object in a database as the result of a query and delete an object;
3. offer a data type for relations and queries over it;
4. write a relation resulting from a query efficiently into a binary file or distribute it into several files;
5. read a relation efficiently from one or several binary files into query processing.
The capabilities (1) through (3) are obviously fulfilled by any relational DBMS. Capabilities (4) and (5) are required for data exchange and might require slight extensions, depending on the given local DBMS. In Sect. 4 we show how these capabilities are motivated by operations of the Distributed Algebra.

Secondo
In this section we provide a brief introduction to Secondo as a basic engine. It also shows an environment that permits a relatively easy implementation of the Distributed Algebra.
Secondo is a DBMS prototype developed at the University of Hagen, Germany, with a focus on extensible architecture and support of spatial and spatio-temporal (moving object) data. The architecture is shown in Fig. 1. There are three major components: the graphical user interface, the optimizer and the kernel, written in Java, Prolog, and C++, respectively. The kernel uses BerkeleyDB as a storage manager and is extensible by so-called algebra modules. Each algebra module provides some types (type constructors in general, i.e., parameterized types) and operations. The query processor evaluates expressions over the types of the available algebras. Note that the kernel does not have a fixed data model. Moreover, everything including relations, tuples, and index structures is implemented within algebra modules.
The data model of the kernel and its interface between system frame and algebra modules is based on the idea of second-order signature [22]. Here a first signature provides a type system, a second signature is defined over the types of the first signature. This is explained in more detail in Sect. 4.3.
To implement a type constructor, one needs to provide a (usually persistent) data structure and import and export functions for values of the type. To implement an operator, one needs to implement a type mapping function and a value mapping function, as the objects manipulated by operators are (type, value) pairs.
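The split into type mapping and value mapping can be sketched in Python. This is an illustration of the idea only and does not reproduce the C++ interface of Secondo algebra modules; the class `Operator`, the function `apply_operator`, and the representation of types as strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

TypedValue = Tuple[str, Any]  # a (type, value) pair


@dataclass(frozen=True)
class Operator:
    name: str
    type_mapping: Callable[[Tuple[str, ...]], str]  # argument types -> result type
    value_mapping: Callable[..., Any]               # argument values -> result value


def plus_type_mapping(arg_types: Tuple[str, ...]) -> str:
    """Accept two numeric arguments; the result is int only for int + int."""
    if len(arg_types) != 2 or not set(arg_types) <= {"int", "real"}:
        raise TypeError(f"+ expects two numeric arguments, got {arg_types}")
    return "int" if arg_types == ("int", "int") else "real"


PLUS = Operator("+", plus_type_mapping, lambda a, b: a + b)


def apply_operator(op: Operator, *args: TypedValue) -> TypedValue:
    """Type-check first; evaluate only if the type mapping succeeds."""
    result_type = op.type_mapping(tuple(t for t, _ in args))
    return result_type, op.value_mapping(*(v for _, v in args))


int_sum = apply_operator(PLUS, ("int", 3), ("int", 4))
mixed_sum = apply_operator(PLUS, ("int", 1), ("real", 0.5))
```

Because the type mapping runs before any value is touched, an ill-typed expression is rejected without executing the value mapping.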
A database is a pair (T, O) where T is a set of named types and O is a set of named objects. There are seven basic commands to manipulate such a generic database:

    type <identifier> = <type expression>
    delete type <identifier>
    create <identifier>: <type expression>
    update <identifier> := <value expression>
    let <identifier> = <value expression>
    delete <identifier>
    query <value expression>

Here a type expression is a term of the first signature built over the type constructors of available algebras. A value expression is a term of the second signature built by applying operations of the available algebras to constants and database objects.
The most important commands are let and query. let creates a new database object whose type and value result from evaluating a value expression. query evaluates an expression and returns a result to the user. Note that operations may have side effects such as updating a relation or writing a file. Some example commands are:

    let x = 5;
    query x;
    delete x;
    let inc = fun(x: int) x + 1;
    query inc;
    query inc(7);
    query 3 * 5;
    query Cities feed filter[.Name = "New York"] consume;
    query Cities_Name_btree Cities exactmatch["New York"] consume;

The first examples illustrate the basic mechanisms and that query just evaluates an arbitrary expression. The last two examples show that expressions can in particular be query plans as they might be created by a query optimizer. In fact, the Secondo optimizer creates such plans. Generally, query plans use pipelining or streaming to pass tuples between operators; here the feed operator creates a stream of tuples from a relation; the consume operator creates a relation from a stream of tuples. The exactmatch operator takes a B-tree and a relation and returns the tuples fulfilling the exact-match query specified by the third argument. Operators applied to types representing collections of data are usually written in postfix notation. Operator syntax is decided by the implementor. Note that the query processing operators used in the examples and in the main algorithm of this paper can be looked up in the Appendix.
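The pipelined evaluation of the feed, filter and consume plan can be mimicked with Python generators. This is an analogy only, not Secondo code: a relation is modeled as a list of dictionaries, the three functions are hypothetical stand-ins for the operators of the same name, and the sample tuples are made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # stand-in for a tuple of a relation


def feed(relation: List[Row]) -> Iterator[Row]:
    """Turn a stored relation into a stream of tuples."""
    yield from relation


def filter_(stream: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass on only the tuples fulfilling the predicate, one at a time."""
    return (row for row in stream if pred(row))


def consume(stream: Iterable[Row]) -> List[Row]:
    """Collect a stream of tuples into a relation again."""
    return list(stream)


cities = [
    {"Name": "New York", "Country": "USA"},
    {"Name": "Hagen", "Country": "Germany"},
]

# Counterpart of: query Cities feed filter[.Name = "New York"] consume;
result = consume(filter_(feed(cities), lambda row: row["Name"] == "New York"))
```

No intermediate relation is materialized between `feed` and `filter_`; tuples are requested one by one by `consume`, which corresponds to the streaming between operators described above.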
Obviously Secondo fulfills the requirements (1) through (3) stated for basic engines. It has been extended by operators for writing streams of tuples into (many) files and for reading a stream from files to fulfill (4) and (5).

The distributed algebra
The Distributed Algebra (technically in Secondo the Distributed2Algebra) provides operations that allow one Secondo system to control a set of Secondo servers running on the same or remote computers. It acts as a client to these servers. One can start and stop the servers, provided Secondo monitor processes are already running on the involved computers. One can send commands and queries in parallel and receive results from the servers.
The Secondo system controlling the servers is called the master and the servers are called the workers.
This algebra actually provides two levels for interaction with the servers. The lower level provides operations
- to start, check and stop servers
- to send sets of commands in parallel and see the responses from all servers
- to execute queries on all servers
- to distribute objects and files

Normally a user does not need to use operations of the lower level. The upper level is implemented using operations of the lower level. It essentially provides an abstraction called distributed array. A distributed array has slots of some type X which are distributed over a given set of workers. Slots may be of any Secondo type, including relations and indexes, for example. Each worker may store one or more slots.
Query processing is formulated by applying Secondo queries in parallel to all slots of distributed arrays which results in new distributed arrays. To be precise, all workers work in parallel, but each worker processes its assigned slots sequentially.
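The execution model just described can be pictured with a minimal Python sketch. It is only a model of the semantics, not the Secondo implementation: the function name dmap, the list-based slots, and the assignment list are assumptions made for illustration. Workers run in parallel, while each worker handles its own slots one after another.

```python
from concurrent.futures import ThreadPoolExecutor

def dmap(slots, assignment, num_workers, fn):
    """Apply fn to every slot; workers run in parallel,
    each worker handles its assigned slots sequentially."""
    def run_worker(w):
        # One worker processes its slots one after another.
        return {i: fn(slots[i]) for i in range(len(slots)) if assignment[i] == w}

    result = [None] * len(slots)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(run_worker, range(num_workers)):
            for i, value in partial.items():
                result[i] = value
    return result

slots = [[1, 2], [3], [4, 5, 6], []]   # four slots holding small "relations"
assignment = [0, 1, 0, 1]              # slot index -> worker index (hypothetical)
counts = dmap(slots, assignment, 2, len)
```

The result is again an array with one value per slot, which mirrors the way a query on a distributed array yields a new distributed array.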
Data can be distributed in various ways from the master into a distributed array. They can also be collected from a distributed array to be available on the master. In the following, we describe the upper level of the Distributed Algebra in terms of its data types and operations. We first provide an informal overview. In Sect. 4.3 the semantics of types and operations is defined formally and the use of operations is illustrated by examples.

Types
The algebra provides two types of distributed arrays called
- darray(X), distributed array, and
- dfarray(Y), distributed file array.

There exist also variants of these types called pdarray and pdfarray, respectively, where only some of the fields are defined (p for partial).
Here X may be any Secondo type and the respective values are stored in databases on the workers. In contrast, Y must be a relation type and the values are stored in binary files on the respective workers. In query processing, such binary files are transferred between workers, or between master and workers. Hence the main use of darray is for the persistent distributed database; the main use of dfarray and dfmatrix (explained below) is for intermediate results and shuffling of data between workers. Figure 2 illustrates both types of distributed arrays. Often slots are assigned in a cyclic manner to servers as shown, but there exist operations creating a different assignment. The implementation of a darray or dfarray stores explicitly how slots are mapped to servers. The type information of a darray or dfarray is the type of the slots, the value contains the number of slots, the set of workers, and the assignment of slots to workers.
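As a small illustration of the value of a distributed array (number of slots, set of workers, explicit slot assignment), the following Python sketch builds a cyclic assignment. The class name, field names, and worker labels are hypothetical and do not correspond to Secondo's C++ classes.

```python
from dataclasses import dataclass

@dataclass
class DistributedArray:
    num_slots: int
    workers: list          # worker labels; purely illustrative
    slot_to_worker: dict   # explicit mapping: slot index -> worker

def cyclic(num_slots, workers):
    # Slot i goes to worker i mod m, the cyclic scheme mentioned in the text.
    mapping = {slot: workers[slot % len(workers)] for slot in range(num_slots)}
    return DistributedArray(num_slots, workers, mapping)

arr = cyclic(5, ["w0", "w1"])
```

Because the mapping is stored explicitly rather than computed, other assignments (for example a balanced one) fit the same structure.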
A distributed array can be constructed by partitioning data on the master into partitions P 1 , ..., P m and then moving partitions P i into slots S i . This is illustrated in Fig. 3.
A third type offered is dfmatrix(Y), the distributed file matrix. Slots Y of the matrix must be relation-valued, as for dfarray. This type supports redistributing data which are partitioned in a certain way on workers already. It is illustrated in Fig. 4. The matrix arises when all servers partition their data in parallel. In the next step, each partition, that is, each column of the matrix, is moved into one slot of a distributed file array as shown in Fig. 5.

Operations
The following classes of operations are available:
- Distributing data to the workers
- Distributed processing by the workers
  - Applying a function (Secondo query) to each field of a distributed array
  - Applying a function to each pair of corresponding fields of two distributed arrays (supporting join)
  - Redistributing data between workers
  - Adaptive processing of partitioned data
- Collecting data from the workers

Distributing data to the workers
The following operations come in a d-variant and a df-variant (prefix). The d-variant creates a darray, the df-variant a dfarray.
ddistribute2, dfdistribute2
Distribute a stream of tuples on the master into a distributed array. Parameters are an integer attribute, the number of slots and a Workers relation. A tuple is inserted into the slot corresponding to its attribute value modulo the number of slots. See Fig. 3.

ddistribute3, dfdistribute3
Distribute a stream of tuples into a distributed array. Parameters are an integer i, a Boolean b, and the Workers. Tuples are distributed round robin into i slots, if b is true. Otherwise slots are filled sequentially, each to capacity i, using as many slots as are needed.

ddistribute4, dfdistribute4
Distribute a stream of tuples into a distributed array. Here a function instead of an attribute decides where to put the tuple.

share
An object of the master database whose name is given as a string argument is distributed to all worker databases.

dlet
Executes a let command on each worker associated with its argument array; it further executes the same command on the master. This is needed so that the master can do type checking on the query expressions to be executed by workers in following dmap operations.

dcommand
Executes an arbitrary command on each worker associated with its argument array.
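The slot selection of ddistribute2 and ddistribute4 can be sketched in a few lines of Python. This is a model under stated assumptions: tuples are dictionaries, the function names are invented, and applying the modulo to the function result in the ddistribute4 case is an assumption (the text states the modulo rule only for the attribute variant).

```python
def distribute_by_function(tuples, num_slots, slot_fn):
    """Model of ddistribute4: a function on the tuple selects the slot.
    Taking the result modulo num_slots is an assumption of this sketch."""
    slots = [[] for _ in range(num_slots)]
    for t in tuples:
        slots[slot_fn(t) % num_slots].append(t)   # input order is kept within a slot
    return slots

def distribute_by_attribute(tuples, attr, num_slots):
    """Model of ddistribute2: slot = integer attribute value mod number of slots."""
    return distribute_by_function(tuples, num_slots, lambda t: t[attr])

rows = [{"No": 7}, {"No": 12}, {"No": 5}]
slots = distribute_by_attribute(rows, "No", 5)
```

Here the tuples with No = 7 and No = 12 both land in slot 2, and the tuple with No = 5 lands in slot 0.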

Distributed processing by the workers
Operations:

dmap
Evaluates a Secondo query on each field of a distributed array of type darray or dfarray. Returns a dfarray if the result is a tuple stream, otherwise a darray. In a parameter query, one refers to the field argument by "." or $1. Sometimes it is useful to access the field number within a parameter query. For this purpose, all variants of dmap operators provide an extra argument within parameter functions. For dmap, one can refer to the field number by ".." or by $2.

dmap2
Binary variant of the previous operation mainly for processing joins. Always two fields with the same index are arguments to the query. One refers to field arguments by "." and "..", respectively, the field number is the next argument, $3.

dmap3, ..., dmap8
Variants of dmap for up to 8 argument arrays. One can refer to fields by ".", "..", or by $1, ..., $8.

pdmap, ..., pdmap8
Variants of dmap which take as an additional first argument a stream of slot numbers and evaluate parameter queries only on those slot numbers. They return a partial darray or dfarray (pdarray or pdfarray) where unevaluated fields are undefined.

dproduct
Arguments are two darrays or dfarrays with relation fields. Each field of the first argument is combined with the union of all fields of the second argument. Can be used to evaluate a Cartesian product or a generic join with an arbitrary condition. No specific partitioning is needed for a join. But the operation is expensive, as all fields of the second argument are moved to the worker storing the field of the first argument.

partition, partitionF
Partitions the fields of a darray or dfarray by a function (similar to ddistribute4 on the master). Result is a dfmatrix. An integer parameter decides whether the matrix will have the same number of slots as the argument array or a different one. Variant partitionF allows one to manipulate the input relation of a field, e.g., by filtering tuples or by adding attributes, before the distribution function is applied. See Fig. 4.

collect2, collectB
Collect the columns of a dfmatrix into a dfarray. See Fig. 5. The variant collectB assigns slots to workers in a balanced way, that is, the sum of slot sizes per worker is similar. Some workers may have more slots than others. This helps to balance the work load for skewed partition sizes.

areduce
Applies a function (Secondo query) to all tuples of a partition (column) of a dfmatrix. Here it is not predetermined which worker will read the column and evaluate it. Instead, when the number of slots s is larger than the number of workers m, then each worker i gets assigned slot i, for i = 0, ..., m − 1. From then on, the next worker which finishes its job will process the next slot. This is very useful to compensate for speed differences of machines or size differences in assigned jobs.

areduce2
Binary variant of areduce, mainly for processing joins.
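To make the idea of a balanced slot assignment concrete, here is a minimal Python sketch. The text only states the goal of collectB (similar sums of slot sizes per worker); the greedy "largest slot first, least loaded worker" heuristic used below is an assumption chosen for illustration and is not claimed to be the strategy implemented in Secondo.

```python
import heapq

def balanced_assignment(slot_sizes, num_workers):
    """Greedy sketch: give the next-largest slot to the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]   # entries are (current load, worker)
    assignment = {}
    for slot in sorted(range(len(slot_sizes)), key=lambda s: -slot_sizes[s]):
        load, w = heapq.heappop(heap)
        assignment[slot] = w
        heapq.heappush(heap, (load + slot_sizes[slot], w))
    return assignment

sizes = [90, 10, 10, 10, 30, 30]   # skewed partition sizes (hypothetical)
assignment = balanced_assignment(sizes, 2)
loads = [sum(sizes[s] for s, w in assignment.items() if w == k) for k in range(2)]
```

In this example one worker receives only the large slot while the other receives the five small ones, so the loads are equal although the slot counts differ.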

Collecting data from the workers
Operations:

dsummarize
Collects all tuples (or values) from a darray or dfarray into a tuple stream (or value stream) on the master. Works also for pdarray and pdfarray.

getValue
Converts a distributed array into a local array. Recommended only for atomic field values; may otherwise be expensive.

getValueP
Variant of getValue applicable to pdarray or pdfarray. Provides a parameter to replace undefined values in order to return a complete local array on the master.

tie
Applies aggregation to a local array, e.g., to determine the sum of field values. (An operation not of the Distributed2Algebra but of the ArrayAlgebra in Secondo.)

Formal definition of the distributed algebra
In this section, we formally define the syntax and semantics of the Distributed Algebra. We also illustrate the use of operations by examples. Formally, a system of types and operations is a (many-sorted) algebra. It consists of a signature which provides sorts and operators, defining for each operator the argument sorts and the result sort. A signature defines a set of terms. To define the semantics, one needs to assign carrier sets to the sorts and functions to the operators that are mappings on the respective carrier sets. The signature together with carrier sets and functions defines the algebra.
We assume that data types are built from some basic types and type constructors. The type system is itself described by a signature [22]. In this signature, the sorts are so-called kinds and the operators are type constructors. The terms of the signature are exactly the available types of the type system.
For example, consider the signature shown in Fig. 6. It has kinds BASE and ARRAY and type constructors int, real, bool, and array. The types defined are the terms of the signature, namely, int, real, bool, array(int), array(real), array(bool). Note that basic types are just type constructors without arguments.

Types
The Distributed Algebra has the type system shown in Fig. 7.
Here BASIC is a kind denoting the complete set of types available in the basic engine; REL is the set of relation types of that engine. In our implementation BASIC corresponds to the data types of Secondo. The type constructors build distributed array and matrix types.

Semantics of types are their respective domains or carrier sets, in algebraic terminology, denoted A_t for a type t.
Let α be a type of the basic engine, α ∈ BASIC, and let WR be the set of possible (non-empty) worker relations.
The carrier set of darray is defined such that the value of a distributed array with fields of type α consists of an integer n, defining the number of fields (slots) of the array, a set of workers W, a function f which assigns to each field a value of type α, and a mapping g describing how fields are assigned to workers.
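One way to write down a carrier set that matches this description is sketched below. It is a reconstruction from the prose only: the order of the components, the index range 0, ..., n − 1 (chosen to agree with the slot numbering used for areduce), and the codomain of g are assumptions.

```latex
A_{\mathit{darray}(\alpha)} =
  \{\, (n, W, f, g) \mid
       n \in \mathbb{N},\;
       W \in \mathit{WR},\;
       f : \{0, \dots, n-1\} \to A_{\alpha},\;
       g : \{0, \dots, n-1\} \to W \,\}
```

Read this way, f carries the slot contents and g the slot-to-worker assignment; for the partial types pdarray and pdfarray both would be partial functions.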
The carrier set of type dfarray is defined in the same way; the only difference is that α must be a relation type, α ∈ REL. This is because fields are stored as binary files and this representation is available only for relations.
Types pdarray and pdfarray are also defined similarly; here the difference is that f and g are partial functions.
Let α ∈ REL. The carrier set of dfmatrix describes a matrix with m rows and n columns where each row defines a partitioning of a set of tuples at one worker and each column a logical partition, as illustrated in Fig. 4.
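The matrix view can be illustrated with a short Python sketch of repartitioning: every worker splits its local tuples into n buckets (one matrix row per worker), and collecting column j across all rows yields slot j of the new array. The function names, the list representation, and the modulo applied to the partitioning function are assumptions of this sketch.

```python
def partition(worker_relations, num_columns, h):
    """Each worker partitions its local tuples; the result has one row per worker."""
    matrix = []
    for relation in worker_relations:
        row = [[] for _ in range(num_columns)]
        for t in relation:
            row[h(t) % num_columns].append(t)
        matrix.append(row)
    return matrix

def collect(matrix):
    """Column j of the matrix (taken over all workers) becomes slot j of the result."""
    num_columns = len(matrix[0])
    return [[t for row in matrix for t in row[j]] for j in range(num_columns)]

workers = [[1, 2, 3, 4], [5, 6, 7]]          # local data of two workers (hypothetical)
matrix = partition(workers, 3, lambda t: t)
slots = collect(matrix)
```

The matrix here has m = 2 rows and n = 3 columns; after collecting, all tuples with the same value of h modulo 3 sit in the same slot, regardless of the worker they came from.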
The array type of the basic engine is defined as follows:

Operations for distributed processing by workers
Here we define the semantics of operators of Section 4.2.2. For each operator op, we show the signature and define a function f op from the carrier sets of the arguments to the carrier set of the result. All operators taking darray arguments also take dfarray arguments. All dmap, pdmap and areduce operators may return either darrays or dfarrays. The result type depends on the resulting field type: If it is a stream(tuple((α))) type, then the result is a dfarray, otherwise a darray. Hence in writing a query, the user can decide whether a darray or a dfarray is built by applying consume to a tuple stream for a field or not.
We omit these cases in the sequel, showing the definitions only for darray, to keep the formalism simple and concise.
To illustrate the use of operators, we introduce an example database with spatial data as provided by OpenStreetMap [48] and GeoFabrik [20]. We use example relations with the following schemas, originally on the master. Such data can be obtained for many regions of the world at different scales like continents, states, or administrative units.

The first argument to dmap is the distributed array, the second a string, and the third the function to be applied. In the function, the "." refers to the argument. The string argument is omitted in the formal definition. In the implementation, it is used to name objects in the worker databases; the name has the form <name>_<slot_number>, for example, RoadsD30_5. One can give an empty string in a query where the intermediate result on the workers is not needed any more; in this case a unique name for the object in the worker database is generated automatically.
The result is a distributed array RoadsD30 where each field contains a relation with the roads having speed limit 30.
Note that the two arrays must have the same size and that the mapping of slots to workers is determined by the first argument. In the implementation, slots of the second argument assigned to different workers than for the first argument are copied to the first argument worker for execution.

Example 2
Using dmap2, we can formulate a spatial join on the distributed tables RoadsD and WaterwaysD. It is necessary that both tables are spatially co-partitioned so that joins can only occur between tuples in a pair of slots with the same index. In Sect. 4.3.3 it is shown how to create partitions in this way.
"Count the number of intersections between roads and waterways." Here for each pair of slots an itSpatialJoin operator is applied to the respective pair of (tuple streams from) relations. It joins pairs of tuples whose bounding boxes overlap. In the following refinement step, the actual geometries are checked for intersection. 4 The notation {r} is a renaming operator, appending _r to each attribute in the tuple stream.
The additional argument myPort is a port number used in the implementation for data transfer between workers.
Further operations dmap3, ..., dmap8 are defined in an analogous manner. For all these operators, the mapping from slots to workers is taken from the first argument and slots from other arguments are copied to the respective workers.
Here a stream of integer values is modeled formally as a sequence of integers. Operator pdmap can be used if it is known that only certain slots can yield results in an evaluation; for an example use see [51]. The operators pdmap2, ..., pdmap8 are defined similarly; as for dmap operators, slots are copied to the first argument workers if necessary.
The dproduct operator is defined for two distributed relations. It is not required that the two arrays of relations have the same size (number of slots). Each relation in a slot of the first array is combined with the union of all relations of the second array. This is needed to support a general join operation for which no partitioning exists that would support joins on pairs of slots. In the implementation, all slots of the second array are copied to the respective worker for a slot of the first array. For this, again a port number argument is needed.

Before applying the dproduct operator, we reduce to named roads, eliminate duplicates from spatial partitioning, and project to the relevant attributes. Then for all pairs of named roads, the edit distance of the names is determined by the ldistance operator and required to lie between 1 and 2. The symmjoin operator is a symmetric variant of a nested loop join. The filter condition after the symmjoin avoids reporting the same pair twice.
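The semantics of dproduct can be modeled in a few lines of Python: each slot of the first array is combined with the union of all slots of the second array. The function name and the list-based representation are assumptions; the join function below is a plain nested loop standing in for an arbitrary join condition.

```python
def dproduct(first, second, fn):
    """Combine every slot of `first` with the union of all slots of `second`."""
    union = [t for slot in second for t in slot]
    return [fn(slot, union) for slot in first]

def equal_pairs(r, s):
    # Nested-loop join with an arbitrary condition (here: equality).
    return [(a, b) for a in r for b in s if a == b]

first = [[1, 2], [3]]        # two slots
second = [[2], [3, 4], [1]]  # three slots; sizes need not match
joined = dproduct(first, second, equal_pairs)
```

The result has as many slots as the first argument, which reflects that the second argument is the one being copied to the workers.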
Here the union of all relations assigned to worker j is redistributed according to function h. See Fig. 4.

The variant partitionF allows one to apply an additional mapping to the argument relations before repartitioning. The definition of its function is a slight extension of the one for f_partition and is omitted.

Assuming that roads are partitioned spatially, we need to repartition by names before executing the join. Here after repartitioning, the self-join can be performed locally for each slot. Assuming the result is relatively small, it is collected on the master by dsummarize.
Semantically, areduce is the same as collect2 followed by a dmap. In collecting the columns from the different servers (workers), a function is applied. The reason to have a separate operator and, indeed, the dfmatrix type as an intermediate result, is the adaptive implementation of areduce. Since the data of a column of the dfmatrix need to be copied between computers anyway, it is possible to assign any free worker to do that at no extra cost. As for collectB, the value of function g is not defined for areduce because the assignment of slots to workers cannot be predicted.

Example 5
The previous query written with areduce is: Here within partitionF "." refers to the relation and ".." refers to the tuple argument in the first and second argument function, respectively.
The binary variant areduce2 is defined analogously. The formal definition of its semantics is similar to that of areduce and is omitted.

Operations for distributing data to the workers
The operators ddistribute2, ddistribute3, and ddistribute4 and their dfdistribute variants distribute data from a tuple stream on the master into the fields of a distributed array. We define the first and second of these operators which distribute by an attribute value and randomly, respectively. Operator ddistribute4 distributes by a function on the tuple which is similar to ddistribute2.
Let WR denote the set of possible worker relations (of a relation type). For a tuple type tuple(α) let attr(α, β) denote the name of an attribute of type β. Such an attribute a represents a function attr_a on a tuple t so that attr_a(t) is a value of type β.

ddistribute2 : stream(tuple(α)) × attr(α, int) × int × WR → darray(rel(tuple(α)))

In the following, we use the notation ⟨s_1, ..., s_n | f(s_i)⟩ to restrict a sequence to the elements s_i for which f(s_i) = true. Functions f(s_i) are written in λx.expr(x) notation.
Hence the attribute a determines the slot that the tuple is assigned to. Note that all ddistribute operators maintain the order of the input stream within the slots.
The operator distributes tuples of the input stream either round robin to the n slots of a distributed array, or by sequentially filling each slot except the last one with n elements, depending on parameter b.
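The two modes of ddistribute3 can be sketched as follows in Python. The function name and the list representation of the stream are assumptions; the behavior follows the description above (round robin into a fixed number of slots, or sequential filling to a fixed capacity with as many slots as needed).

```python
def distribute3(tuples, i, round_robin):
    """Model of ddistribute3 for a list of tuples.
    round_robin=True : spread tuples cyclically over exactly i slots.
    round_robin=False: fill slots sequentially, each up to capacity i."""
    if round_robin:
        slots = [[] for _ in range(i)]
        for pos, t in enumerate(tuples):
            slots[pos % i].append(t)
        return slots
    return [tuples[k:k + i] for k in range(0, len(tuples), i)]

data = list(range(7))
rr = distribute3(data, 3, True)
seq = distribute3(data, 3, False)
```

In both modes the order of the input stream is preserved within each slot; in the sequential mode only the last slot may hold fewer than i elements.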

Example 6
We distribute the Buildings relation round robin into 50 fields of a distributed array. A relation Workers is present in the database.

The distribution is based on a regular grid as shown in Fig. 8. A spatial object is assigned to all cells intersecting its bounding box. The cellnumber operator returns a stream of integers, the numbers of grid cells intersecting the first argument, a rectangle. The extendstream operator makes a copy of the input tuple for each such value, extending it by an attribute Cell with this value. So we get a copy of each road tuple for each cell it intersects. The cell number is then used for distribution.
In some queries on the distributed Roads relation we want to avoid duplicate results. For this purpose, the tuple with the first cell number is designated as original. See Example 3.
The relation Waterways is distributed in the same way. So RoadsD and WaterwaysD are spatially co-partitioned, suitable for spatial join (Example 2).
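The grid-based replication used for spatial co-partitioning can be sketched in Python as follows. This is an illustration only: the grid origin at (0, 0), the row-major cell numbering starting at 0, and the attribute names Box, Cell, and Original are assumptions and need not match the conventions of Secondo's cellnumber operator.

```python
import math

def cell_numbers(bbox, cell_size, cells_per_row):
    """Numbers of all grid cells intersecting bbox = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    for row in range(math.floor(ymin / cell_size), math.floor(ymax / cell_size) + 1):
        for col in range(math.floor(xmin / cell_size), math.floor(xmax / cell_size) + 1):
            yield row * cells_per_row + col

def extend_by_cell(tuples, cell_size, cells_per_row):
    """One copy of each tuple per intersected cell; the first copy is the original."""
    for t in tuples:
        for k, cell in enumerate(cell_numbers(t["Box"], cell_size, cells_per_row)):
            yield {**t, "Cell": cell, "Original": k == 0}

roads = [{"Name": "A", "Box": (0.5, 0.5, 1.5, 0.5)}]
copies = list(extend_by_cell(roads, cell_size=1.0, cells_per_row=4))
```

The road spans two cells and is therefore replicated twice; the Original flag marks one copy so that queries can avoid counting the same road more than once.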
The following two operators serve to have the same objects available in the master and worker databases. Operator share copies an object from the master to the worker databases whereas dlet creates an object on master and workers by a query function.
share : string × bool × WR → text

The Boolean parameter specifies whether an object already present in the worker database should be overwritten. WR defines the set of worker databases.
The semantics for such operators can be defined as follows. These are operations affecting the master and the worker databases, denoted as M and D 1 , ..., D m , respectively. A database is a set of named objects where each name n is associated with a value of some type, hence a named object has structure (n, (t, v)) where t is the type and v the value. A query is a function on a database returning such a pair. Technically, an object name is represented as a string and a query as a text.
We define the mapping of databases denoted δ. Let (n, o) be an object in the master database. Note that a database object mentioned in a parameter function (query) of dmap must be present in the master database, because the function is type checked on the master. It must be present in the worker databases as well because these functions are sent to workers and type checked and evaluated there.
Whereas persistent database objects can be copied from master to worker databases, this is not possible for main memory objects used in query processing. Again, such objects must exist on the master and on the workers because type checking is done in both environments. This is exactly the reason to introduce the following dlet operator.
The dlet operator creates a new object by a query simultaneously on the master and in each worker database. The darray argument serves to specify the relevant set of workers. The operator returns a stream of tuples reporting success or failure of the operation for the master and each worker. Let n be the name and μ the query argument. An example for dlet is given in Sect. 6.5.
The dcommand operator lets an arbitrary command be executed by each worker. The command is given as a text argument. The darray argument defines the set of workers. The result stream is like the one for dlet.

Example 9
To configure for each worker how much space can be used for main memory data structures, the following command can be used:

query RoadsD dcommand[query meminit(4000)] consume

Operations for collecting data from workers
The operator dsummarize can be used to make a distributed array available as a stream of tuples or values on the master whereas getValue transforms a distributed into a local array.

dsummarize : darray(rel(tuple(α))) → stream(tuple(α))

The operator is overloaded. For the two signatures, the semantics definitions are:

The operator getValueP allows one to transform a partial distributed array into a complete local array on the master.
Finally, the tie operator of the basic engine is useful to aggregate the fields of a local array.
Example 10
Let X be an array of integers. Then

query X tie[. + ..]

computes their sum. Here "." and ".." denote the two arguments of the parameter function which could also be written as

query X tie[fun(x: int, y: int) x + y]

Further, Examples 2 and 4 demonstrate the use of these operators.
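The collecting operators can be modeled in Python as shown below. The function names mirror the operators, but the code is only a sketch of their semantics on plain lists: an undefined slot of a partial array is represented by None, which is an assumption of this model.

```python
from functools import reduce

def dsummarize(slots):
    """Concatenate the contents of all defined slots into one stream."""
    for slot in slots:
        if slot is not None:
            yield from slot

def get_value_p(slots, default):
    """Turn a partial array into a complete local array by filling undefined slots."""
    return [default if value is None else value for value in slots]

def tie(array, fn):
    """Aggregate the fields of a local array with a binary function."""
    return reduce(fn, array)

stream = list(dsummarize([[1, 2], None, [3]]))
local = get_value_p([4, None, 6], 0)
total = tie(local, lambda x, y: x + y)
```

The last line corresponds to the sum computed by tie in Example 10, applied here to a local array obtained from a partial distributed array.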

Final remarks on the distributed algebra
Whereas the algebra has operations to distribute data from the master to the workers, this is not the only way to create a distributed database. For huge databases, this would not be feasible, the master being a bottleneck. Instead, it is possible to create a distributed array "bottom-up" by assembling data already present on the worker computers. They may have got there by file transfer or by use of a distributed file system such as HDFS [45]. One can then create a distributed array by a dmap operation that creates each slot value by reading from a file present on the worker computer. Further, it is possible to create relations (or any kind of object) in the worker databases, again controlled by dmap operations, and then to collect these relations into the slots of a distributed array created on top of them. This is provided by an operation called createDarray, omitted here for conciseness. Examples can be found in [23,51].
Note that any algorithm that can be specified in the MapReduce framework can easily be transferred to Distributed Algebra, as map steps can be implemented by dmap, shuffling between map and reduce stage is provided by partition and collect or areduce operations, and reduce steps can again be implemented by dmap (or areduce) operations.
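The correspondence with MapReduce can be illustrated by a word count written in terms of Python stand-ins for dmap, partition, and collect. All three functions are simplified models on lists rather than the Secondo operators, and partitioning by the first character of the word is an arbitrary choice for this sketch.

```python
from collections import Counter

def dmap(slots, fn):
    return [fn(slot) for slot in slots]

def partition(slots, num_columns, h):
    matrix = []
    for slot in slots:
        row = [[] for _ in range(num_columns)]
        for t in slot:
            row[h(t) % num_columns].append(t)
        matrix.append(row)
    return matrix

def collect(matrix):
    return [[t for row in matrix for t in row[j]] for j in range(len(matrix[0]))]

docs = [["a b a"], ["b c"]]                       # two slots of text lines
# Map step: emit (word, 1) pairs per slot.
pairs = dmap(docs, lambda lines: [(w, 1) for line in lines for w in line.split()])
# Shuffle: repartition by key so equal words meet in the same slot.
shuffled = collect(partition(pairs, 2, lambda kv: ord(kv[0][0])))
# Reduce step: count words locally in each slot.
counts = dmap(shuffled, lambda kvs: dict(Counter(w for w, _ in kvs)))
```

After the shuffle every word occurs in exactly one slot, so the per-slot counts are already the global counts.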
An important feature of the algebra design is that the number of slots of a distributed array may be chosen independently from the number of workers. This allows one to assign different numbers of slots to each worker and so to compensate for uneven partitioning or more generally to balance work load over workers, as it is done in operators collectB, areduce, and areduce2.

Implementing an algebra in Secondo
To implement a new algebra, data types and operators working on them must be provided. For each data type, a C++ class describing the type's structure and some functions for the interaction with Secondo must be provided. In the context of this article, the Secondo supporting functions are less important. They can be found, e.g., in the Secondo Programmer's Guide [25].
An operator implementation consists of several parts. The two most important ones are the type mapping and the value mapping. Other parts provide a description for the user or select different value mapping implementations for different argument types.
The main task of the type mapping is to check whether the operator can handle the provided argument types and to compute the resulting type. Optionally further arguments can be appended. This may be useful for default arguments or to transfer information that is available in the type mapping only to the value mapping part.
Within the value mapping, the operator's functionality is implemented, in particular the result value is computed from the operator's arguments.

Structure of the main types
All information about the subtypes, i.e. the types stored in the single slots, is handled by the Secondo framework and hence not part of the classes representing the distributed array types.
The array classes of the Distributed2Algebra consist of a label (string), a defined flag (bool), and connection information (vector). Furthermore, an additional vector holds the mapping from the slots to the workers. The label is used to name objects or files on the workers. An object corresponding to slot X of a distributed array labeled with myarray is stored as myarray_X. The defined flag is used in case of errors. The connection information corresponds to the schema of the worker relation that is used during the distribution of a tuple stream. In particular, each entry in this vector consists of the name of the server, the server's port, an integer corresponding to the position of the entry within the worker relation, and the name of a configuration file. This information is collected in a vector of DArrayElement.
The partial distributed arrays (arrays of type pdarray or pdfarray) have an additional member of type set<int> storing the set of used slot numbers.
The structure of a dfmatrix is quite similar to the distributed array types. Only the mapping from the slots to the workers is omitted. Instead the number of slots is stored. Figure 9 shows a simplified class diagram of the array classes provided by the Distributed2Algebra.

Class hierarchy of array classes
Note that the non-framed parts are not really classes but type definitions only, e.g., the definition of the darray type is just typedef DArrayT<DARRAY> DArray;.

Worker connections
The connections to the workers are realized by a class ConnectionInfo. This class basically encapsulates a client interface to a Secondo server and provides thread-safe access to this server. Furthermore, this class supports command logging and contains some functions for convenience, e.g., a function to send a relation to the connected server.

Fig. 9 The class hierarchy for distributed array classes

Instances of this class are part of the Distributed2Algebra instance. If a connection is requested by some operation, an existing one is returned. If no connection is available for the specified worker, a connection is established and inserted into the set of existing connections. Connections will be held until closing is explicitly requested or the master is finished. This avoids the time-consuming start of a new worker connection.
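The reuse of worker connections described above follows a common caching pattern, sketched here in Python. The class ConnectionPool and its methods are hypothetical and do not reflect the interface of the C++ class ConnectionInfo; the connect function is injected so that the sketch runs without a server.

```python
import threading

class ConnectionPool:
    """Return an existing connection per worker, or open and remember a new one."""

    def __init__(self, connect):
        self._connect = connect          # function (host, port) -> connection
        self._connections = {}
        self._lock = threading.Lock()    # makes lookups and inserts thread-safe

    def get(self, host, port):
        key = (host, port)
        with self._lock:
            if key not in self._connections:
                self._connections[key] = self._connect(host, port)
            return self._connections[key]

    def close_all(self):
        with self._lock:
            self._connections.clear()

opened = []

def fake_connect(host, port):
    opened.append((host, port))          # record each real "connection start"
    return object()

pool = ConnectionPool(fake_connect)
c1 = pool.get("node1", 1234)
c2 = pool.get("node1", 1234)
c3 = pool.get("node2", 1234)
```

Only two connections are opened for the three requests, which is the saving the text refers to.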

Distribution of data
All distribution variants follow the same principle. Firstly the incoming tuple stream is distributed to local files on the master according to the distribution function of the operator. Each file contains a relation in a binary representation. The number of created files corresponds to the number of slots of the resulting array. After that, these files are copied to the workers in parallel over the worker connections. If the result of the operation is a darray, the binary file on the worker is imported into the database as a database object by sending a command to the worker. Finally, intermediate files are removed. In case of the failure of a worker, another worker is selected adapting the slot→worker mapping of the resulting distributed array.

The dmap family
Each variant of the dmap operator gets one or more distributed arrays, a name for the result's label, a function, and a port number. The last argument is omitted for the simple dmap operator. As described above, the implementation of an operator consists of several parts where the type mapping and the value mapping are the most interesting ones. By setting a special flag of the operator (calling the SetUsesArgsInTypeMapping function), the type mapping is fed not only with the arguments' types but additionally with the part of the query that leads to each argument. Both parts are provided as a nested list. It is checked whether the types are correct. The query part is only exploited for the function argument. It is slightly modified and delivered in the form of a text to the value mapping of the operator.
Within the value mapping it is checked whether the slot-worker assignment is equal for each argument array. If not, the slot contents are transferred between the workers to ensure the existence of corresponding slots on a single worker. In this process, workers communicate directly with each other. The master acts as a coordinator only. For the communication, the port given as the last argument to the dmapX operator is used. Note that copying the slot contents is available for distributed file arrays only but not for the darray type.
For each slot of the result, a Secondo command is created mainly consisting of the function determined by the type mapping applied to the current slot object(s). If the result type is a darray, a let command is created, a query creating a relation within a binary file otherwise. This command is sent to the corresponding worker. Each slot is processed within a single thread to enable parallel processing. Synchronization of different slots on the same worker is realized within the ConnectionInfo class.
At the end, any intermediate files are deleted.

Redistribution
Redistribution of data is realized as a combination of the partition operator followed by collect2 or areduce.
The partition operator distributes each slot on a single worker to a set of files. The principle is very similar to the first part of the ddistribute variants, where the incoming tuple stream is distributed to local files on the master. Here, the tuples of all slots on this worker are collected into a common tuple stream and redistributed to local files on this worker according to the distribution function.
At the beginning of the collect2 operator, a lightweight server is started on each worker that is used for file transfer between the workers. After this phase, a thread is created for each slot of the result darray. This thread collects all files associated with this slot from all other workers. The contents of these files are put into a tuple stream that is collected either into a relation or into a single file.
The areduce operator works as a combination of collect2 and dmap. The a in the operator name stands for adaptive, meaning that the number of slots processed by a worker depends on its speed. This is realized in the following way. Instead of one thread per slot, one thread per worker is created that performs the collect2-dmap functionality. When a thread has finished a slot, a callback function is used to signal this state. The worker that called the function is then assigned the next unprocessed slot.
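The adaptive assignment can be sketched in Python with one thread per worker pulling slots from a shared queue. The function name `areduce` and the use of a queue in place of the callback mechanism are assumptions made for illustration; `fn` stands in for the combined collect2-dmap work on one slot.

```python
import queue
import threading

def areduce(slots, workers, fn):
    """One thread per worker; each takes the next unprocessed slot when it is free."""
    todo = queue.Queue()
    for i in range(len(slots)):
        todo.put(i)
    results = [None] * len(slots)
    processed_by = {}

    def worker_loop(w):
        while True:
            try:
                i = todo.get_nowait()     # next unprocessed slot
            except queue.Empty:
                return                    # nothing left for this worker
            results[i] = fn(slots[i])     # stands in for collect2 followed by dmap
            processed_by[i] = w

    threads = [threading.Thread(target=worker_loop, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, processed_by

# Hypothetical example: four slots, two workers, summing each slot.
sums, who = areduce([[1, 2], [3], [4, 5, 6], []], ["w0", "w1"], sum)
```

A fast worker returns to the queue more often and therefore processes more slots, which is the adaptive behaviour described above. Which worker handles which slot is not deterministic.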

Fault tolerance
The possibility that single parts fail is inherent to parallel systems. The Distributed2Algebra provides some basic mechanisms to handle missing workers. Of course this is possible only if the required data are not stored exclusively at those workers. Because there are two array types, the system must be able to handle both files and database objects, i.e., relations. In particular, redundant storage and distributed access are required.
In Secondo there are already two algebras implementing these features. The DBService algebra is able to store relations as well as any dependent indexes redundantly on several servers. For redundant storage of files, the functions of the DFSAlgebra are used. If fault tolerance is switched on, created files and relations are stored at the desired worker and additionally passed to the corresponding parts of these algebras for replicated storage. If a worker fails, the created command is sent to another worker and the slot-worker assignment is adapted. If the slot content is not available there, the worker gets its input from the DFS or the DBService, respectively.
However, at the time of writing fault tolerance does not yet work in a robust way in Secondo and is still under development. It is also beyond the scope of this paper.

Application example: distributed density-based similarity clustering
In this section, we demonstrate how a fairly sophisticated distributed algorithm can be formulated in the framework of the Distributed Algebra. As an example, we consider the problem of density-based clustering as introduced by the classical DBScan algorithm [15]. Whereas the original DBScan algorithm was applied to points in the plane, we consider arbitrary objects together with a distance (similarity) function. Hence the algorithm we propose can be applied to points in the plane, using Euclidean distance as similarity function, but also to sets of images, Twitter messages, or sets of trajectories of moving objects with their respective application-specific similarity functions.

Clustering
Let S be a set of objects with distance function d. The distance must be zero for two equal objects; it grows according to the dissimilarity between objects. We recall the basic notions of density-based clustering. It uses two parameters MinPts and Eps. An object s from S is called a core object if there are at least MinPts elements of S within distance Eps from s, that is, $|N_{Eps}(s)| \geq MinPts$ where $N_{Eps}(s) = \{t \in S \mid d(s, t) \leq Eps\}$. It is called a border object if it is not a core object but within distance Eps of a core object.
An object p is directly density-reachable from an object q if q is a core object and $p \in N_{Eps}(q)$. It is density-reachable from q if there is a chain of objects $p_1, p_2, \ldots, p_n$ where $p_1 = q$, $p_n = p$ and $\forall\, 1 \leq i < n$: $p_{i+1}$ is directly density-reachable from $p_i$. Two objects p, r are density-connected if there exists an object q such that both p and r are density-reachable from q. A cluster is a maximal set of objects that are pairwise density-connected. All objects not belonging to any cluster are classified as noise.
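The definitions of core, border, and noise objects can be made concrete with a small brute-force Python sketch. It is meant only to illustrate the definitions; the function name `classify` is hypothetical, and the quadratic neighborhood search is nothing like the index-based search used later.

```python
def classify(objects, dist, eps, min_pts):
    """Label each object 'core', 'border', or 'noise' by brute force."""
    def neighborhood(s):
        # N_Eps(s): all objects within distance eps of s, including s itself.
        return [t for t in objects if dist(s, t) <= eps]

    core = {s for s in objects if len(neighborhood(s)) >= min_pts}
    labels = {}
    for s in objects:
        if s in core:
            labels[s] = "core"
        elif any(t in core for t in neighborhood(s)):
            labels[s] = "border"
        else:
            labels[s] = "noise"
    return labels

# Hypothetical example: points on a line with absolute difference as distance.
points = [0.0, 1.0, 2.0, 3.5, 10.0]
labels = classify(points, lambda a, b: abs(a - b), eps=1.5, min_pts=3)
```

In this example the points 1.0 and 2.0 are core objects, 0.0 and 3.5 are border objects, and 10.0 is noise.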

Overview of the algorithm
A rough description of the algorithm is as follows.
1. Compute a small subset of S (say, a few hundred elements) as so-called partition centers.
2. Assign each element of S to its closest partition center. In this way, S is decomposed into disjoint partitions. In addition, assign some elements of S not only to the closest partition center but also to partition centers a bit farther away than the closest one. The resulting subsets are not disjoint any more but overlap at the boundaries. Within each subset we can distinguish members of the partition and so-called neighbors.
3. Use a single machine DBScan implementation to compute clusters within each partition. Due to the neighbors available within the subsets, all elements of S can be correctly classified as core, border, or noise objects.
4. Merge clusters that extend across partition boundaries and assign border elements to clusters of a neighbor partition where appropriate.
In Step 2, the problem arises how to determine the neighbors of a partition. See Fig. 10a. Here u and v are partition centers; the blue objects are closest to u, the red objects are closest to v; the diagonal line represents equi-distance between u and v. When partition P(u) is processed in Step 3 by a DBScan algorithm, object s needs to be classified as a core or border object. To do this correctly, it is necessary to find object t within a circle of radius Eps around s. But t belongs to partition P(v). It is therefore necessary to include t as a neighbor into the set P'(u), the extension of partition P(u).
Hence we need to add to P'(u) those elements of P(v) that can lie within distance Eps from some object of P(u). Theorem 1 says that such objects can lie at most 2 · Eps further away from u than from their own partition center v. The proof is illustrated in Fig. 10b.

Theorem 1 Let $s, t \in S$ and $T \subset S$. Let $u, v \in T$ be the elements of T with minimal distance to s and t, respectively. Then $t \in N_{Eps}(s) \Rightarrow d(u, t) \leq d(v, t) + 2 \cdot Eps$.
Proof $t \in N_{Eps}(s)$ implies $s \in N_{Eps}(t)$. Let x be a location within $N_{Eps}(t)$ with equal distance to u and v, that is, $d(u, x) = d(v, x)$. Such locations must exist, because s is closer to u and t is closer to v.
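The bound of Theorem 1 can also be checked with the triangle inequality alone, without referring to an intermediate location x. The following chain is a sketch of such an alternative argument, not necessarily the argument intended above; it assumes only that d is a metric, that u is an element of T closest to s, and that $d(s, t) \leq Eps$.

```latex
\begin{align*}
d(u,t) &\le d(u,s) + d(s,t)        && \text{triangle inequality}\\
       &\le d(v,s) + Eps           && d(u,s) \le d(v,s),\ d(s,t) \le Eps\\
       &\le d(v,t) + d(t,s) + Eps  && \text{triangle inequality}\\
       &\le d(v,t) + 2 \cdot Eps   && d(t,s) = d(s,t) \le Eps
\end{align*}
```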
Hence to set up the relevant set of neighbors for each partition, we can include an object t into all partitions whose centers are within distance $d_t + 2 \cdot Eps$ of t, where $d_t$ is the distance from t to its closest partition center.

The algorithm
In more detail, the main algorithm consists of the following steps. Steps are marked as M if they are executed on the master, MW if they describe interaction between master and workers, and W if they are executed by the workers.
Initially, we assume the set S is present in the form of a distributed array T where elements of S have been assigned to fields in a random manner, but equally distributed (e.g., round robin).
As a result of the algorithm, all elements are assigned a cluster number or they are designated as noise.
algorithm SimilarityClustering
input: T - a distributed array containing a set of objects S
MinPts, Eps - parameters for density-based clustering
k - integer parameter for placing partition centers
output: X - a distributed array containing the elements of S augmented by cluster ids or a characterization as noise.
method:
1. MW Collect a sample subset SS ⊂ S from array T to the master, to be used in the following step.
2. M Based on SS, compute a subset PC ⊂ S as partition centers using algorithm SimilarityPartitioning (Sect. 6.6). Let $PC = \{pc_1, \ldots, pc_n\}$. Subsequently, S will be partitioned in such a way that each object is assigned to its closest partition center.
3. MW Share PC and some constant values with workers.
4. W Compute for each object s in $T_i$ its closest partition center $pc_j$ and the distance to it. Add to s attributes N and Dist representing the index j and the distance $d(s, pc_j)$. Further, compute for s all partition centers within distance $Dist + 2 \cdot Eps$ and add their indices in attribute N2. Repartition the resulting set of objects (tuples) by attribute N2, resulting in a distributed array V. The field $V_j$ now contains the objects of S closest to $pc_j$ (call this set $U_j$) plus some objects that are closer to other partition centers, but can be within distance Eps from an object in $U_j$ according to Theorem 1. The idea is that for each object $q \in U_j$ we can compute $N_{Eps}(q)$ within $V_j$ because $N_{Eps}(q) \subset V_j$. So we can determine correctly whether q is a core or a border object, even across the boundaries of partition U. Elements of $U_j$ are called members, elements of $V_j \setminus U_j$ neighbors of the partition U, respectively. An element of $V_j$ is a member iff N2 = N.
5. W To each set $V_j$ apply a DBScan algorithm using parameters MinPts and Eps. Objects within subset $U_j$ (members) will be correctly classified as core objects and border objects; for the remaining objects in $V_j \setminus U_j$ (neighbors) we don't care about their classification. Each object s from $V_j$ is extended by an attribute CID0 for the cluster number (-2 for noise) and a boolean attribute IsCore with value $|N_{Eps}(s)| \geq MinPts$. Cluster identifiers are transformed into globally unique identifiers by setting $CID = CID0 \cdot n + j$. The result is stored as $X_j$. The subset of $X_j$ containing the former members of $U_j$ is called $W_j$; $X_j \setminus W_j$ contains the neighbors of partition W. The remaining problem is to merge clusters that extend beyond partition boundaries.
6. W For each $q \in (X_j \setminus W_j)$ retrieve $N_{Eps}(q) \cap W_j$. For each $p \in N_{Eps}(q) \cap W_j$, insert tuple $(p, CID_p, IsCore_p, N_p, q)$ into a set $Neighbors_j$. Redistribute Neighbors, once by the P and once by the Q attribute into distributed arrays NeighborsByP and NeighborsByQ, respectively, to prepare a join with predicate P = Q.
7. W For each pair of tuples $(q, CID_q, IsCore_q, N_q, p) \in NeighborsByQ$, $(p, CID_p, IsCore_p, N_p, q) \in NeighborsByP$:
(a) If both p and q are core objects, generate a task $(CID_p, CID_q)$ to merge clusters with these numbers; store tasks in a distributed table Merge.
(b) If p is a core object, but q is not, generate a task $(q, N_q, CID_p)$ to assign to q the CID of p, since q is a boundary object of the cluster of p. Store such assignment tasks in a table Assignments.
(c) If p is not a core object, but q is, generate a task $(p, N_p, CID_q)$ to assign the CID of q to p.
end SimilarityClustering.

Tools for implementation
In the Secondo environment, we find the following useful tools for implementing this algorithm:
- Main memory relations
- A main memory M-tree
- A DBScan implementation relying on this M-tree
- A data structure for graphs in main memory

Memory Relation
A stream of tuples can be collected by an mconsume operation into a main memory relation which can be read, indexed, or updated. As long as enough memory is available, this is of course faster in query processing than using persistent relations.

M-tree
The M-tree [7] is an index structure supporting similarity search. In contrast to other index structures like R-trees it does not require objects to be embedded into a Euclidean space. Instead, it relies solely on a supplied distance function (which must be a metric). Secondo has persistent as well as main memory data types for M-trees. Operations used in the algorithm are
- mcreatemtree to create an M-tree index on a main memory relation,
- mdistRange to retrieve all objects within a given distance from a query object, and
- mdistScan to enumerate objects by increasing distance from a query object.
More precise descriptions of these and following operations can be found in the Appendix. The M-tree is used to support all the neighborhood searches in the algorithm.

DBScan
Secondo provides several implementations of the DBScan algorithm [15] for density-based clustering, using main memory R-trees or M-trees as index structure, with an implicit or explicit (user provided) distance function. An implicit distance function is registered with the type of indexed values. Here we use the version based on M-trees with the operator
- dbscanM It performs density-based clustering on a stream of tuples based on some attribute, extending the tuples by a cluster number or a noise identification.
This is used to do the local clustering within each partition.

Graph
There exist some variants of graph data structures (adjacency lists) in memory. Here we use the type mgraph2 with operations:
- createmgraph2 Creates a graph from a stream of tuples representing the edges, with integer attributes to identify source and target nodes, and a cost measure.
- mg2connectedcomponents Returns the connected components from the graph as a stream of edge tuples extended by a component number attribute.
The computation of connected components is needed in the final stage of the algorithm for the global merging of clusters.
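How connected components yield the global merging of clusters can be sketched in Python with a union-find structure. This is an illustrative stand-in, not the mgraph2 implementation: the function name `connected_components` is hypothetical, the edges are merge tasks between local cluster ids, and the result is a map from cluster id to component number rather than a stream of edge tuples.

```python
def connected_components(edges):
    """Map every node appearing in edges to a component number (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)   # the smaller id becomes the root

    return {x: find(x) for x in parent}

# Hypothetical merge tasks (CID_p, CID_q) between local cluster ids.
merge_tasks = [(3, 7), (7, 12), (5, 9)]
global_id = connected_components(merge_tasks)
```

Here the local clusters 3, 7, and 12 end up in one global cluster and 5 and 9 in another; the smallest local id of each component serves as the component number.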

Implementation
We now show for each step of the algorithm its implementation based on the Distributed Algebra. As an example, we use a set Buildings from OpenStreetMap data with the following schema:
Buildings(Osm_id: string, Code: int, Fclass: string, Name: text, Type: string, GeoData: region)
The data represent buildings in the German state of North Rhine-Westphalia (NRW); the GeoData attribute contains their polygonal shape. For clustering, we compute the center of the polygon. We assume a dense area if there are at least 10 buildings within a radius of 100 meters. The distributed array T is initially present in the database; the Workers relation exists as well. The database is already open.
We explain the implementation of the first steps in some detail and hope this is sufficient to let the reader understand also the remaining queries. All query processing operators can be looked up in the Appendix.
1. MW Collect a sample subset SS ⊂ S from array T to the master, to be used in the following step. Here the number of fields of T is determined by the size operator and shared with the workers. On each field, a random sample is taken by the some operator. The resulting streams are collected by dsummarize to the master and written there into a relation by the consume operator which is stored as SS.
The second line computes the set of partition centers PC, using SS and parameter k. The contents of the script SimilarityPartitioning.sec are shown in Sect. 6.6.

3. MW Share PC and some constant values with workers.
query share("PC", TRUE, Workers);
query share("MinPts", TRUE, Workers);
query share("Eps", TRUE, Workers);
query share("wgs84", TRUE, Workers);
query share("n", TRUE, Workers);
In lines 1-2, main memory objects on the master and on the workers are deleted and for each worker, a bound of 3600 MB is set for main memory data objects. In lines 4-5, at each worker, the set PC is set up as a main memory relation together with an M-tree index over the Pos attribute. Using the wgs84 geoid, distances can be specified in meters, consistent with the definition of Eps. Note that the distributed array T is only used to specify the set of workers; its field values are not used. These data structures are used in the next step in lines 7-16. For each field of T, for each tuple t representing an element s ∈ S the distance to the nearest partition center is computed (lines 10-11) and added to tuple t in attribute Dist; the index of the partition center is added in attribute N.

4. W Compute for each object s in $T_i$ its closest partition center $pc_j$ and the distance to it.
Tuples are further processed in the next loopjoin, determining for each tuple the elements of PC within $Dist + 2 \cdot Eps$; the current tuple is joined with all these tuples, keeping only their index in attribute N2.
Finally the resulting stream of tuples is repartitioned by attribute N 2. Slot sizes are balanced across workers to achieve similar loads per worker in the next step.
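The effect of Step 4 can be illustrated with a short Python sketch. It is not a translation of the Secondo query: the function name `assign_partitions` is hypothetical, a linear scan over the partition centers replaces the M-tree searches, and plain Euclidean distance (`math.dist`) replaces the geodetic distance.

```python
import math

def assign_partitions(objects, centers, eps, dist=math.dist):
    """Return rows (object, N, Dist, N2), one row per partition the object joins."""
    rows = []
    for s in objects:
        dists = [dist(s, pc) for pc in centers]
        n = min(range(len(centers)), key=dists.__getitem__)   # index of closest center
        d_min = dists[n]
        for n2, d in enumerate(dists):
            if d <= d_min + 2 * eps:          # bound from Theorem 1
                rows.append((s, n, d_min, n2))
    return rows

# Hypothetical example with two partition centers.
centers = [(0.0, 0.0), (10.0, 0.0)]
objects = [(1.0, 0.0), (4.5, 0.0), (9.0, 0.0)]
rows = assign_partitions(objects, centers, eps=1.0)
members = [r for r in rows if r[1] == r[3]]   # a row is a member iff N2 = N
```

Only the object at (4.5, 0.0), which lies near the boundary, produces two rows; it is a member of partition 0 and a neighbor in partition 1. Repartitioning the rows by N2 then yields the sets $V_j$.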

5. W To each set $V_j$ apply a DBScan algorithm using parameters MinPts and Eps.
Objects within subset $U_j$ (members) will be correctly classified as core objects and border objects; for the remaining objects in $V_j \setminus U_j$ (neighbors) we don't care about their classification. Each object s from $V_j$ is extended by an attribute CID0 for the cluster number (-2 for noise) and a boolean attribute IsCore with value $|N_{Eps}(s)| \geq MinPts$. Cluster identifiers are transformed into globally unique identifiers by setting $CID = CID0 \cdot n + j$. The result is stored as $X_j$. The subset of $X_j$ containing the former members of $U_j$ is called $W_j$; $X_j \setminus W_j$ contains the neighbors of partition W. The remaining problem is to merge clusters that extend beyond partition boundaries.
6. W For each $q \in (X_j \setminus W_j)$ retrieve $N_{Eps}(q) \cap W_j$. For each $p \in N_{Eps}(q) \cap W_j$, insert a tuple $(p, CID_p, IsCore_p, N_p, q)$ into a set $Neighbors_j$. An equivalent formulation is: For each $p \in W_j$ retrieve $N_{Eps}(p) \cap (X_j \setminus W_j)$. For each $q \in N_{Eps}(p) \cap (X_j \setminus W_j)$, insert a tuple $(p, CID_p, IsCore_p, N_p, q)$ into a set $Neighbors_j$. An advantage of the second formulation is that we need to search on the much smaller set $(X_j \setminus W_j)$ instead of $W_j$. As we will use a main memory index for this set, far less memory is needed and larger data sets can be handled. Redistribute Neighbors, once by the P and once by the Q attribute into distributed arrays NeighborsByP and NeighborsByQ, respectively, to prepare a join with predicate P = Q.
7. W For each pair of tuples $(q, CID_q, IsCore_q, N_q, p) \in NeighborsByQ$, $(p, CID_p, IsCore_p, N_p, q) \in NeighborsByP$:
(a) If both p and q are core objects, generate a task $(CID_p, CID_q)$ to merge clusters with these numbers; store tasks in a distributed table Merge.
(b) If p is a core object, but q is not, generate a task $(q, N_q, CID_p)$ to assign to q the CID of p, since q is a boundary object of the cluster of p. Store such assignment tasks in a table Assignments.
(c) If p is not a core object, but q is, generate a task $(p, N_p, CID_q)$ to assign the CID of q to p.
(d) If neither p nor q is a core object, leave their cluster numbers unchanged.
Redistribute assignments by the N attribute into distributed array Assignments.
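The case distinction of Step 7 and the formula for globally unique cluster identifiers can be sketched in Python. The function names `make_tasks` and `global_cid` and the flat tuple layout of the joined pairs are assumptions made for illustration; in the algorithm the pairs result from the join of NeighborsByP and NeighborsByQ.

```python
def global_cid(cid0, n, j):
    """Globally unique cluster id CID = CID0 * n + j for partition j of n partitions."""
    return cid0 * n + j

def make_tasks(pairs):
    """pairs: iterable of (p, cid_p, core_p, n_p, q, cid_q, core_q, n_q)."""
    merge, assignments = [], []
    for p, cid_p, core_p, n_p, q, cid_q, core_q, n_q in pairs:
        if core_p and core_q:
            merge.append((cid_p, cid_q))            # (a) merge the two clusters
        elif core_p:
            assignments.append((q, n_q, cid_p))     # (b) q joins the cluster of p
        elif core_q:
            assignments.append((p, n_p, cid_q))     # (c) p joins the cluster of q
        # (d) neither is a core object: nothing to do
    return merge, assignments

# Hypothetical pairs across partitions 0 and 1, with n = 3 partitions.
pairs = [
    ("a", global_cid(1, 3, 0), True, 0, "b", global_cid(1, 3, 1), True, 1),
    ("c", 3, True, 0, "d", 4, False, 1),
    ("e", 6, False, 0, "f", 7, False, 1),
]
merge, assignments = make_tasks(pairs)
```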

Balanced partitioning
In Step 2 of the algorithm SimilarityClustering, partition centers are determined. Since in parallel processing each partition will be processed in a task by some worker, partition sizes should be as similar as possible. This is the easiest way to balance workloads between workers. As partition sizes are solely determined by the choice of partition centers, a good placement of partition centers is crucial.
To adapt to the density of the data set S to be clustered, there should be more partition centers in dense areas than in sparse areas. We therefore propose the following strategy: Compute for each element s of S its radius r(s) as the distance to the k-th nearest neighbor, for some parameter k. We obtain for each s ∈ S a disk with radius r(s). The disk will be small in dense areas, large in sparse areas. Place these disks in some arbitrary order but non-overlapping into the underlying space. That is, a disk can be placed if it does not intersect any disk already present; otherwise it is ignored.
The algorithm is shown in Fig. 11. In practice, it is not necessary to apply the algorithm to the entire data set to be clustered. Instead, a small random sample can be selected that reflects the density distribution. In our experiments, we use a sample of size 10000. Figure 12 shows the result of the algorithm for the set of buildings in the German state of North Rhine-Westphalia. One can observe that small disks lie in the area of big cities.
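The disk placement strategy can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the algorithm of Fig. 11 verbatim: the function name `similarity_partitioning` is hypothetical, the k-th nearest neighbor is found by sorting all distances instead of using an index, an object is not counted as its own neighbor, and the sample must contain more than k elements.

```python
import math

def similarity_partitioning(sample, k, dist=math.dist):
    """Pick partition centers by placing non-overlapping k-nearest-neighbor disks."""
    # Radius of s: distance to its k-th nearest neighbor (s itself excluded).
    balls = []
    for i, s in enumerate(sample):
        others = sorted(dist(s, t) for j, t in enumerate(sample) if j != i)
        balls.append((s, others[k - 1]))

    centers = []   # accepted (position, radius) pairs
    for s, r in balls:
        # Accept the disk only if no already placed disk is closer than the sum of radii.
        if all(dist(s, c) >= r + rc for c, rc in centers):
            centers.append((s, r))
    return centers

# Hypothetical sample: a dense group, a second group, and an isolated point.
sample = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (30, 0)]
centers = similarity_partitioning(sample, k=1)
```

In the example, the isolated point receives a large disk while points in the dense groups receive small ones, so dense areas obtain more partition centers per unit of space.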

Implementation
An efficient implementation of this algorithm must rely on a data or index structure supporting k-nearest-neighbor search as well as distance range search. In Secondo, we can again use a main memory M-tree providing such operations.
In line 3, a main memory relation SSm is created from the sample SS. Next, a main memory M-tree index SSm_Pos_mtree indexing elements by Pos is built over SSm.
In lines 6-11, a main memory relation Balls is created where each tuple of SS is extended by an attribute Radius containing the distance to the kth-nearest neighbor. The distance is determined by an mdistScan operation which enumerates indexed tuples by increasing distance from the starting point, the position of the current tuple. The head operator stops requesting tuples from its predecessor after k elements; from its output via tail the last element is taken and the position value extracted.
In line 13 we determine the maximum radius of any element. In lines 1-2, an empty main memory relation PCm is created with the same schema as that of Balls. Also an index PCm_Pos_mtree is built over it, initially empty as well.
Lines 4-10 implement the second for each loop of algorithm SimilarityPartitioning. Each tuple from Balls is checked in the filter operator, using the condition that in a distance range search on the already present elements of PCm no tuples are found whose distance to this tuple is less than the sum of their radii. That is, their disks or balls would overlap. If no such tuple is found, the current tuple is inserted into PCm and the index over it.
Finally, from the main memory relation PCm a persistent relation PC with the partition centers is created, adding an index N, used in the main algorithm, and a circle for visualization.

Experimental evaluation
In this section we provide a brief experimental evaluation of the framework, addressing the quality of balanced partitioning, load balancing over workers, and speedup. A detailed evaluation of the clustering algorithm and comparison with competing approaches is left to future work.

Balanced partitioning
We consider the data set introduced in Sect. 6.5 of Buildings in the German state of NRW. There are 7842728 buildings. They are partitioned by the method of Sect. 6.6 yielding 123 partition centers as shown in Fig. 12. Each building is then assigned to its closest partition center (and possibly some more centers as explained in Step 4 of the algorithm). The total number of buildings assigned to slots is 8046065, so there are about 2.6 % duplicates assigned to several centers. The size distribution of the resulting partitions is shown in Fig. 13.
One can see that slot sizes are somewhat balanced in the sense that there are no extremely large or small slots. Nevertheless they vary quite a bit. To describe this variation, we introduce a measure called utilization. The term results from the idea that slots could be processed in parallel on different computers and the total time required is determined by the computer processing the largest slot. For the slot sizes S shown in Fig. 13, we have Util(S) = 50.75%. Hence assigning these slots directly to different computers would not be very efficient.
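One reading of utilization that is consistent with the description above is the ratio of the average slot size to the maximum slot size: if every computer needs as long as the one with the largest slot, this ratio is the fraction of the available capacity that is actually used. The following Python sketch computes it under that assumption; the function name `utilization` and the example sizes are hypothetical.

```python
def utilization(sizes):
    """Average load divided by maximum load, a value between 0 and 1 (assumed definition)."""
    if not sizes or max(sizes) == 0:
        raise ValueError("need at least one non-empty slot")
    return sum(sizes) / (len(sizes) * max(sizes))

slot_sizes = [40, 100, 70, 30]       # hypothetical slot sizes
util = utilization(slot_sizes)       # 240 / (4 * 100)
```

Under this reading, a value of 1 means all slots have the same size, and a value near 0.5 means that on average half of the capacity is idle while the largest slot is processed.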

Load balancing over workers
Fortunately, in our framework slots are distributed over workers so that each worker processes several slots sequentially. By the standard "round robin" assignment of slots to workers, different slot sizes already balance out to some extent. The resulting worker loads are shown in Fig. 14. Here we have Util(WL) = 67.7%.
A still better load balancing between workers can be achieved by the collectB operator. It assigns partitions to workers based on their size (number of tuples) when they are transferred from a dfmatrix. The algebra definition does not prescribe by which algorithm this is done. In our implementation, the following heuristic is used:
3. Traverse list S, assigning slots sequentially to standard workers. In each assignment, select a worker with the minimal load assigned so far.
4. Sort the worker loads descending by size into list A.
5. Traverse list A, removing from each assignment the last slot and assigning it to the reserve worker with the smallest assignment so far, until reserve worker loads get close to the average worker load (computed beforehand).
Here the basic strategy is to assign large slots first, small slots last to the worker with smallest load so far, which lets worker loads fill up equally. This happens in Steps 1 to 3. The last two steps 4 and 5 are motivated by the fact that sometimes in a relatively well balanced distribution there are a few workers with higher loads. The idea is to take away from them the last (small) assigned slots and move these to the reserve workers.
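The basic strategy of assigning large slots first to the least-loaded worker can be sketched in Python and compared with round robin. This is an illustration, not the collectB implementation: the function names are hypothetical, the slots are assumed to be sorted descending by size before traversal, and the reassignment to reserve workers (Steps 4 and 5) is omitted.

```python
import heapq

def assign_by_size(slot_sizes, n_workers):
    """Greedy: largest slot first, always to the worker with the smallest load so far."""
    heap = [(0, w) for w in range(n_workers)]      # (load, worker), already a valid heap
    assignment = {w: [] for w in range(n_workers)}
    order = sorted(range(len(slot_sizes)), key=lambda i: -slot_sizes[i])
    for i in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + slot_sizes[i], w))
    return assignment

def round_robin(slot_sizes, n_workers):
    """Slot i goes to worker i mod n_workers, regardless of size."""
    assignment = {w: [] for w in range(n_workers)}
    for i in range(len(slot_sizes)):
        assignment[i % n_workers].append(i)
    return assignment

def loads(assignment, slot_sizes):
    """Total size assigned to each worker."""
    return [sum(slot_sizes[i] for i in slots) for slots in assignment.values()]

sizes = [90, 10, 80, 20, 50, 40]          # hypothetical slot sizes
rr = loads(round_robin(sizes, 2), sizes)
greedy = loads(assign_by_size(sizes, 2), sizes)
```

For these hypothetical sizes, round robin yields worker loads of 220 and 70, whereas the size-based assignment yields 150 and 140. The small slots assigned last are what allow the loads to even out.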
We have evaluated these strategies in a series of experiments on the given example database with Buildings in NRW. We vary the size of the sample SS using sizes 10000, 20000, and 50000; for each size the partitioning and assignment algorithm is run three times. The parameter k is fixed to 50. Note that with increasing sample size the number of partitions grows, because the circle around each point that encloses its k closest neighbors gets smaller. Hence more circles fit into the same space. Due to the randomness of samples, the numbers of partitions and all results vary a bit between experiments. One can observe that we have about 3 slots per worker for sample size 10000 (as there are 40 workers), about 6 for 20000, and about 15 for 50000. The variation in slot sizes and the respective utilization (UtilSizes) remains at around 50% for the increasing number of partitions. However, the round robin utilization (UtilRR) improves from about 70% to about 85%.
Assignment descending by size (UtilS) is clearly better than round robin assignment and reaches already 95% for 6 slots per worker and 98% for 15 slots per worker. Using reserve workers and reassignment (UtilSR) can in some cases still improve utilization by a small percentage.
The fact that the partitioning algorithm returns slots of somewhat varying size is actually an advantage as having small slots allows one to fill up worker loads evenly. At the same time it is crucial not to have single slots that are extremely large.
In any case, by using enough slots per worker (e.g., 6 in this experiment) we can achieve an almost perfect load balancing in terms of the sizes of data to be processed.

Speedup
In this section we describe experiments with a larger data set to examine the speedup behaviour of the framework. Experiments are run on a small cluster consisting of 5 server computers, each with the following configuration:
- 8 cores, 32 GB main memory, 4 disks, Intel Xeon CPU E5-2630, running Ubuntu 18.04
- (up to) 8 workers, each using one core, 3.6 GB main memory, two workers sharing one disk
In addition, the master runs on one of the computers, using all memory, if needed. For the algorithm of this paper, the master uses almost no memory.
The data set to be clustered consists of the nodes of the OpenStreetMap data set for Germany. Each node defines a point in the plane; all geometries (e.g., roads, buildings, etc.) are defined in terms of nodes. There are 315,113,976 nodes. For clustering, we use the same parameters as in Sect. 6.5, namely Eps = 100 meters, MinPts = 10. In all experiments we use the same sample SS of size 30888 and parameter k = 100 which leads to 188 partitions.
The algorithm of Sect. 6.5 was run 4 times, for sets of 10, 20, 30, and 40 workers denoted W10, ..., W40. W10 is considered as a baseline and we observe the speedup achieved relative to W10. Table 2 shows the elapsed time for the 11 steps of the algorithm. Due to the fact that the same precomputed sample was used in all 4 experiments, the computation of SS is missing in Step 1, which would add about 53 seconds. One can observe that Steps 1, 2, 3, 8, 9, 10 have negligible running times. Note that the global computation on the master in Steps 8 through 10 is in no way a bottleneck.
The remaining steps we consider in more detail for 10 to 40 workers in Table 3. Here within each step the running times for queries are given by the names of the resulting objects. The right part of the table shows the respective speedups defined as time(W10)/time(Wx). The numbers are visualized in Fig. 15. Fig. 15a illustrates that by far most of the time is spent in the local DBScans (Step 5, X) and the initial partitioning of the data (Step 4, V). Regardless of running times, the right part of the table and Fig. 15b show the speedups for various queries. One can observe that computations involving shuffling of data have a weaker speedup (e.g., Step 6, NeighborsBy...). This is because for more workers there is more data exchange. But for most queries good speedups can be achieved, e.g., by a factor around 3 going from 10 to 40 workers.
The overall running times and speedups are shown in Table 4. Finally, Fig. 16 illustrates the result of the algorithm. The largest 3 clusters discovered have sizes of 158,798,786, 15,279,845, and 7,539,633, respectively. Fig. 16a shows the partition centers for Germany and two clusters at ranks 29 and 30 with 462,800 and 445,079 elements, respectively (of which only a few sample elements are selected for visualization). Figure 16b shows the bottommost cluster in more detail; the four local clusters that have been merged to the global cluster are illustrated by color. The boundaries of local clusters are defined by the Voronoi diagram over partition centers.

Conclusions
In this paper, we have proposed an algebra with formal semantics which allows a precise formulation of distributed algorithms or distributed query processing in general. It is based on the simple and intuitive concept of a distributed array, an array whose fields lie on and are processed by different computers. The algebra focuses on the aspect of distribution and is generic with respect to the possible field types or operations on them. It does, however, provide some specific operations to deal with collections of objects represented as relations. Otherwise, field types and operations are supplied by some single server database system, called the basic engine in this paper. Different such systems may be used in principle.
It would not be satisfactory to present such an algebra without demonstrating its application to formulate distributed algorithms. Therefore, we have included a fairly advanced algorithm for distributed clustering. The algorithm is interesting in its own right: It includes a new technique for purely distance-based partitioning using any metric similarity function and it is the first precise distributed algorithm for density-based similarity clustering relying only on distance.
Fig. 16 a Partition centers for Germany and two clusters. b One cluster in detail, composed of four local clusters
The formulation of the algorithm shows a new style of describing distributed algorithms. In addition to a precise mathematical formulation, it is possible to show the complete implementation in terms of high level operations of a database system with defined semantics, either of the distributed algebra or of the basic engine. One can see precisely which data structures and algorithms are used. This is in contrast to many published algorithms where certain steps are only vaguely described and hard to understand.
The framework has been implemented and is publicly available. In a brief experimental evaluation, we have studied the variation of partition sizes in the distance based partitioning, load balancing over workers, and speedup. The results show that partition sizes vary but are not extreme, and load balancing over workers can provide almost perfect load distribution, using a sufficient number of slots. Here it is crucial that the number of slots of a distributed array can be chosen independently from the number of workers. Finally, a good linear speedup is achieved for most queries.
Future work may address the following aspects:
- Provide fault tolerance for the distributed persistent database, for intermediate results in files, and for intermediate results in memory. For the persistent database and memory data, fault tolerance must maintain extensibility, that is, support arbitrary new indexes and other data types that are added to the basic engine.
- The presented algebra offers a basic generic layer for distributed query processing. On top of it, more specialized layers may be added. This may include an algebra for distributed relations, providing several partitioning techniques, keeping track of partitioning in the data type, handling duplicates in spatial partitioning, and repartitioning automatically for joins. Another algebra may handle updates on distributed relations. All of this can be expressed in the Distributed Algebra, but will be easier to use at the level of the higher-level algebras.
- Provide an SQL level with cost-based optimization, handling of spatial partitioning in at least two and three dimensions (which includes moving objects), and spatial duplicate elimination.
- The given distributed arrays are static in their mapping of slots to workers. Provide dynamic distributed arrays which can adapt to a dataset whose density changes under updates, as well as to changing available resources.
- Embed other database systems such as PostgreSQL/PostGIS or MySQL in the role of basic engines.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.