Case Study: Cache Efficient Algorithms for Matrix Operations

Michael Bader

Part of the Texts in Computational Science and Engineering book series (TCSE, volume 9)

Abstract

In Chaps. 10 and 11, we discussed applications of space-filling curves for parallelisation, which were motivated by their locality properties. In the following two chapters, we will discuss further applications, which again exploit the intrinsic locality properties of space-filling curves. As both applications will focus on inherently cache-efficient algorithms, we will start with a short introduction to cache-memory architectures, and discuss the resulting requirements on cache-efficient algorithms.

13.1 Cache Efficient Algorithms and Locality Properties

In computer architecture, a so-called cache (or cache memory) denotes a fast memory component that replicates a certain part of the main memory to allow faster access to the cached (replicated) data. Such caches are necessary, because standard memory hardware is nowadays much slower than the CPUs. The access latency, i.e. the time between a data request and the arrival of the first requested piece of data, is currently about 60–70 ns (see Footnote 1). During that time, a fast CPU core can perform more than 100 floating point operations. This so-called "memory gap" between CPU speed and main memory is constantly getting worse, because CPU speed is improving much faster (especially due to multicore processors) than memory latency. For memory bandwidth (i.e. the maximum rate of data transfer from memory to CPU), the situation is a bit better, but there are already many applications in scientific computing whose performance is limited by memory bandwidth instead of CPU speed.

Cache memory can be made much faster, and can even keep up with CPU speed, but only if it is much smaller than typical main memory. Hence, modern workstations use a hierarchy of cache memory: a very small, so-called first-level cache, running at the same speed as the CPU; a second-level cache that is larger (typically a few megabytes), but only running at half speed, or less; and possibly further levels that are again larger, but slower. Figure 13.1 illustrates such a pyramid of cache memory. The situation is further complicated on multi- and manycore CPUs, because cache memory can then be shared between the CPU cores of one processor, and, in addition, we may have non-uniform memory access (NUMA) to main memory. Figure 13.2 shows a schematic diagram of four quadcore CPUs, where in each CPU, all cores share a common level-3 cache, but have individual level-1 and level-2 caches. Each CPU has a separate memory controller (MC), which is connected to a part of main memory – access to non-local memory runs via an interconnect (IC) between the four processors. As a result, CPUs will have different access speeds to different parts of the main memory.
Fig. 13.1

A typical pyramid of hierarchical cache memory

Fig. 13.2

Four quadcore processors (with shared L3 cache) with a NUMA interconnect (IC)

Memory-Bound Performance and Cache Memory

The size and speed of the individual cache levels will have an interesting effect on the runtime of software. In the classical analysis of the runtime complexity of algorithms, a typical task is to determine how the runtime depends on the input size, which leads to the well-known \(\mathcal{O}(N)\) considerations. In such an analysis, we often assume that all operations are executed at the same speed, which is no longer true on cache-based systems.

Instead, we often observe a situation as in Fig. 13.3. For very small problem sizes, all data resides in the first-level cache, and the algorithm runs at top speed. As soon as the data no longer fits into the level-1 cache, the speed is reduced to a level determined by the level-2 cache. This step-like decrease of execution speed is repeated for every level of the memory hierarchy, and leads to the performance profile given in Fig. 13.3. While the complexity of an algorithm will not change in the \(\mathcal{O}(N)\)-sense, such large differences in execution speed cannot be ignored in practice. Hence, we need to make implementations cache efficient, in order to fully exploit the available performance.
Fig. 13.3

Performance of an optimized vector operation (daxpy-operator: \(y := \alpha x + y\), \(\alpha \in \mathbb{R}\)) and of a naive implementation of matrix multiplication for increasing problem size. Note the substantial performance drops and the different performance levels, which result from data fitting into the cache levels

How Cache Memory Works: Cache Lines and Replacement Strategies

Once we use more data than we can fit into the cache, we need a mechanism to decide what data should be in the cache at what time. For such cache strategies, there are a couple of technical and practical restrictions:
  • For technical reasons, caches do not store individual bytes or words, but small contiguous chunks of memory, so-called cache lines, which are always transferred to and from memory as one block. Hence, cache lines are mapped to corresponding lines in main memory. To simplify (and, thus, speed up) this mapping, memory lines can often be transferred only to a small subset of cache lines: we speak of an n-associative cache, if a memory line can be kept in n different cache lines. The simplest case, a 1-associative cache, is called a direct-mapped cache; a fully associative cache, where memory lines may be kept in any cache line, is powerful, but much more expensive to build (if the cache access needs to stay fast). A small sketch of this mapping follows after this list.

  • If we want to load a specific cache line from memory, but already have a full cache, we naturally have to remove one cache line from the cache. In an ideal case, we would only remove cache lines that are no longer accessed. As the cache hardware can only guess the future access pattern, certain heuristics are used to replace cache lines. Typical strategies are to remove the cache line that was not used for the longest time (least recently used), or which had the fewest accesses in the recent history (least frequently used).

  • A programmer typically has almost no influence on which data is kept in the cache; only for loading data into the cache are so-called prefetching commands sometimes available.
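
The mapping of memory lines to cache lines can be made concrete with a small sketch. The following example is not taken from the text; the sizes (64-byte lines, 32 KiB capacity) are illustrative assumptions. It computes the cache set an address maps to in an ASSOC-way set-associative cache; with ASSOC set to 1 it models a direct-mapped cache.

    #include <stdint.h>

    /* Illustrative sketch (assumed parameters): which cache set does a memory
     * address map to in an ASSOC-way set-associative cache?                  */
    enum { LINE_SIZE = 64, CACHE_SIZE = 32 * 1024, ASSOC = 1 };  /* ASSOC = 1: direct-mapped */

    static inline uint64_t cache_set_index(uint64_t address) {
        uint64_t num_sets    = CACHE_SIZE / (LINE_SIZE * ASSOC); /* here: 512 sets           */
        uint64_t memory_line = address / LINE_SIZE;              /* line containing address  */
        return memory_line % num_sets;   /* this memory line may only be placed into the
                                            ASSOC cache lines of this one set               */
    }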

Associativity, replacement strategy, and other hardware properties may, of course, vary between the different cache levels. It is also clear that these restrictions have an influence on how efficiently the caches can be used.

Cache Memory and Locality

Caches lead to a speedup, if repeatedly accessed data is kept in the cache, and is thus accessed faster. Due to the typical cache design, algorithms and implementations will be cache efficient, if their data access pattern has good temporal or spatial locality properties:
  • Temporal locality means that a single piece of data will be repeatedly accessed during a short period of time. Replacement strategies such as least recently used or least frequently used will then reduce the probability of removing this data from the cache to a minimum.

  • Spatial locality means that after an access to a data item, the next access(es) will be to items that are stored at neighbouring memory addresses. If such an item belongs to the same cache line as the previously accessed item, it has already been loaded into the cache as well, and can be accessed efficiently.

Hence, the cache efficiency of an algorithm depends on its temporal and spatial locality properties.

Cache-Aware and Cache-Oblivious Algorithms

Cache-efficient implementations or algorithms can be classified into two categories, depending on whether they consider the exact cache architecture of a platform:
  • Cache-aware algorithms or implementations use detailed information about the cache architecture. They try to increase the temporal or spatial locality by adapting the access pattern to exactly fit the number and size of the cache levels, the length of the cache line, etc. Hence, such a cache-aware implementation is specific to a particular platform, and at least certain parameters need to be re-tuned if the architecture changes.

  • Cache-oblivious algorithms, in contrast, do not use explicit knowledge of a specific architecture. Instead, the algorithms are designed to be inherently cache efficient, and profit from the presence of caches regardless of their size and the number of cache levels. Hence, their data access patterns need to have excellent temporal and spatial locality properties.

We have seen that space-filling curves are an excellent tool to create data structures with good locality properties. Hence, we will discuss some examples of how these properties can be exploited to obtain cache-oblivious algorithms.

13.2 Cache Oblivious Matrix-Vector Multiplication

For a vector \(x = (x_{1},\ldots,x_{n}) \in \mathbb{R}^{n}\) and an \(n \times n\)-matrix A with elements \(a_{ij}\), the elements of the matrix-vector product y = Ax are given as
$$y_{i} := \sum_{j=1}^{n} a_{ij}\, x_{j}.$$
The matrix-vector product is a standard task in linear algebra, and it is also not difficult to implement, which makes it a popular exercise in introductory programming lectures. We can assume that most implementations will look similar to the one given in Algorithm 13.1 (omitting the initialisation of the vector y).
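
Algorithm 13.1 itself is not reproduced here; the following C sketch shows the kind of loop-based implementation the text has in mind (row-major storage of A, with y assumed to be initialised to zero).

    /* Loop-based matrix-vector product y = A*x in the spirit of Algorithm 13.1;
     * A is stored row-major, y is assumed to be initialised to zero. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++)          /* row index    */
            for (int j = 0; j < n; j++)      /* column index */
                y[i] += A[i * n + j] * x[j];
    }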

To determine the cache efficiency of this algorithm, let’s now examine the temporal and spatial locality properties of this implementation. Regarding the matrix A, we notice that each element is accessed exactly once. Hence, temporal locality of the access will not be an issue, as no element will be reused. The spatial locality depends on the memory layout of the matrix. If the elements are stored in the same sequence as they are traversed by the two nested loops, the spatial locality will be optimal. Hence, for Algorithm 13.1, the matrix elements should be stored row-by-row (so-called row-major layout). However, if our programming language uses column-major layout (as in FORTRAN, for example), we should change Algorithm 13.1: the spatial locality would then be optimal, if we exchange the i- and j-loop. In general, spatial locality is perfect as long as we traverse the matrix elements in the same order as they are stored in memory.

Changing the traversal scheme of the matrix elements will, however, also change the access to the two vectors x and y. As both vectors are accessed or updated n times throughout the execution of the algorithm, both temporal and spatial locality are important. For the loop-based access given by Algorithm 13.1, the access to both vectors is spatially local, even if we exchange the two loops. However, the temporal locality is different for the two vectors. The access to vector y is optimal: here, all n accesses to a single element are executed, before we move on to the next element. For the access to x, we have exactly the opposite situation: before an element of x is re-used, we first access all other elements of the vector. Hence, the temporal locality for this pattern is the worst we can find. A first interesting question is whether exchanging the loops will improve the situation. Then, x and y will change roles, and which option is faster depends on whether temporal locality is more important for the read access to x or for the write access to y. Even more interesting is the question whether we can derive a traversal of the matrix elements that leads to more local access patterns for both vectors.

Matrix Traversals Using Space-Filling Curves

In Algorithm 13.1, we easily see that the two loops may be exchanged. Actually, there is no restriction at all concerning the order in which we execute the element operations. We should therefore restate Algorithm 13.1 in the following form:
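
Algorithm 13.2 is not reproduced here; as a sketch of this element-operation view, the update below may be executed for the n² index pairs (i, j) in any order. The array order stands for an arbitrary permutation of all pairs and is only an illustrative device, not part of the original algorithm.

    typedef struct { int i, j; } IndexPair;

    /* Element-operation form of the matrix-vector product: perform the update
     * y[i] += a(i,j)*x[j] exactly once for every index pair, in any order. */
    void matvec_traversal(int n, const double *A, const double *x, double *y,
                          const IndexPair *order) {  /* any permutation of the n*n pairs */
        for (long t = 0; t < (long)n * n; t++) {
            int i = order[t].i, j = order[t].j;
            y[i] += A[i * n + j] * x[j];             /* A still addressed row-major here */
        }
    }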

Starting from Algorithm 13.2, we can now consider execution orders without being constrained by loop constructs. We just need to make sure that each matrix element is processed exactly once, i.e. that we perform a traversal, and can then try to optimise the temporal and spatial locality properties. Our previous discussion of the locality properties of space-filling curves provides us with an obvious candidate.

Assuming that the matrix dimension n is a power of two, we can use a Hilbert iteration, for example, to traverse the matrix. Hence, we modify the traversal algorithm of Sect. 3.2 to obtain an algorithm for matrix-vector multiplication. For that purpose, it is sufficient to interpret our direction operations up, down, left, and right as index operations on the matrix and the two vectors. Operators up and down will increase or decrease i, i.e., the current row index in A and vector index in y. Operators left and right will change j, which is the column index of A and the vector index of x, respectively. Algorithm 13.3 outlines the full algorithm for matrix-vector multiplication.
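
Algorithm 13.3 is formulated directly in terms of the direction operations and is not reproduced here. The following sketch conveys the same idea using a standard bit-manipulation conversion from a Hilbert-curve position to grid coordinates (not the book's recursive traversal): the matrix is stored in Hilbert order, and the running position d simultaneously addresses the matrix element and determines the row and column indices used for y and x.

    /* Convert Hilbert-curve position d (0 <= d < n*n, n a power of two) into
     * grid coordinates (*x, *y); standard iterative bit-manipulation scheme. */
    void hilbert_d2xy(int n, int d, int *x, int *y) {
        int rx, ry, t = d;
        *x = *y = 0;
        for (int s = 1; s < n; s *= 2) {
            rx = 1 & (t / 2);
            ry = 1 & (t ^ rx);
            if (ry == 0) {                     /* rotate/flip the quadrant */
                if (rx == 1) { *x = s - 1 - *x; *y = s - 1 - *y; }
                int tmp = *x; *x = *y; *y = tmp;
            }
            *x += s * rx;
            *y += s * ry;
            t /= 4;
        }
    }

    /* y += A*x with A stored in Hilbert order (n a power of two).  A_hil[d] is,
     * by construction, the matrix entry at the d-th cell of the Hilbert
     * traversal; its row and column indices address y and x, respectively. */
    void hilbert_matvec(int n, const double *A_hil, const double *x, double *y) {
        for (int d = 0; d < n * n; d++) {
            int col, row;
            hilbert_d2xy(n, d, &col, &row);
            y[row] += A_hil[d] * x[col];
        }
    }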

To retain the optimal locality of the access to the matrix elements, the matrix elements need to be stored according to the Hilbert order, as well. For the accesses to the vectors x and y, we obtain an overall improved temporal locality. During the matrix traversal, all elements of a \(2^{k} \times 2^{k}\)-block will be processed before the Hilbert order moves on to the next \(2^{k} \times 2^{k}\)-block. During the corresponding \((2^{k})^{2}\) element operations, \(2^{k}\) elements of vector x and \(2^{k}\) elements of vector y will be accessed – each of them \(2^{k}\) times. On average, we will therefore execute \(m^{2}\) operations on a working set of only 2m elements – for any m ≤ n. Hence, even if only a small number of elements fits into a given cache, we can guarantee re-use of the elements of both x and y. Our “naive” implementation in Algorithm 13.1 guarantees this only for one of the two vectors.

13.3 Matrix Multiplication Using Peano Curves

For two \(n \times n\)-matrices A and B, the elements of the matrix product \(C = \mathit{AB}\) are given as
$$C_{ik} := \sum_{j=1}^{n} A_{ij}\, B_{jk}.$$
Similar to Algorithm 13.1 for the matrix-vector product, we could implement the computation of all \(C_{ik}\) via three nested loops – an inner loop over j to compute an individual element \(C_{ik}\), and two outer loops over i and k, respectively – compare Algorithm 13.4.
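
Algorithm 13.4 is not reproduced here; a straightforward loop-based sketch (row-major storage, with C assumed to be initialised to zero) could look as follows.

    /* Naive triple-loop matrix multiplication C += A*B for n x n matrices
     * in row-major storage, in the spirit of Algorithm 13.4. */
    void matmul_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * n + k] += A[i * n + j] * B[j * n + k];
    }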

Again, we can exchange the three loops arbitrarily. However, for large matrices, the cache efficiency will be less than optimal for any loop order. In library implementations, multiple optimisation methods, such as loop unrolling or blocking and tiling, are applied to improve the performance of matrix multiplication. The respective methods carefully change the execution order to match the cache levels – in particular the sizes of the individual caches (see the references section at the end of this chapter). In the following, we will instead discuss an approach based on Peano curves, which leads to a cache-oblivious algorithm.

As for matrix-vector multiplication, we start with Algorithm 13.5, where we stress that we simply have to execute \(n^{3}\) updates \(C_{ik} = C_{ik} + A_{ij}B_{jk}\) for all triples (i, j, k). Thus, matrix multiplication is again an instance of a traversal problem.
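
Algorithm 13.5 is likewise not reproduced here; as an illustrative sketch, the n³ updates can be written as a traversal over an arbitrary permutation of all index triples (the array order is again only an illustrative device).

    typedef struct { int i, j, k; } Triple;

    /* Element-operation form of matrix multiplication: execute the update
     * C[i][k] += A[i][j]*B[j][k] exactly once for every triple (i,j,k). */
    void matmul_traversal(int n, const double *A, const double *B, double *C,
                          const Triple *order) {    /* any permutation of all n^3 triples */
        for (long t = 0; t < (long)n * n * n; t++) {
            int i = order[t].i, j = order[t].j, k = order[t].k;
            C[i * n + k] += A[i * n + j] * B[j * n + k];
        }
    }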

The \(n^{3}\) element operations correspond to a 3D traversal of the triple space. For the access to the matrix elements, in contrast, only two out of three indices are needed for each of the involved matrices. Hence, the indices are obtained via respective projections. We will therefore use a Peano curve for the traversal, because the projections of the classical 3D Peano curve to 2D will again lead to 2D Peano curves. In Fig. 13.4, this property is illustrated for the vertical direction. It holds for the other two coordinate directions, as well. Figure 13.4 also tells us that if we execute the matrix operations \(C_{ik} = C_{ik} + A_{ij}B_{jk}\) according to a 3D Peano order, the matrix elements will be accessed according to 2D Peano orders. As a consequence, we should store the elements in that order, if possible. We will first try this for the simple case of a 3 × 3 matrix multiplication:
$$\underbrace{\left(\begin{array}{ccc} c_{0} & c_{5} & c_{6} \\ c_{1} & c_{4} & c_{7} \\ c_{2} & c_{3} & c_{8} \end{array}\right)}_{=:\,C} \;+=\; \underbrace{\left(\begin{array}{ccc} a_{0} & a_{5} & a_{6} \\ a_{1} & a_{4} & a_{7} \\ a_{2} & a_{3} & a_{8} \end{array}\right)}_{=:\,A} \underbrace{\left(\begin{array}{ccc} b_{0} & b_{5} & b_{6} \\ b_{1} & b_{4} & b_{7} \\ b_{2} & b_{3} & b_{8} \end{array}\right)}_{=:\,B} $$
(13.1)
(here, the element indices indicate the 2D Peano element order). Then, the 3D Peano traversal of the element operations will lead to the following sequence of element updates:
$$ \begin{array}{lllll} {c}_{0}\,+=\,{a}_{0}{b}_{0} & {c}_{5}\,+=\,{a}_{6}{b}_{3}\rightarrow &{c}_{5}\,+=\,{a}_{5}{b}_{4} & {c}_{6}\,+=\,{a}_{5}{b}_{7}\rightarrow &{c}_{6}\,+=\,{a}_{6}{b}_{8} \\ \qquad \downarrow &\qquad \uparrow &\qquad \downarrow &\qquad \uparrow &\qquad \downarrow \\ {c}_{1}\,+=\,{a}_{1}{b}_{0} & {c}_{4}\,+=\,{a}_{7}{b}_{3} & {c}_{4}\,+=\,{a}_{4}{b}_{4} & {c}_{7}\,+=\,{a}_{4}{b}_{7} & {c}_{7}\,+=\,{a}_{7}{b}_{8} \\ \qquad \downarrow &\qquad \uparrow &\qquad \downarrow &\qquad \uparrow &\qquad \downarrow \\ {c}_{2}\,+=\,{a}_{2}{b}_{0} & {c}_{3}\,+=\,{a}_{8}{b}_{3} & {c}_{3}\,+=\,{a}_{3}{b}_{4} & {c}_{8}\,+=\,{a}_{3}{b}_{7} & {c}_{8}\,+=\,{a}_{8}{b}_{8} \\ \qquad \downarrow &\qquad \uparrow &\qquad \downarrow &\qquad \uparrow & \\ {c}_{2}\,+=\,{a}_{3}{b}_{1} & {c}_{2}\,+=\,{a}_{8}{b}_{2} & {c}_{3}\,+=\,{a}_{2}{b}_{5} & {c}_{8}\,+=\,{a}_{2}{b}_{6} & \\ \qquad \downarrow &\qquad \uparrow &\qquad \downarrow &\qquad \uparrow & \\ {c}_{1}\,+=\,{a}_{4}{b}_{1} & {c}_{1}\,+=\,{a}_{7}{b}_{2} & {c}_{4}\,+=\,{a}_{1}{b}_{5} & {c}_{7}\,+=\,{a}_{1}{b}_{6} & \\ \qquad \downarrow &\qquad \uparrow &\qquad \downarrow &\qquad \uparrow & \\ {c}_{0}\,+=\,{a}_{5}{b}_{1}\rightarrow &{c}_{0}\,+=\,{a}_{6}{b}_{2} & {c}_{5}\,+=\,{a}_{0}{b}_{5}\rightarrow &{c}_{6}\,+=\,{a}_{0}{b}_{6} & \\ \end{array} $$
(13.2)
Fig. 13.4

Illustration of the projection property of the 3D Peano curve

Note that this scheme follows an inherently local access pattern to the elements: after each element operation, the next operation will re-use one element and access two elements that are direct neighbours of the elements accessed in the previous operation. To extend this simple 3 ×3-scheme to a multiplication algorithm for larger matrices, we need
  • A 2D Peano order that defines the data structure for the matrix elements;

  • A recursive extension of the 3 ×3-scheme in Eq. 13.2, which is basically obtained by using matrix blocks instead of elements;

  • A concept for matrices of arbitrary size, as the standard Peano order will only work for matrices of size \({3}^{k} \times {3}^{k}\).

The 2D Peano order for the elements is derived in a straightforward manner from the iterations of the Peano curve. The respective construction is illustrated in Fig. 13.5. The pattern symbols P, Q, R, and S now denote a numbering scheme for the corresponding subblock in the matrix.
Fig. 13.5

Recursive construction of the Peano element order to store matrices
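
One possible way to compute this Peano element order is sketched below. The flip conventions are reconstructed from the element numbering of Eq. (13.1) and the block patterns of Eq. (13.3) and are an assumption of this sketch rather than code from the book: pattern P corresponds to no reflection, Q to reversed columns, R to reversed rows, and S to both.

    /* Peano-order storage index of element (i, j) of an n x n matrix, n = 3^k,
     * k >= 1.  Reconstructed sketch: fr/fc track whether the current pattern
     * (P, Q, R, S) reverses the row or column direction. */
    size_t peano_index(size_t i, size_t j, size_t n) {
        size_t index = 0;
        int fr = 0, fc = 0;                        /* pattern P at the top level      */
        for (size_t m = n / 3; ; m /= 3) {
            size_t r = i / m, c = j / m;           /* geometric sub-block coordinates */
            i %= m;  j %= m;
            size_t rt = fr ? 2 - r : r;            /* traversal-relative coordinates  */
            size_t ct = fc ? 2 - c : c;
            size_t pos = (ct % 2 == 0) ? 3 * ct + rt          /* serpentine over the  */
                                       : 3 * ct + (2 - rt);   /* 3 x 3 grid of blocks */
            index = 9 * index + pos;
            fr ^= (int)(ct % 2);                   /* child pattern, cf. Fig. 13.5    */
            fc ^= (int)(rt % 2);
            if (m == 1) break;
        }
        return index;
    }

Under these assumptions, n = 3 reproduces the element numbering of Eq. (13.1), and n = 9 reproduces the block numbering of Eq. (13.3), with the elements of each block stored contiguously.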

13.3.1 Block-Recursive Peano Matrix Multiplication

Let us now extend the 3 × 3 multiplication scheme of Eq. (13.2) to a scheme for block matrices stored in Peano order. In a first step, we write the matrices as 3 × 3 block matrices:
$$\left (\begin{array}{ccc} {P}_{A0} & {R}_{A5} & {P}_{A6} \\ {Q}_{A1} & {S}_{A4} & {Q}_{A7} \\ {P}_{A2} & {R}_{A3} & {P}_{A8} \end{array} \right )\left (\begin{array}{ccc} {P}_{B0} & {R}_{B5} & {P}_{B6} \\ {Q}_{B1} & {S}_{B4} & {Q}_{B7} \\ {P}_{B2} & {R}_{B3} & {P}_{B8} \end{array} \right ) = \left (\begin{array}{ccc} {P}_{C0} & {R}_{C5} & {P}_{C6} \\ {Q}_{C1} & {S}_{C4} & {Q}_{C7} \\ {P}_{C2} & {R}_{C3} & {P}_{C8} \end{array} \right ).$$
(13.3)
Here, we named each matrix block according to its numbering scheme (see Fig. 13.5), and indicated the name of the global matrix and the relative position of the block in the Peano order as indices. The element operations of Eq. 13.2 now lead to multiplication of matrix blocks, as in \({P}_{C0}\,+=\,{P}_{A0}{P}_{B0}\), \({Q}_{C1}\,+=\,{Q}_{A1}{P}_{B0}\), \({P}_{C2}\,+=\,{P}_{A2}{P}_{B0}\), etc.
If we just examine the numbering patterns of the matrix blocks, we see that there are exactly eight different types of block multiplications:
$$\begin{array}{lclclcl} P\,+=\,\mathit{PP} &\qquad \qquad &Q\,+=\,\mathit{QP}&\qquad \qquad &R\,+=\,\mathit{PR}&\qquad \qquad &S\,+=\,\mathit{QR} \\ P\,+=\,\mathit{RQ}&&Q\,+=\,\mathit{SQ} &&R\,+=\,\mathit{RS} &&S\,+=\,\mathit{SS}.\end{array}$$
(13.4)
For the block operation \(P\,+=\,\mathit{PP}\), we have already derived the necessary block operations and their optimal sequence of execution in Eq. (13.2). For the other seven types of block multiplications, it turns out that we obtain similar execution patterns, and that no further block operations arise besides those already given in Eq. (13.4). Hence, we obtain a closed system of eight recursive multiplication schemes.

13.3.2 Memory Access Patterns During the Peano Matrix Multiplication

Our next task is to derive execution sequences for the other seven multiplication schemes: \(Q\,+=\,\mathit{QP}\), \(R\,+=\,\mathit{PR}\), etc. We will leave the details for Exercise 13.2, and just state that each multiplication scheme leads to an execution order similar to that given in Eq. (13.2). In addition, all eight execution orders follow the same structure as the scheme for \(P\,+=\,\mathit{PP}\), except that for one, two, or even all three of the involved matrices, the access pattern runs backwards. Table 13.1 lists for the eight different schemes which of the access patterns run backwards.
Table 13.1

Execution orders for the eight different block multiplication schemes. ‘ + ’ indicates that the access pattern for the respective matrix A, B, or C is executed in forward direction (from element 0 to 8). ‘ − ’ indicates backward direction (starting with element 8)

Block scheme:    P+=PP   P+=RQ   Q+=QP   Q+=SQ   R+=PR   R+=RS   S+=QR   S+=SS
Access to A:       +       +       +       +       −       −       −       −
Access to B:       +       +       −       −       +       +       −       −
Access to C:       +       −       +       −       +       −       +       −

All eight schemes can be implemented by using only increments and decrements by 1 on the Peano-ordered element indices. Hence, jumps in memory are completely avoided throughout the computation of a 3 × 3 block. Our next step is therefore to make sure that jumps are also avoided between consecutive block schemes. As an example, we examine the transition between the first two block operations in a \(P\,+=\,\mathit{PP}\) scheme: we assume that we just finished the operation \({P}_{C0}\,+=\,{P}_{A0}{P}_{B0}\), and would now go on with \({Q}_{C1}\,+=\,{Q}_{A1}{P}_{B0}\).
  • On matrix A, we have traversed the P-ordered block 0, which means that our last access was to the last element in this block. The \(Q\,+=\,\mathit{QP}\) scheme runs in ‘ + ’ direction on A (see Table 13.1), such that the first access to the Q-ordered block 1 in A will be to the first element, which is of course a direct neighbour of the last element of block 0.

  • In matrix C, we have the identical situation: the P-ordered block 0 has been traversed in ‘ + ’ direction, and the next access will be to the first element in the Q-ordered block 1.

  • In matrix B, both execution sequences work on the P-ordered block 0. However, while the \(P\,+=\,\mathit{PP}\) scheme accesses B in ‘ + ’ direction, the \(Q\,+=\,\mathit{QP}\) scheme will run in ‘ − ’ direction on B. Hence, the last access of the \(P\,+=\,\mathit{PP}\) scheme is to the last element of the block, which will also be the start of the \(Q\,+=\,\mathit{QP}\) scheme.

Hence, the increment/decrement property stays valid between the two block operations. A careful analysis (which is too tedious to be presented here) reveals that this holds for all recursive block operations occurring in our multiplication scheme.

As a result, the Peano matrix multiplication can be implemented as illustrated in Algorithm 13.6. There, the schemes \(P\,+=\,\mathit{PP}\), \(Q\,+=\,\mathit{QP}\), etc., are coded via the ‘ + ’- or ‘ − ’-directions of the execution orders for the three matrices, as listed in Table 13.1. The change of direction throughout the recursive calls is coded in the parameters phsA, phsB, and phsC.
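
Algorithm 13.6 itself is not reproduced here. As a minimal, hedged illustration of the increment/decrement idea, the following sketch hard-codes the operation sequence of Eq. (13.2) for a single 3 × 3 (leaf-level) multiplication on Peano-ordered arrays; the table seq is transcribed from Eq. (13.2), and consecutive entries differ by at most one in every component. The full recursive scheme with the direction parameters phsA, phsB, and phsC is not shown.

    /* Operation sequence of Eq. (13.2): entry t gives the Peano-order element
     * indices (c, a, b) of the update C[c] += A[a]*B[b].  Between consecutive
     * entries, each index changes by at most +/-1 (increment/decrement property). */
    static const int seq[27][3] = {
        {0,0,0},{1,1,0},{2,2,0},{2,3,1},{1,4,1},{0,5,1},{0,6,2},{1,7,2},{2,8,2},
        {3,8,3},{4,7,3},{5,6,3},{5,5,4},{4,4,4},{3,3,4},{3,2,5},{4,1,5},{5,0,5},
        {6,0,6},{7,1,6},{8,2,6},{8,3,7},{7,4,7},{6,5,7},{6,6,8},{7,7,8},{8,8,8}
    };

    /* C += A*B for 3 x 3 matrices whose elements are stored in Peano order
     * (pattern P, cf. Eq. (13.1)). */
    void peano_mult_3x3(double *C, const double *A, const double *B) {
        for (int t = 0; t < 27; t++)
            C[seq[t][0]] += A[seq[t][1]] * B[seq[t][2]];
    }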

The increment/decrement property leads to an inherently local memory access pattern during the Peano matrix multiplication, which is illustrated in Fig. 13.6. There, we can observe certain locality properties of the memory access:
  • A given range of contiguous operations (highlighted in the left image) will access only a certain contiguous subset of matrix elements.

  • Vice versa, a contiguous subset of matrix elements will be accessed by a set of operations that consists of only a few contiguous sequences.

Fig. 13.6

Memory access pattern for all three matrices during a 9 ×9 Peano matrix multiplication. The highlighted areas are obtained by partitioning the operations (left image) or the accessed elements of C (right image)

Both the operation subsets and the element ranges define partitions that can be used to parallelise the Peano matrix multiplication (following a work-oriented or an owner-computes partitioning, respectively). However, the underlying locality properties also make the algorithm inherently cache efficient, as we will examine in the following section.

13.3.3 Locality Properties and Cache Efficiency

Figure 13.6 illustrates the spatial locality properties of the new matrix multiplication. From the respective chart, we can infer a so-called access locality function, \(L_{M}(n)\), which we can define as the maximal possible distance (in memory) between two elements of a matrix M that are accessed within n contiguous operations.

For a loop-based implementation of matrix multiplication, we will typically access a range of k successive elements during k operations – which already requires some optimisations in order to avoid stride-n accesses. For the naive implementation given in Algorithm 13.4, we thus have \(L_{M}(n) \geq n\) for all involved matrices. For an improved algorithm that operates on matrix blocks of size k × k, we will typically obtain loops over k contiguous elements, so the worst case will reduce to \(L_{M}(k) \geq k\) with k ≪ n. However, as long as we stay within a k × k block, we will perform \(k^{3}\) operations on only \(k^{2}\) elements of a matrix. Hence, if our storage scheme for the matrices uses k × k blocks as well, and stores their elements contiguously, we perform \(k^{3}\) operations on \(k^{2}\) contiguous elements. The best case that could be achieved for the access locality function should therefore be \(L_{M}(n) \approx n^{2/3}\). Thus, even a blocked approach has the following limitations on \(L_{M}\):
  • We have only one block size, \(k_{0}\), that can lead to the optimal locality – this could be cured if we change to a recursive blocking scheme (a first step towards space-filling curves).

  • The locality will still be \(L_{M}(k) \geq k\) if we are within the smallest block.

  • Between two blocks, good locality is only obtained if two successively accessed blocks are stored contiguously in memory.

The recursive blocking and the last property (contiguity) are exactly what the Peano multiplication scheme achieves. We can therefore obtain \(L(k) \in \mathcal{O}(k^{2/3})\) as an upper bound on the extent of the accessed index range for any value of k.

While we have \(L(k) = k^{2/3}\) if we exactly hit the block boundaries, i.e., if we compute the distances for the first element operations of two consecutive block multiplications, we obtain a slightly worse ratio for arbitrary element operations. For a tight estimate, we need to determine the longest streak of block multiplications in the Peano algorithm that does not re-use a matrix block. For matrix B, whose blocks are always reused three times in a row, the longest streak is two such block multiplications: for three consecutive block multiplications, either the first two or the last two work on the same B-block. In the scheme for matrix A, up to nine consecutive block multiplications can occur until a block is re-used. For matrix C, streaks of three contiguous block operations occur. During the recursion, though, two such streaks can occur right after each other. For matrix A, we therefore obtain 18 as the length of the longest streak. In the worst case, we can therefore perform \(18k^{3}\) element operations on \(18k^{2}\) contiguous elements of A (where \(k \times k\) is the block size). Thus, for matrix A, we get that
$$L_{A}(n) \approx \frac{18}{18^{2/3}}\, n^{2/3} = \sqrt[3]{18}\; n^{2/3}.$$
(13.5)
For matrix B, the maximum streak length remains 2, and for matrix C we get a maximum streak length of 6, and therefore
$$L_{B}(n) \approx \sqrt[3]{2}\; n^{2/3}, \qquad L_{C}(n) \approx \sqrt[3]{6}\; n^{2/3}.$$
(13.6)
If we only consider \(\mathcal{O}(n^{3})\)-algorithms for matrix multiplication, i.e., if we disregard Strassen’s algorithm and similar approaches, then a locality of \(L_{M}(n) \in \mathcal{O}(n^{2/3})\) is the optimum we can achieve. The locality functions \(L_{A}(n) \leq 3n^{2/3}\), \(L_{B}(n) \leq 2n^{2/3}\), and \(L_{C}(n) \leq 2n^{2/3}\) are therefore asymptotically optimal. \(L_{B}\) and \(L_{C}\), in addition, involve very low constants.

13.3.4 Cache Misses on an Ideal Cache

The access locality functions provide us with an estimate on how many operations will be performed on a given set of matrix elements. If a set of elements is given by a cache line or even the entire cache, we can estimate the number of cache hits and cache misses of the algorithm. To simplify that calculation, we use the model of an ideal cache, which obeys the following assumptions:
  • The cache consists of M words that are organized as cache lines of L words each. Hence, we have M ∕ L cache lines. The external memory can be of arbitrary size and is structured into memory lines of L words.

  • The cache is fully associative, i.e., can load any memory line into any cache line.

  • If lines need to be evicted from the cache, we assume that the cache can “foresee the future”, and will always evict a line that is no longer needed, or whose next access lies farthest in the future.

Assume that we are about to start a k × k block multiplication, where k is the largest power of 3 such that three k × k matrix blocks fit into the cache. Hence, \(3 \cdot k^{2} < M\), but \(3 \cdot (3k)^{2} > M\), or
$$\frac{1} {3}\sqrt{\frac{M} {3}} < k < \sqrt{\frac{M} {3}}.$$
(13.7)
The cache will usually (except at the start) be occupied by blocks from previous block multiplications; however, one of the three involved blocks will be reused and will thus already be stored in the cache. To perform the following element operations, we will successively have to fetch all cache lines that contain the elements of the two new matrix blocks. The ideal cache strategy will ensure that the matrix block that is reused from the previous block multiplication is not evicted from the cache. Instead, cache lines from the other two blocks will be replaced by the elements for the new block operation. We can also be sure that elements of these new blocks will not be evicted during the entire block operation. Hence, there will be only \(2\left\lceil \frac{k^{2}}{L}\right\rceil\) cache line transfers for these two k × k blocks – to simplify the following computation, we assume that \(\left\lceil \frac{k^{2}}{L}\right\rceil = \frac{k^{2}}{L}\).
As we will perform \((n/k)^{3}\) such block operations, the total number of cache line transfers throughout an n × n multiplication will be
$$T(n) = \left(\frac{n}{k}\right)^{3} \cdot 2 \cdot \frac{k^{2}}{L} = \frac{2n^{3}}{kL} \leq \frac{2n^{3}}{\frac{1}{3}\sqrt{\frac{M}{3}}\, L} = 6\sqrt{3}\, \frac{n^{3}}{L\sqrt{M}}.$$
(13.8)
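As an illustrative plug-in of assumed numbers (not taken from the text): for a 2 MB cache holding \(M = 262{,}144\) double-precision words, organised in 64-byte lines of \(L = 8\) words, Eq. (13.8) predicts roughly \(6\sqrt{3}\, n^{3}/(8 \cdot 512) \approx 0.0025\, n^{3}\) cache line transfers – about one transfer per 400 element updates.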
For arbitrary sizes of the cache lines, we still have \(T(n) \in \mathcal{O}\left(\frac{n^{3}}{L\sqrt{M}}\right)\), with a constant close to \(6\sqrt{3}\). Note that the respective calculation also works if we have multiple levels of cache memory: the estimate of cache line transfers then refers to the respective size M of the different caches. For realistic caches, we might obtain more cache transfers because of a bad replacement strategy. However, a cache that always evicts the cache line that was used longest ago (“least recently used” strategy) will expel those cache lines first that contain matrix elements that are farthest away in memory, because the access pattern of the multiplication ensures that all matrix elements that are “closer” in memory have been accessed more recently. What we cannot consider at this point are effects due to limited associativity of the cache, i.e., if a cache can place memory lines only into a specific set of cache lines.

13.3.5 Multiplying Matrices of Arbitrary Size

For practical implementations, it turns out that regular CPUs will not run at full performance, if we use a fully recursive implementation. For small matrix blocks, loop-based implementations are much more efficient. One of the reasons is that vector-computing extensions of CPUs can then be exploited. We should therefore stop the recursion on a certain block size k ×k, which is chosen to respect such hardware properties. To extend our algorithm for matrices of arbitrary size, we then have three options:
  1. We can embed the given matrices into larger matrices of size \(3^{p} \times 3^{p}\) (or, with blocks as leaves: \(3^{p}k \times 3^{p}k\)). The additional zeros should, of course, not be stored. A respective approach is described in Sect. 13.4, where such a padding approach is used for band matrices or sparse matrices in general. In such an approach, we will typically stop the recursion on matrix blocks as soon as these fit into the innermost cache of a CPU.

  2. In Sect. 8.2.3, we introduced Peano iterations on 3D grids of size \(k \times l \times m\), where k, l, and m may be arbitrary odd numbers. The Peano matrix multiplication also works on such Peano iterations. For the leaf-block operations, we have to introduce schemes for matrix multiplication on \(n_{1} \times n_{2}\) blocks, where \(n_{1}\) and \(n_{2}\) may be 3, 5, or 7, respectively. Actually, the scheme will work for leaf blocks of any odd size.

  3. We can stick to the classical 3 × 3 recursion, if we stop the recursion on larger block sizes. If these leaf blocks are stored in regular row- or column-major order (compare the approach in Sect. 13.4), we just need to specify an upper limit for their size, and can then use an implementation that is optimised for small (but arbitrary) matrix sizes.

Which approach performs best will depend on the specific scenario. The first approach is interesting, for example, if we can choose the size of the leaf-level matrix block such that three such blocks fit exactly into the innermost cache. The constant factor for the number of cache line transfers, as estimated in Eq. (13.8), will then be further reduced. The third approach is especially interesting for coarse-grain parallel computations. There, the leaf-level block multiplications will be implemented by a call to a sequential library, so that the exact size of the leaf blocks has less influence. See the works indicated in the references section for more details.

13.4 Sparse Matrices and Space-Filling Curves

Sparse matrices are matrices that contain so many zero elements that it becomes worthwhile to change to different data structures and algorithms to store and process them (see Footnote 2). Considering our previous experience with space-filling-curve data structures for matrices and arrays, but also for adaptively refined grids, we can try to use a quadtree-type structure to store matrices that contain a lot of zeros. As illustrated in Fig. 13.7, we recursively subdivide a matrix into smaller and smaller blocks. The substructuring scheme can be a quadtree scheme, but due to our previously introduced Peano algorithm, we again use a 3 × 3 refinement, i.e., a \(3^{2}\)-spacetree. Once a matrix block consists entirely of zero elements, we can stop the refinement, mark the respective node as a zero-block leaf, and thus avoid storing the individual elements. For blocks that contain non-zero elements, we stop the recursion on blocks of a certain minimal size. On such blocks, we either store a dense matrix (possibly even in row- or column-major order) or a small sparse matrix (using a respective simple storage scheme). The tree structure for such a storage scheme is also illustrated in Fig. 13.7.
Fig. 13.7

Illustration of a spacetree storage scheme for sparse matrices. The tree representation of the sparsity pattern is sequentialised according to a modified depth-first scheme, where information on the child subtrees is stored within the parent node

The matrix blocks, either sparse blocks or dense blocks, are stored in the sequential order given by the Peano curve. However, in contrast to the Peano scheme for dense matrices, the matrix blocks will now have varying size – because of the different sparsity patterns of the individual blocks, but also because the zero blocks are not stored at all. Hence, to access a specific block, it will usually be necessary to traverse the sparsity tree from its root. To issue recursive calls on one or several child matrix blocks, a parent node needs information on the exact position of all child blocks in the data structure. Thus, in a node, we will store the start addresses of all nine child blocks of the matrix. Zero blocks are automatically covered by this scheme by just storing two consecutive identical start addresses. As we typically need both start and end addresses for the children, we also need to store the end address of the last block. Hence, for every parent node, we will store a sequence of ten integers, as indicated by the data stream in Fig. 13.7. We thus obtain a data structure that follows the same idea as the modified depth-first traversal that was introduced for the refinement trees of adaptive grids in Sect. 10.5.
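
A minimal sketch of such a parent node is given below; the type and field names are illustrative assumptions, not taken from the book.

    /* One inner node of the 3^2-spacetree sparsity structure: ten stream positions
     * delimit the nine child blocks within the Peano-ordered data stream. */
    typedef struct {
        long child_start[10];   /* child b occupies stream positions child_start[b] ..
                                   child_start[b+1]-1; two identical consecutive
                                   entries mark a zero block that is not stored      */
    } SparsityNode;

    /* Length (in stream entries) of child block b; zero for an all-zero block. */
    static long child_length(const SparsityNode *node, int b) {
        return node->child_start[b + 1] - node->child_start[b];
    }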

The sparsity structure information and the matrix elements of the leaf blocks can either be stored in a single data stream or in two separate streams. If we choose to store the elements together with the structure, we only need to change the data structure for the leaf blocks. These blocks will then require information on the block size and type (dense or sparse), as well as on the extent of the block in terms of bytes in memory. If the matrix elements are stored in a separate stream, the nodes directly above the leaf level need to store pointers to the start of the respective blocks in the element stream.

As already indicated in Sect. 13.3.5, we can also use the presented sparse-matrix data structure for matrices that are only dense (or comparably dense) in certain parts of the matrix – one simple example would be band matrices. In that case, we can further simplify the data structure by allowing only dense matrix blocks in the leaves.

13.5 References and Further Readings

The Peano-curve algorithm for matrix multiplication was introduced in [24], where we particularly discussed the cache properties of the algorithm. The locality properties, as given in Eqs. (13.5) and (13.8), are asymptotically optimal – respective lower bounds were proven by Hong and Kung [132]. Hardware-oriented, efficient implementations of the algorithm, including the parallelisation on shared-memory multicore platforms, were presented in [21, 127]. In [19], we discussed the parallelisation of the algorithm using message passing on distributed-memory systems. There, the result for cache line transfers in the ideal-cache model can be used to estimate the number of transferred matrix blocks in a distributed-memory setting. The extension for sparse matrices was presented in [22], which also includes a discussion of LU decomposition based on the Peano matrix multiplication.

Since the advent of cache-based architectures, improving the cache efficiency of linear algebra operations has been an active area of research. Blocking approaches to improve the cache efficiency were already applied in the first level-3 BLAS routines when these were introduced by Dongarra et al. [78]. Blocking approaches and the block-oriented matrix operations were also a driving factor in the development of LAPACK [13], whose implementation was consequently based on exploiting the higher performance of BLAS 3 routines. Since then, blocking and tiling of matrix operations has become a standard technique for high performance libraries. The ATLAS project [267] uses automatic tuning of blocking sizes to the available memory hierarchy, and GotoBLAS [104] explicitly considers the translation lookaside buffer (TLB) and even virtual memory as further cache levels [103].

Block matrix layouts to improve cache efficiency were introduced by Chatterjee et al. [65], who studied Morton order and 4D tiled arrays (i.e., 2D arrays of matrix blocks), and by Gustavson [116], who demonstrated that recursive blocking automatically leads to cache-efficient algorithms. Frens and Wise [90] used quadtree decompositions of dense matrices and respective recursive implementations of matrix multiplication and of QR decomposition [91]. The term cache-oblivious for such inherently cache-efficient algorithms was introduced by Frigo et al. [92]. For a review of both cache-oblivious and cache-aware algorithms in linear algebra, see Elmroth et al. [81]. For further recent work, see, e.g., [106, 255]. Yotov et al. [112, 281] compared the performance of cache-oblivious algorithms with carefully tuned cache-aware approaches, and identified efficient prefetching of matrix blocks as a crucial question for recursive algorithms. For the Peano algorithm, the increment/decrement access to blocks apparently solves this problem. In any case, block-oriented data structures and algorithms are more and more considered a necessity in the design of linear algebra routines. The designated LAPACK successor PLASMA [56, 57], for example, aims for a stronger block orientation, and similar considerations drive the FLAME project [113]. As libraries will often depend on row-major or column-major storage, changing the matrix storage to blocked layouts on the fly becomes necessary; respective in-place format conversions were studied in [115].

For sparse matrices, quadtree data structures were, for example, examined by Wise et al. [271, 272]. Hybrid “hypermatrices” that mix dense and sparse blocks in recursively structured matrices were discussed by Herrero et al. [130]. Haase [117] used a Hilbert-order data structure for sparse-matrix-vector multiplication to improve cache efficiency.

Chapter 14 will present a cache-oblivious approach to solve partial differential equations on adaptive discretisation grids – hence, we save all further references to work on cache oblivious algorithms, be it grid-based or other simulation approaches, for the references section of Chap. 14.

13.6 Exercises

13.1 Try to determine the number of cache misses caused by Algorithm 13.3, following the same cache model as in Sect. 13.3.4. Focus, in particular, on the misses caused by the access to the vectors x and y.

13.2 Use the graph-based illustration of Eq. (13.2) to derive the execution orders for the matrix multiplication schemes \(Q\,+=\,\mathit{QP}\), \(R\,+=\,\mathit{PR}\), \(S\,+=\,\mathit{QR}\), etc.

13.3 Consider an algorithm that uses 2D Morton order to store the matrix elements, and 3D Morton order to execute the individual element operations of matrix multiplication. Analyse the number of cache misses caused by this algorithm, using a cache model and computation as in Sect. 13.3.4.

13.4 As a data structure for sparse matrices, we could also refine the sparsity tree down to single elements, i.e., not stop the recursion on larger blocks. Give an estimate of how much memory we would require to store a sparse matrix (make suitable assumptions on the sparsity pattern or sparsity tree, if necessary).

Footnotes

  1. Currently, here, means in the year 2010.

  2. Following a definition given by Wilkinson.

References

  1. D. J. Abel and D. M. Mark. A comparative analysis of some two-dimensional orderings. International Journal of Geographical Information Systems, 4(1):21–31, 1990.
  2. D. J. Abel and J. L. Smith. A data structure and algorithm based on a linear key for a rectangle retrieval problem. Computer Vision, Graphics, and Image Processing, 24:1–13, 1983.
  3. K. Abend, T. J. Harley, and L. N. Kanal. Classification of binary random patterns. IEEE Transactions on Information Theory, IT-11(4):538–544, 1965.
  4. M. Aftosmis, M. Berger, and J. Melton. Robust and efficient Cartesian mesh generation for component-based geometry. In 35th Aerospace Sciences Meeting and Exhibit, 1997. AIAA 1997-0196.
  5. M. J. Aftosmis, M. J. Berger, and S. M. Murman. Applications of space-filling curves to Cartesian methods for CFD. AIAA Paper 2004-1232, 2004.
  6. J. Akiyama, H. Fukuda, H. Ito, and G. Nakamura. Infinite series of generalized Gosper space filling curves. In CJCDGCGT 2005, volume 4381 of Lecture Notes in Computer Science, pages 1–9, 2007.
  7. I. Al-Furaih and S. Ranka. Memory hierarchy management for iterative graph structures. In Parallel Processing Symposium, IPPS/SPDP 1998, pages 298–302. IEEE Computer Society, 1998.
  8. F. Alauzet and A. Loseille. On the use of space filling curves for parallel anisotropic mesh adaptation. In Proceedings of the 18th International Meshing Roundtable, pages 337–357. Springer, 2009.
  9. J. Alber and R. Niedermeier. On multidimensional curves with Hilbert property. Theory of Computing Systems, 33:295–312, 2000.
  10. S. Aluru and F. E. Sevilgen. Parallel domain decomposition and load balancing using space-filling curves. In Proceedings of the Fourth International Conference on High-Performance Computing, pages 230–235. IEEE Computer Society, 1997.
  11. N. Amenta, S. Choi, and G. Rote. Incremental constructions con BRIO. In Proceedings of the Nineteenth ACM Symposium on Computational Geometry, pages 211–219, 2003.
  12. M. Amor, F. Arguello, J. López, O. Plata, and E. L. Zapata. A data-parallel formulation for divide and conquer algorithms. The Computer Journal, 44(4):303–320, 2001.
  13. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. Technical Report CS-90-105, LAPACK Working Note #20, University of Tennessee, Knoxville, TN, 1990.
  14. J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 227(10):5342–5359, 2008.
  15. A. Ansari and A. Fineberg. Image data ordering and compression using Peano scan and LOT. IEEE Transactions on Consumer Electronics, 38(3):436–445, 1992.
  16. L. Arge, M. De Berg, H. Haverkort, and K. Yi. The priority R-tree: A practically efficient and worst-case optimal R-tree. ACM Transactions on Algorithms, 4(1):9:1–9:29, 2008.
  17. D. N. Arnold, A. Mukherjee, and L. Pouly. Locally adapted tetrahedral meshes using bisection. SIAM Journal on Scientific Computing, 22(2):431–448, 2000.
  18. T. Asano, D. Ranjan, T. Roos, E. Welzl, and P. Widmayer. Space-filling curves and their use in the design of geometric data structures. Theoretical Computer Science, 181(1):3–15, 1997.
  19. M. Bader. Exploiting the locality properties of Peano curves for parallel matrix multiplication. In Proceedings of the Euro-Par 2008, volume 5168 of Lecture Notes in Computer Science, pages 801–810, 2008.
  20. M. Bader, C. Böck, J. Schwaiger, and C. Vigh. Dynamically adaptive simulations with minimal memory requirement – solving the shallow water equations using Sierpinski curves. SIAM Journal on Scientific Computing, 32(1):212–228, 2010.
  21. M. Bader, R. Franz, S. Guenther, and A. Heinecke. Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves. In Parallel Processing and Applied Mathematics, 7th International Conference, PPAM 2007, volume 4967 of Lecture Notes in Computer Science, pages 628–638, 2008.
  22. M. Bader and A. Heinecke. Cache oblivious dense and sparse matrix multiplication based on Peano curves. In Proceedings of the PARA ’08, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, volume 6126/6127 of Lecture Notes in Computer Science, 2010. In print.
  23. M. Bader, S. Schraufstetter, C. A. Vigh, and J. Behrens. Memory efficient adaptive mesh generation and implementation of multigrid algorithms using Sierpinski curves. International Journal of Computational Science and Engineering, 4(1):12–21, 2008.
  24. M. Bader and C. Zenger. Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra and Its Applications, 417(2–3):301–313, 2006.
  25. M. Bader and C. Zenger. Efficient storage and processing of adaptive triangular grids using Sierpinski curves. In Computational Science – ICCS 2006, volume 3991 of Lecture Notes in Computer Science, pages 673–680, 2006.
  26. Y. Bandou and S.-I. Kamata. An address generator for an n-dimensional pseudo-Hilbert scan in a hyper-rectangular parallelepiped region. In International Conference on Image Processing, ICIP 2000, pages 707–714, 2000.
  27. R. Bar-Yehuda and C. Gotsman. Time/space tradeoffs for polygon mesh rendering. ACM Transactions on Graphics, 15(2):141–152, 1996.
  28. J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, 1986.
  29. J. J. Bartholdi III and P. Goldsman. Vertex-labeling algorithms for the Hilbert spacefilling curve. Software: Practice and Experience, 31(5):395–408, 2001.
  30. J. J. Bartholdi III and P. Goldsman. Multiresolution indexing of triangulated irregular networks. IEEE Transactions on Visualization and Computer Graphics, 10(3):1–12, 2004.
  31. J. J. Bartholdi III and L. K. Platzman. An O(N log N) planar travelling salesman heuristic based on spacefilling curves. Operations Research Letters, 1(4):121–125, 1982.
  32. J. J. Bartholdi III and L. K. Platzman. Heuristics based on spacefilling curves for combinatorial problems in the Euclidian space. Management Science, 34(3):291–305, 1988.
  33. A. C. Bauer and A. K. Patra. Robust and efficient domain decomposition preconditioners for adaptive hp finite element approximations of linear elasticity with and without discontinuous coefficients. International Journal for Numerical Methods in Engineering, 59:337–364, 2004.
  34. K. E. Bauman. The dilation factor of the Peano–Hilbert curve. Mathematical Notes, 80(5):609–620, 2006. Translated from Matematicheskie Zametki, vol. 80, no. 5, 2006, pp. 643–656.
  35. R. Bayer. The universal B-tree for multidimensional indexing. Technical Report TUM-I9637, Institut für Informatik, Technische Universität München, 1996.
  36. J. Behrens. Multilevel optimization by space-filling curves in adaptive atmospheric modeling. In F. Hülsemann, M. Kowarschik, and U. Rüde, editors, Frontiers in Simulation – 18th Symposium on Simulation Techniques, pages 186–196. SCS Publishing House, Erlangen, 2005.
  37. J. Behrens. Adaptive atmospheric modeling: key techniques in grid generation, data structures, and numerical operations with applications, volume 54 of Lecture Notes in Computational Science and Engineering. Springer, 2006.
  38. J. Behrens and M. Bader. Efficiency considerations in triangular adaptive mesh refinement. Philosophical Transactions of the Royal Society A, 367:4577–4589, 2009. Theme Issue ‘Mesh generation and mesh adaptation for large-scale Earth-system modelling’.
  39. J. Behrens, N. Rakowsky, W. Hiller, D. Handorf, M. Läuter, J. Päpke, and K. Dethloff. Parallel adaptive mesh generator for atmospheric and oceanic simulation. Ocean Modelling, 10:171–183, 2005.
  40. J. Behrens and J. Zimmermann. Parallelizing an unstructured grid generator with a space-filling curve approach. In Euro-Par 2000 Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 815–823, 2000.
  41. D. Bertsimas and M. Grigni. Worst-case example for the spacefilling curve heuristics for the Euclidian traveling salesman problem. Operations Research Letters, 8:241–244, 1989.
  42. T. Bially. Space-filling curves: Their generation and their application to bandwidth reduction. IEEE Transactions on Information Theory, IT-15(6):658–664, 1969.
  43. E. Bänsch. Local mesh refinement in 2 and 3 dimensions. IMPACT of Computing in Science and Engineering, 3(3):181–191, 1991.
  44. A. Bogomjakov and C. Gotsman. Universal rendering sequences for transparent vertex caching of progressive meshes. Computer Graphics Forum, 21(2):137–148, 2002.
  45. E. Borel. Elements de la Theorie des Ensembles. Editions Albin Michel, Paris, 1949. Note IV: La courbe de Péano.
  46. V. Brázdová and D. R. Bowler. Automatic data distribution and load balancing with space-filling curves: implementation in CONQUEST. Journal of Physics: Condensed Matter, 20, 2008.
  47. G. Breinholt and Ch. Schierz. Algorithm 781: Generating Hilbert’s space-filling curve by recursion. ACM Transactions on Mathematical Software, 24(2):184–189, 1998.
  48. M. Brenk, H.-J. Bungartz, M. Mehl, I. L. Muntean, T. Neckel, and T. Weinzierl. Numerical simulation of particle transport in a drift ratchet. SIAM Journal on Scientific Computing, 30(6):2777–2798, 2008.
  49. K. Brix, S. Sorana Melian, S. Müller, and G. Schieffer. Parallelisation of multiscale-based grid adaption using space-filling curves. ESAIM: Proceedings, 29:108–129, 2009.
  50. K. Buchin. Constructing Delaunay triangulations along space-filling curves. In ESA 2009, volume 5757, pages 119–130, 2009.
  51. E. Bugnion, T. Roos, R. Wattenhofer, and P. Widmayer. Space filling curves versus random walks. In Algorithmic Foundations of Geographic Information Systems, volume 1340, pages 199–211, 1997.
  52. H.-J. Bungartz, M. Mehl, T. Neckel, and T. Weinzierl. The PDE framework Peano applied to fluid dynamics: an efficient implementation of a parallel multiscale fluid dynamics solver on octree-like adaptive Cartesian grids. Computational Mechanics, 46(1):103–114, 2010.
  53. H.-J. Bungartz, M. Mehl, and T. Weinzierl. A parallel adaptive Cartesian PDE solver using space-filling curves. In E. W. Nagel, V. W. Walter, and W. Lehner, editors, Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, volume 4128 of Lecture Notes in Computer Science, pages 1064–1074, 2006.
  54. C. Burstedde, O. Ghattas, M. Gurnis, G. Stadler, E. Tan, T. Tu, L. C. Wilcox, and S. Zhong. Scalable adaptive mantle convection simulation on petascale supercomputers. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–15. IEEE Press, 2008.
  55. C. Burstedde, L. C. Wilcox, and O. Ghattas. p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees. SIAM Journal on Scientific Computing, 33(3):1103–1133, 2011.
  56. A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The impact of multicore on math software. In Applied Parallel Computing, State of the Art in Scientific Computing, volume 4699 of Lecture Notes in Computer Science, pages 1–10, 2007.
  57. A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS-07-600, LAPACK Working Note #191, ICL, University of Tennessee, 2007.
  58. A. R. Butz. Convergence with Hilbert’s space filling curve. Journal of Computer and System Sciences, 3:128–146, 1969.
  59. A. R. Butz. Alternative algorithm for Hilbert’s space-filling curve. IEEE Transactions on Computers, C-20(4):424–426, 1971.
  60. A. C. Calder, B. C. Curtis, L. J. Dursi, B. Fryxell, P. MacNeice, K. Olson, P. Ricker, R. Rosner, F. X. Timmes, H. M. Tufo, J. W. Turan, M. Zingale, and G. Henry. High performance reactive fluid flow simulations using adaptive mesh refinement on thousands of processors. In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, page 56. IEEE Computer Society, 2000.
  61. P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Terescoy. Dynamic octree load balancing using space-filling curves. Technical Report CS-03-01, Williams College Department of Computer Science, 2003.
  62. X. Cao and Z. Mo. A new scalable parallel method for molecular dynamics based on cell-block data structure. In Parallel and Distributed Processing and Applications, 3358, pages 757–764, 2005.
  63. J. Castro, M. Georgiopoulos, R. Demara, and A. Gonzalez. Data-partitioning using the Hilbert space filling curves: Effect on the speed of convergence of Fuzzy ARTMAP for large database problems. Neural Networks, 18:967–984, 2005.
  64. E. Cesaro. Remarques sur la courbe de von Koch. Atti della R. Accad. della Scienze fisiche e matem. Napoli, 12(15):1–12, 1905. Reprinted in: Opere scelte, a cura dell’Unione matematica italiana e col contributo del Consiglio nazionale delle ricerche, Vol. 2: Geometria, analisi, fisica matematica. Rome: Edizioni Cremonese, pp. 464–479, 1964.
  65. S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In International Conference on Supercomputing (ICS’99). ACM, New York, 1999.
  66. H.-L. Chen and Y.-I. Chang. Neighbor-finding based on space-filling curves. Information Systems, 30(3):205–226, 2005.
  67. G. Chochia, M. Cole, and T. Heywood. Implementing the hierarchical PRAM on the 2d mesh: analyses and experiments. IEEE Symposium on Parallel and Distributed Processing, 0:587–595, 1995. Preprint on http://homepages.inf.ed.ac.uk/mic/Pubs/ECS-CSG-10-95.ps.gz.
  68. W. J. Coirier and K. G. Powell. Solution-adaptive Cartesian cell approach for viscous and inviscid fluid flows. AIAA Journal, 34(5):938–945, 1996.
  69. A. J. Cole. A note on space-filling curves. Software: Practice and Experience, 13:1181–1189, 1983.
  70. S. Craver, B.-L. Yeo, and M. Yeung. Multilinearization data structure for image browsing. In Storage and Retrieval for Image and Video Databases VII, volume 3656, pages 155–166, 1998.
  71. R. Dafner, D. Cohen-Or, and Y. Matias. Context-based space filling curves. Computer Graphics Forum, 19(3):209–218, 2000.
  72. 72.
    K. Daubner. Geometrische Modellierung mittels Oktalbäumen und Visualisierung von Simulationsdaten aus der Strömungsmechanik. Institut für Informatik, Technische Universität München, 2005.Google Scholar
  73. 73.
    L. De Floriani and E. Puppo. Hierarchical triangulation for multiresolution surface description. ACM Transactions on Graphics, 14(4):363–411, 1995.Google Scholar
  74. 74.
    J. M. Dennis. Partitioning with space-filling curves on the cubed-sphere. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, pages 269–275. IEEE Computer Society, 2003.Google Scholar
  75. 75.
    J. M. Dennis. Inverse space-filling curve partitioning of a global ocean model. In Parallel and Distributed Processing Symposium, IPDPS 2007, pages 1–10. IEEE International, 2007.Google Scholar
  76. 76.
    K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Terescoy, J. Falk, J. E. Flaherty, and L. G. Gervasio. New challenges in dynamic load balancing. Applied Numerical Mathematics, 52:133–152, 2005.Google Scholar
  77. 77.
    R. Diekmann, J. Hungershöfer, M. Lux, M. Taenzer, and J.-M. Wierum. Using space filling curves for efficient contact searching. In 16th IMACS World Congress, 2000.Google Scholar
  78. 78.
    J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–28, 1990.Google Scholar
  79. 79.
    J. Dreher and R. Grauer. Racoon: A parallel mesh-adaptive framework for hyperbolic conservation laws. Parallel Computing, 31(8–9):913–932, 2005.Google Scholar
  80. 80.
    M. Duchaineau, M. Wolinsky, D. E. Sigeti, M. C. Miller, C. Aldrich, and M. B. Mineev-Weinstein. ROAMing terrain: real-time optimally adapting meshes. In VIS ’97: Proceedings of the 8th Conference on Visualization ’97, pages 81–88. IEEE Computer Society Press, 1997.Google Scholar
  81. 81.
    E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3–45, 2004.Google Scholar
  82. 82.
    M. Elshafei and M. S. Ahmed. Fuzzification using space-filling curves. Intelligent Automation and Soft Computing, 7(2):145–157, 2001.Google Scholar
  83. 83.
    M. Elshafei-Ahmed. Fast methods for split codebooks. Signal Processing, 80:2553–2565, 2000.Google Scholar
  84. 84.
    W. Evans, D. Kirkpatrick, and G. Townsend. Right-triangulated irregular networks. Algorithmica, 30(2):264–286, 2001.Google Scholar
  85. 85.
    C. Faloutsos. Analytical results on the quadtree decomposition of arbitrary rectangles. Pattern Recognition Letters, 13:31–40, 1992.Google Scholar
  86. 86.
    C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 247–252, 1989.Google Scholar
  87. 87.
    R. Finkel and Bentley J. L. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9, 1974.Google Scholar
  88. 88.
    J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Terescoy, and L. H. Ziantz. Adaptive local refinement with octree load balancing for the parallel solution of three-dimensional conservation laws. Journal of Parallel and Distributed Computing, 47:139–152, 1997.Google Scholar
  89. 89.
    A. C. Frank. Organisationsprinzipien zur Integration von geometrischer Modellierung, numerischer Simulation und Visualisierung. Herbert Utz Verlag, Dissertation, Institut für Informatik, Technische Universität München, 2000.Google Scholar
  90. 90.
    J. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 206–216, 1997.Google Scholar
  91. 91.
    J. Frens and D. S. Wise. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. In Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 144–154, 2003.Google Scholar
  92. 92.
    M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 285–297, 1999.Google Scholar
  93. 93.
    H. Fukuda, M. Shimizu, and G. Nakamura. New Gosper space filling curves. In Proceedings of the International Conference on Computer Graphics and Imaging (CGIM2001), pages 34–38, 2001.Google Scholar
  94. 94.
    J. Gao and J. M. Steele. General spacefilling curve heuristics and limit theory for the traveling salesman problem. Journal of Complexity, 10:230–245, 1994.Google Scholar
  95. 95.
    M. Gardner. Mathematical games – in which “monster” curves force redefinition of the word “curve”. Scientific American, 235:124–133, Dec. 1976.Google Scholar
  96. 96.
    I. Gargantini. An effective way to represent quadtrees. Communications of the ACM, 25(12):905–910, 1982.Google Scholar
  97. 97.
    T. Gerstner. Multiresolution Compression and Visualization of Global Topographic Data. GeoInformatica, 7(1):7–32, 2003. (shortened version in Proc. Spatial Data Handling 2000, P. Forer, A.G.O. Yeh, J. He (eds.), pp. 14–27, IGU/GISc, 2000, also as SFB 256 report 29, Univ. Bonn, 1999).Google Scholar
  98. 98.
    P. Gibbon, W. Frings, S. Dominiczak, and B. Mohr. Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005, volume 33 of NIC Series, pages 367–374, 2006.Google Scholar
  99. 99.
    W. Gilbert. A cube-filling Hilbert curve. The Mathematical Intelligencer, 6(3):78–79, 1984.Google Scholar
  100. 100.
    J. Gips. Shape Grammars and their Uses. Interdisciplinary Systems Research. Birkhäuser Verlag, 1975.Google Scholar
  101. 101.
    L. M. Goldschlager. Short algorithms for space-filling curves. Software: Practice and Experience, 11:99–100, 1981.Google Scholar
  102. 102.
    M. F. Goodchild and D. M. Mark. The fractal nature of geographic phenomena. Annals of the Association of American Geographers, 77(2):265–278, 1987.Google Scholar
  103. 103.
    K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. FLAME working note #9. Technical Report TR-2002-55, The University of Texas at Austin, Department of Computer Sciences, 2002.Google Scholar
  104. 104.
    K. Goto and R. A. van de Geijn. Anatomy of a high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3):12:1–12:25, 2008.Google Scholar
  105. 105.
    C. Gotsman and M. Lindenbaum. On the metric properties of space-filling curves. IEEE Transactions on Image Processing, 5(5):794–797, 1996.Google Scholar
  106. 106.
    P. Gottschling, D. S. Wise, and A. Joshi. Generic support of algorithmic and structural recursion for scientific computing. International Journal of Parallel, Emergent and Distributed Systems, 24(6):479–503, 2009.Google Scholar
  107. 107.
    M. Griebel and M. A. Schweitzer. A particle-partition of unity method—part IV: Parallelization. In Meshfree Methods for Partial Differential Equations, volume 26 of Lecture Notes in Computational Science and Engineering, pages 161–192, 2002.Google Scholar
  108. 108.
    M. Griebel and G. Zumbusch. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Computing, 27(7):827–843, 1999.Google Scholar
  109. 109.
    M. Griebel and G. Zumbusch. Hash based adaptive parallel multilevel methods with space-filling curves. In H. Rollnik and D. Wolf, editors, NIC Symposium 2001, volume 9 of NIC Series, pages 479–492. Forschungszentrum Jülich, 2002.Google Scholar
  110. 110.
    J. G. Griffiths. Table-driven algorithm for generating space-filling curves. Computer Aided Design, 17(1):37–41, 1985.Google Scholar
  111. 111.
    J. G. Griffiths. An algorithm for displaying a class of space-filling curves. Software: Practice and Experience, 16(5):403–411, 1986.Google Scholar
  112. 112.
    J. Gunnels, F. Gustavson, K. Pingali, and K. Yotov. Is cache-oblivious DGEMM viable? In Applied Parallel Computing. State of the Art in Scientific Computing, volume 4699 of Lecture Notes in Computer Science, pages 919–928, 2007.Google Scholar
  113. 113.
    J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: formal linear algebra methods environment. ACM Transactions on Mathematical Software, 27(4):422–455, 2001.Google Scholar
  114. 114.
    F. Günther, M. Mehl, M. Pögl, and C. Zenger. A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves. SIAM Journal on Scientific Computing, 28(5):1634–1650, 2006.Google Scholar
  115. 115.
    F. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. Transactions on Mathematical Software, 38(3):17:1–17:32, 2012.Google Scholar
  116. 116.
    F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development, 41(6), 1999.Google Scholar
  117. 117.
    G. Haase, M. Liebmann, and G. Plank. A Hilbert-order multiplication scheme for unstructured sparse matrices. International Journal of Parallel, Emergent and Distributed Systems, 22(4):213–220, 2007.Google Scholar
  118. 118.
    C. H. Hamilton and A. Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105:155–163, 2008.Google Scholar
  119. 119.
    H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems, 17(7):606–618, 2006.Google Scholar
  120. 120.
    J. Hartmann, A. Krahnke, and C. Zenger. Cache efficient data structures and algorithms for adaptive multidimensional multilevel finite element solvers. Applied Numerical Mathematics, 58(4):435–448, 2008.Google Scholar
  121. 121.
    A. Haug. Sierpinski-Kurven zur speichereffizienten numerischen Simulation auf adaptiven Tetraedergittern. Diplomarbeit, Fakultät für Informatik, Technische Universität München, 2006.Google Scholar
  122. 122.
    H. Haverkort and F. van Walderveen. Four-dimensional Hilbert curves for R-trees. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), pages 63–73, 2009.Google Scholar
  123. 123.
    H. Haverkort and F. van Walderveen. Locality and bounding-box quality of two-dimensional space-filling curves. Computational Geometry: Theory and Applications, 43(2):131–147, 2010.Google Scholar
  124. 124.
    G. Heber, R. Biswas, and G. R. Gao. Self-avoiding walks over adaptive unstructured grids. Concurrency: Practice and Experience, 12:85–109, 2000.Google Scholar
  125. 125.
    D. J. Hebert. Symbolic local refinement of tetrahedral grids. Journal of Symbolic Computation, 17(5):457–472, 1994.Google Scholar
  126. 126.
    D. J. Hebert. Cyclic interlaced quadtree algorithms for quincunx multiresolution. Journal of Algorithms, 27:97–128, 1998.Google Scholar
  127. 127.
    A. Heinecke and M. Bader. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. In Proceedings of the 2008 Computing Frontiers Conference and co-located workshops: MAW’08 and WRFT’08, pages 385–392. ACM, 2008.Google Scholar
  128. 128.
    B. Hendrickson. Load balancing fictions, falsehoods and fallacies. Applied Mathematical Modelling, 25:99–108, 2000.Google Scholar
  129. 129.
    B. Hendrickson and K. Devine. Dynamic load balancing in computational mechanics. Computer Methods in Applied Mechanical Engineering, 184:485–500, 2000.Google Scholar
  130. 130.
    J. R. Herrero and J. J. Navarro. Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking. Applicable Algebra in Engineering, Communication and Computing, 18(3):279–295, 2007.Google Scholar
  131. 131.
    D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459–460, 1891. Available online on the Göttinger Digitalisierungszentrum.Google Scholar
  132. 132.
    J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebble game. In Proceedings of ACM Symposium on Theory of Computing, pages 326–333, 1981.Google Scholar
  133. 133.
    H. Hoppe. Optimization of mesh locality for transparent vertex caching. In SIGGRAPH ’99: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 269–276. ACM Press/Addison-Wesley Publishing Co., 1999.Google Scholar
  134. 134.
    Y. C. Hu, A. Cox, and W. Zwaenepoel. Improving fine-grained irregular shared-memory benchmarks by data reordering. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, pages # 33. IEEE Computer Society, 2000.Google Scholar
  135. 135.
    J. Hungershöfer and J.-M. Wierum. On the quality of partitions based on space-filling curves. In ICCS 2002, volume 2331 of Lecture Notes in Computer Science, pages 36–45, 2002.Google Scholar
  136. 136.
    G. M. Hunter and K. Steiglitz. Operations on images using quad trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):145–154, 1979.Google Scholar
  137. 137.
    L. M. Hwa, M. A. Duchaineau, and K. i. Joy. Adaptive 4-8 texture hierarchies. In VIS ’04: Proceedings of the Conference on Visualization ’04, pages 219–226. IEEE Computer Society, 2004.Google Scholar
  138. 138.
    C. Jackings and S. L. Tanimoto. Octrees and their use in representing three-dimensional objects. Computer Graphics and Image Processing, 14(31):249–270, 1980.Google Scholar
  139. 139.
    H. V. Jagadish. Linear clustering of objects with multiple attributes. ACM SIGMOD Record, 19(2):332–342, 1990.Google Scholar
  140. 140.
    H. V. Jagadish. Analysis of the Hilbert curve for representing two-dimensional space. Information Processing Letters, 62(1):17–22, 1997.Google Scholar
  141. 141.
    G. Jin and J. Mellor-Crummey. SFCGen: a framework for efficient generation of multi-dimensional space-filling curves by recursion. ACM Transactions on Mathematical Software, 31(1):120–148, 2005.Google Scholar
  142. 142.
    Bentley J. L. and D. F. Stanat. Analysis of range searches in quad trees. Information Processing Letters, 3(6):170–173, 1975.Google Scholar
  143. 143.
    M. Kaddoura, C.-W. Ou, and S. Ranka. Partitioning unstructured computational graphs for nonuniform and adaptive environments. IEEE Concurrency, 3(3):63–69, 1995.Google Scholar
  144. 144.
    C. Kaklamanis and G. Persiano. Branch-and-bound and backtrack search on mesh-connected arrays of processors. Mathematical Systems Theory, 27:471–489, 1994.Google Scholar
  145. 145.
    S. Kamata, R. O. Eason, and Y. Bandou. A new algorithm for n-dimensional Hilbert scanning. IEEE Transactions on Image Processing, 8(7):964–973, 1999.Google Scholar
  146. 146.
    I. Kamel and C. Faloutsos. On packing R-trees. In Proceedings of the Second International ACM Conference on Information and Knowledge Management, pages 490–499. ACM New York, 1993.Google Scholar
  147. 147.
    E. Kawaguchi and T. Endo. On a method of binary-picture representation and its application to data compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(1):27–35, 1980.Google Scholar
  148. 148.
    A. Klinger. Data structures and pattern recognition. In Proceedings of the First International Joint Conference on Pattern Recognition, pages 497–498. IEEE, 1973.Google Scholar
  149. 149.
    A. Klinger and C. R. Dyer. Experiments on picture representation using regular decomposition. Computer Graphics and Image Processing, 5:68–106, 1976.Google Scholar
  150. 150.
    K. Knopp. Einheitliche Erzeugung und Darstellung der Kurven von Peano, Osgood und von Koch. Archiv der Mathematik und Physik, 26:103–115, 1917.Google Scholar
  151. 151.
    I. Kossaczky. A recursive approach to local mesh refinement in two and three dimensions. Journal of Computational and Applied Mathematics, 55:275–288, 1994.Google Scholar
  152. 152.
    R. Kriemann. Parallel -matrix arithmetics on shared memory systems. Computing, 74:273–297, 2005.Google Scholar
  153. 153.
    J. P. Lauzon, D. M. Mark, L. Kikuchi, and J. A. Guevara. Two-dimensional run-encoding for quadtree representation. Computer Vision, Graphics, and Image Processing, 30(1):56–69, 1985.Google Scholar
  154. 154.
    J. K. Lawder and P. J. H. King. Querying multi-dimensional data indexed using the Hilbert space-filling curve. ACM SIGMOD Record, 30(1):19–24, 2001.Google Scholar
  155. 155.
    J. K. Lawder and P. J. H. King. Using state diagrams for Hilbert curve mappings. International Journal of Computer Mathematics, 78:327–342, 2001.Google Scholar
  156. 156.
    D. Lea. Digital and Hilbert k-d-trees. Information Processing Letters, 27:35–41, 1988.Google Scholar
  157. 157.
    H. L. Lebesgue. Leçons sur l’intégration et la recherche des fonctions primitives. Gauthier-Villars, Paris, 1904. Available online on the University of Michigan Historical Math Collection.Google Scholar
  158. 158.
    J.-H. Lee and Y.-C. Hsueh. Texture classification method using multiple space filling curves. Pattern Recognition Letters, 15:1241–1244, 1994.Google Scholar
  159. 159.
    M. Lee and H. Samet. Navigating through triangle meshes implemented as linear quadtrees. ACM Transactions on Graphics, 19(2):79–121, 2000.Google Scholar
  160. 160.
    A. Lempel and J. Ziv. Compression of two-dimensional data. IEEE Transactions on Information Theory, IT-32(1):2–8, 1986.Google Scholar
  161. 161.
    S. Liao, M. A. Lopez, and S. T. Leutenegger. High dimensional similarity search with space filling curves. In Proceedings of the 17th International Conference on Data Engineering, pages 615–622. IEEE Computer Society, 2000.Google Scholar
  162. 162.
    A. Lindenmayer. Mathematical models for cellular interactions in development. Journal of Theoretical Biology, 18:280–299, 1968.Google Scholar
  163. 163.
    P. Lindstrom, D. Koller, W. Ribarsky, L. F. Hodges, N. Faust, and G. A. Turner. Real-time, continuous level of detail rendering of height fields. In SIGGRAPH ’96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 109–118. ACM, 1996.Google Scholar
  164. 164.
    P. Lindstrom and V. Pascucci. Terrain simplification simplified: A general framework for view-dependent out-of-core visualization. Technical Report UCRL-JC-147847, 2002.Google Scholar
  165. 165.
    A Liu and B. Joe. On the shape of tetrahedra from bisection. Mathematics of Computation, 63:141–154, 1994.Google Scholar
  166. 166.
    A Liu and B. Joe. Quality local refinement of tetrahedral meshes based on bisection. SIAM Journal on Scientific Computing, 16:1269–1291, 1995.Google Scholar
  167. 167.
    P. Liu and S. N. Bhatt. Experiences with parallel n-body simulation. IEEE Transactions on Parallel and Distributed Systems, 11(12):1306–1323, 2000.Google Scholar
  168. 168.
    X. Liu. Four alternative patterns of the Hilbert curve. Applied Mathematics and Communication, 147:741–752, 2004.Google Scholar
  169. 169.
    X. Liu and G. Schrack. Encoding and decoding the Hilbert order. Software: Practice and Experience, 26(12):1335–1346, 1996.Google Scholar
  170. 170.
    X. Liu and G. F. Schrack. A new ordering strategy applied to spatial data processing. International Journal of Geographical Information Science, 12(1):3–22, 1998.Google Scholar
  171. 171.
    Y. Liu and J. Snoeyink. A comparison of five implementations of 3d Delaunay tessellation. Combinatorial and Computational Geometry, 52:439–458, 2005.Google Scholar
  172. 172.
    J. Luitjens, M. Berzins, and T. Henderson. Parallel space-filling curve generation through sorting. Concurrency and Computation: Practice and Experience, 19:1387–1402, 2007.Google Scholar
  173. 173.
    G. Mainar-Ruiz and J.-C. Perez-Cortes. Approximate nearest neighbor search using a single space-filling curve and multiple representations of the data points. In 18th International Conference on Pattern Recognition, 2006 – ICPR 2006, pages 502–505, 2006.Google Scholar
  174. 174.
    B. Mandelbrot. The Fractal Geometry of Nature. Freeman and Company, 1977, 1982, 1983.Google Scholar
  175. 175.
    Y. Matias and A. Shamir. A video scrambling technique based on space filling curves. In Advances in Cryptology – CRYPTO ’87, volume 293, pages 398–417, 1987.Google Scholar
  176. 176.
    J. M. Maubach. Local bisection refinement for n-simplicial grids generated by reflection. SIAM Journal on Scientific Computing, 16(1):210–227, 1995.Google Scholar
  177. 177.
    J. M. Maubach. Space-filling curves for 2-simplicial meshes created with bisections and reflections. Applications of Mathematics, 3:309–321, 2005.Google Scholar
  178. 178.
    D. Meagher. Geometric modelling using octree encoding. Computer Graphics and Image Processing, 19:129–147, 1980.Google Scholar
  179. 179.
    D. Meagher. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3d objects by computer. Technical report, Image Processing Laboratory, Rensselaer Polytechnic Institute, 1980.Google Scholar
  180. 180.
    M. Mehl, M. Brenk, H.-J. Bungartz, K. Daubner, I. L. Muntean, and T. Neckel. An Eulerian approach for partitioned fluid-structure simulations on Cartesian grids. Computational Mechanics, 43(1):115–124, 2008.Google Scholar
  181. 181.
    M. Mehl, T. Neckel, and P. Neumann. Navier-stokes and lattice-boltzmann on octree-like grids in the Peano framework. International Journal for Numerical Methods in Fluids, 65(1):67–86, 2010.Google Scholar
  182. 182.
    M. Mehl, T. Neckel, and T. Weinzierl. Concepts for the efficient implementation of domain decomposition approaches for fluid-structure interactions. In U. Langer, M. Discacciati, D.E. Keyes, O.B. Widlund, and W. Zulehner, editors, Domain Decomposition Methods in Science and Engineering XVII, volume 60 of Lecture Notes in Science an Enginnering, 2008.Google Scholar
  183. 183.
    M. Mehl, T. Weinzierl, and C. Zenger. A cache-oblivious self-adaptive full multigrid method. Numerical Linear Algebra with Applications, 13(2–3):275–291, 2006.Google Scholar
  184. 184.
    J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, 2001.Google Scholar
  185. 185.
    N. Memon, D. L. Neuhoff, and S. Shende. An analysis of some common scanning techniques for lossless image coding. IEEE Transactions on Image Processing, 9(11):1837–1848, 2000.Google Scholar
  186. 186.
    R. Miller and Q. F. Stout. Mesh computer algorithms for computational geometry. IEEE Transactions on Computers, 38(3):321–340, 1989.Google Scholar
  187. 187.
    W. F. Mitchell. Adaptive refinement for arbitrary finite-element spaces with hierarchical bases. Journal of Computational and Applied Mathematics, 36:65–78, 1991.Google Scholar
  188. 188.
    W. F. Mitchell. Hamiltonian paths through two- and three-dimensional grids. Journal of Research of the National Institute of Standards and Technology, 110(2):127–136, 2005.Google Scholar
  189. 189.
    W. F. Mitchell. A refinement-tree based partitioning method for dynamic load balancing with adaptively refined grids. Journal of Parallel and Distributed Computing, 67(4):417–429, 2007.Google Scholar
  190. 190.
    G. Mitchison and R. Durbin. Optimal numberings of an N ×N-array. SIAM Journal on Algebraic and Discrete Methods, 7(4):571–582, 1986.Google Scholar
  191. 191.
    B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, 2001.Google Scholar
  192. 192.
    A. Mooney, J. G. Keating, and D. M. Heffernan. A detailed study of the generation of optically detectable watermarks using the logistic map. Chaos, Solitons and Fractals, 30(5):1088–1097, 2006.Google Scholar
  193. 193.
    E. H. Moore. On certain crinkly curves. Transactions of the American Mathematical Society, 1(1):72–90, 1900. Available online on JSTOR.Google Scholar
  194. 194.
    G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, 1966.Google Scholar
  195. 195.
    R. D. Nair, H.-W. Choi, and H. M. Tufo. Computational aspects of a scalable high-order discontinuous Galerkin atmospheric dynamical core. Computers & Fluids, 38(2):309–319, 2009.Google Scholar
  196. 196.
    R. Niedermeier, K. Reinhardt, and P. Sanders. Towards optimal locality in mesh-indexings. Discrete Applied Mathematics, 117:211–237, 2002.Google Scholar
  197. 197.
    M. G. Norman and P. Moscato. The Euclidian traveling salesman problem and a space-filling curve. Chaos, Solitons & Fractals, 6:389–397, 1995.Google Scholar
  198. 198.
    A. Null. Space-filling curves, or how to waste time with a plotter. Software: Practice and Experience, 1:403–410, 1971.Google Scholar
  199. 199.
    Y. Ohno and K. Ohyama. A catalog of non-symmetric self-similar space-filling curves. Journal of Recreational Mathematics, 23(4):247–254, 1991.Google Scholar
  200. 200.
    Y. Ohno and K. Ohyama. A catalog of symmetric self-similar space-filling curves. Journal of Recreational Mathematics, 23(3):161–174, 1991.Google Scholar
  201. 201.
    M. A. Oliver and N. E. Wiseman. Operations on quadtree encoded images. The Computer Journal, 26(1):83–91, 1983.Google Scholar
  202. 202.
    J. A. Orenstein. Spatial query processing in an object-oriented database system. ACM SIGMOD Record, 15(2):326–336, 1986.Google Scholar
  203. 203.
    J. A. Orenstein and F. A. Manola. PROBE spatial data modeling in an image database and query processing application. IEEE Transactions on Software Engineering, 14(5):611–629, 1988.Google Scholar
  204. 204.
    J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 181–190. ACM, 1984.Google Scholar
  205. 205.
    C.-W. Ou, M. Gunwani, and S. Ranka. Architecture-independent locality-improving transformations of computational graphs embedded in k-dimensions. In ICS ’95: Proceedings of the 9th International Conference on Supercomputing, pages 289–298, 1995.Google Scholar
  206. 206.
    C.-W. Ou, S. Ranka, and G. Fox. Fast and parallel mapping algorithms for irregular problems. Journal of Supercomputing, 10:119–140, 1996.Google Scholar
  207. 207.
    R. Pajarola. Large scale terrain visualization using the restricted quadtree triangulation. In VIS ’98: Proceedings of the Conference on Visualization ’98, pages 19–26. IEEE Computer Society Press, 1998.Google Scholar
  208. 208.
    S. Papadomanolakis, A. Ailamaki, J. C. Lopez, T. Tu, D. R. O’Hallaron, and G. Heber. Efficient query processing on unstructured tetrahedral meshes. In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 551–562, 2006.Google Scholar
  209. 209.
    M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowski. A common data management infrastructure for parallel adaptive algorithms for PDE solutions. In Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, pages 1–22. ACM Press, 1997.Google Scholar
  210. 210.
    M. Parashar and J. C. Browne. On partitioning dynamic adaptive grid hierarchies. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, pages 604–613, 1996.Google Scholar
  211. 211.
    V. Pascucci. Isosurface computation made simple: Hardware acceleration, adaptive refinement and tetrahedral stripping. In Joint Eurographics-IEEE TVCG Symposium on Visualization (VisSym), pages 293–300, 2004.Google Scholar
  212. 212.
    A. Patra and J. T. Oden. Problem decomposition for adaptive hp finite element methods. Computing Systems in Engineering, 6(2):97–109, 1995.Google Scholar
  213. 213.
    A. K. Patra, A. Laszloffy, and J. Long. Data structures and load balancing for parallel adaptive hp finite-element methods. Computers & Mathematics with Applications, 46(1):105–123, 2003.Google Scholar
  214. 214.
    G. Peano. Sur une courbe, qui remplit toute une aire plane. Mathematische Annalen, 36:157–160, 1890. Available online on the Göttinger Digitalisierungszentrum.Google Scholar
  215. 215.
    A. Pérez, S. Kamata, and E. Kawagutchi. Peano scanning of arbitrary size images. In 11th IAPR International Conference on Pattern Recognition, 1992. Vol.III. Conference C: Image, Speech and Signal Analysis, pages 565–568. IEEE, 1992.Google Scholar
  216. 216.
    E. Perlman, R. Burns, Y. Li, and C. Meneveau. Data exploration of turbulence simulations using a database cluster. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–11, 2007.Google Scholar
  217. 217.
    J. R. Pilkington and S. B. Baden. Partitioning with spacefilling curves. CSE Technical Report Number CS94–349, University of California, San Diego, 1994.Google Scholar
  218. 218.
    J. R. Pilkington and S. B. Baden. Dynamic partitioning of non-uniform structured workloads with spacefilling curves. IEEE Transactions on Parallel and Distributed Systems, 7(3):288–300, 1996.Google Scholar
  219. 219.
    L. K. Platzman and J. J. Bartholdi III. Spacefilling curves and the planar travelling salesman problem. Journal of the ACM, 36(4):719–737, 1989.Google Scholar
  220. 220.
    G. Polya. Über eine Peanosche Kurve. Bulletin de l’Académie des Sciences de Cracovie, Série A, pages 1–9, 1913.Google Scholar
  221. 221.
    P. Prusinkiewicz. Graphical applications of L-systems. In Proceedings of Graphics Interface ’86/Vision Interface ’86, pages 247–253, 1986.Google Scholar
  222. 222.
    P. Prusinkiewicz and A. Lindenmayer. The Algorithmic Beauty of Plants. Springer, 1990.Google Scholar
  223. 223.
    P. Prusinkiewicz, A. Lindenmayer, and F. D. Fracchia. Synthesis of space-filling curves on the square grid. In Fractals in the Fundamental and Applied Sciences, pages 341–366. North Holland, Elsevier Science Publisher B.V., 1991.Google Scholar
  224. 224.
    J. Quinqueton and M. Berthod. A locally adaptive Peano scanning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-3(4):403–412, 1981.Google Scholar
  225. 225.
    A. Rahimian, I. Lashuk, S. K. Veerapaneni, A. Chandramowlishwaran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure, J. Vetter, R. Vuduc, D. Zorin, and G. Biros. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In ACM/IEEE Conference on Supercomputing, 2010, pages 1–11, 2010.Google Scholar
  226. 226.
    F. Ramsak, V. Markl, R. Fenk, Elhardt K., and R. Bayer. Integrating the UB-tree into a database system kernel. In Proceedings of the 26th International Conference on Very Large Databases, pages 263–272, 2000.Google Scholar
  227. 227.
    M.-C. Rivara and Ch. Levin. A 3-d refinement algorithm suitable for adaptive and multi-grid techniques. Communications in Applied Numerical Methods, 8:281–290, 1992.Google Scholar
  228. 228.
    S. Roberts, S. Kalyanasundaram, M. Cardew-Hall, and W. Clarke. A key based parallel adaptive refinement technique for finite element methods. In Computational Techniques and Applications: CTAC 97, pages 577–584. World Scientific Press, 1998.Google Scholar
  229. 229.
    B. Roychoudhury and J. F. Muth. The solution of travelling salesman problems based on industrial data. Journal of Complexity, 46(3):347–353, 1995.Google Scholar
  230. 230.
    H. Sagan. Some reflections on the emergence of space-filling curves. Journal of the Franklin Institute, 328:419–430, 1991.Google Scholar
  231. 231.
    H. Sagan. On the geometrization of the Peano curve and the arithmetization of the Hilbert curve. International Journal of Mathematical Education in Science and Technology, 23(3):403–411, 1992.Google Scholar
  232. 232.
    H. Sagan. A three-dimensional Hilbert curve. International Journal of Mathematical Education in Science and Technology, 24(4):541–545, 1993.Google Scholar
  233. 233.
    H. Sagan. Space-Filling Curves. Universitext. Springer, 1994.Google Scholar
  234. 234.
    J. K. Salmon, M. S. Warren, and G. S. Winckelmans. Fast parallel tree codes for gravitational and fluid dynamical n-body problems. International Journal of Supercomputer Applications, 8(2):129–142, 1994.Google Scholar
  235. 235.
    H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.Google Scholar
  236. 236.
    P. Sanders and T. Hansch. Efficient massively parallel quicksort. In Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, volume 1253 of Lecture Notes in Computer Science, pages 13–24, 1997.Google Scholar
  237. 237.
    M. Saxena, P. M. Finnigan, C. M. Graichen, A. F. Hathaway, and V. N. Parthasarathy. Octree-based automatic mesh generation for non-manifold domains. Engineering with Computers, 11(1):1–14, 1995.Google Scholar
  238. 238.
    S. Schamberger and J.-M. Wierum. A locality preserving graph ordering approach for implicit partitioning: Graph-filling curves. In Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems, (PDCS’04), pages 51–57. ISCA, 2004.Google Scholar
  239. 239.
    S. Schamberger and J.-M. Wierum. Partitioning finite element meshes using space-filling curves. Future Generation Computer Systems, 21:759–766, 2005.Google Scholar
  240. 240.
    K. Schloegel, G. Karypis, and V. Kumar. Graph partitioning for high-performance scientific simulations, pages 491–541. Morgan Kaufmann Publishers Inc., 2003.Google Scholar
  241. 241.
    W. J. Schroeder and M. S. Shephard. A combined octree/Delaunay method for fully automatic 3-d mesh generation. International Journal for Numerical Methods in Engineering, 26(1):37–55, 1988.Google Scholar
  242. 242.
    E. G. Sewell. Automatic generation of triangulation for piecewise polynomial approximation. Technical Report CSD-TR83, Purdue University, 1972. PhD Thesis.Google Scholar
  243. 243.
    M. S. Shephard and M. K. Georges. Automatic three-dimensional mesh generation by the finite octree technique. International Journal for Numerical Methods in Engineering, 32(4):709–749, 1991.Google Scholar
  244. 244.
    W. Sierpinski. Sur une nouvelle courbe continue qui remplit toute une aire plane. Bulletin de l’Académie des Sciences de Cracovie, Série A, pages 462–478, 1912.Google Scholar
  245. 245.
    J. P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy. Load balancing and data locality in adaptive hierarchical n-body methods: Barnes–Hut, fast multipole, and radiosity. Journal of Parallel and Distributed Computing, 27:118–141, 1995.Google Scholar
  246. 246.
    R. Siromoney and K. G. Subramanian. Space-filling curves and infinite graphs. In Graph-Grammars and Their Application to Computer Science, volume 153 of Lecture Notes in Computer Science, pages 380–391, 1983.Google Scholar
  247. 247.
    B. Smith, P. Bjørstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.Google Scholar
  248. 248.
    V. Springel. The cosmological simulation code GADGET-2. Monthly Notices of the Royal Astronomical Society, 364:1105–1134, 2005.Google Scholar
  249. 249.
    J. Steensland, S. Chandra, and M. Parashar. An application-centric characterization of domain-based SFC partitioners for parallel SAMR. IEEE Transactions on Parallel and Distributed Systems, 13(12):1275–1289, 2002.Google Scholar
  250. 250.
    R. J. Stevens, A. F. Lehar, and F. H. Preston. Manipulation and presentation of multidimensional image data using the Peano scan. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(5):520–526, 1983.Google Scholar
  251. 251.
    Q. F. Stout. Topological matching. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, pages 24–31, 1983.Google Scholar
  252. 252.
    H. Sundar, R. S. Sampath, S. S. Adavani, C. Davatzikos, and G. Biros. Low-constant parallel algorithms for finite element simulations using linear octrees. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–12. ACM, 2007.Google Scholar
  253. 253.
    H. Sundar, R. S. Sampath, and G. Biros. Bottom-up construction and 2:1 balance refinement of linear octrees in parallel. SIAM Journal on Scientific Computing, 30(5):2675–2708, 2008.Google Scholar
  254. 254.
    S. Tanimoto and T. Pavlidis. A hierarchical data structure for picture processing. Computer Graphics and Image Processing, 4:104–119, 1975.Google Scholar
  255. 255.
    J. Thiyagalingam, O. Beckmann, and P. H. J. Kelly. Is Morton layout competitive for large two-dimensional arrays yet? Concurrency and Computation: Practice and Experience, 18:1509–1539, 2006.Google Scholar
  256. 256.
    S. Tirthapura, S. Seal, and W. Aluru. A formal analysis of space filling curves for parallel domain decomposition. In Proceedings of the 2006 International Conference on Parallel Processing (ICPP’06), pages 505–512. IEEE Computer Society, 2006.Google Scholar
  257. 257.
    H. Tropf and H. Herzog. Multidimensional range search in dynamically balanced trees. Angewandte Informatik (Applied Informatics), 2:71–77, 1981.Google Scholar
  258. 258.
    T. Tu, D. R. O’Hallaron, and O. Ghattas. Scalable parallel octree meshing for terascale applications. In SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 4. IEEE Computer Society, 2005.Google Scholar
  259. 259.
    L. Velho and J. de Miranda Gomes. Digital halftoning with space filling curves. ACM SIGGRAPH Computer Graphics, 25(4):81–90, 1991.Google Scholar
  260. 260.
    L. Velho and J. Gomes. Variable resolution 4-k meshes: Concepts and applications. Computer Graphics forum, 19(4):195–212, 2000.Google Scholar
  261. 261.
    B. Von Herzen and A. H. Barr. Accurate triangulations of deformed, intersecting surfaces. ACM SIGGRAPH Computer Graphics, 21(4):103–110, 1987.Google Scholar
  262. 262.
    J. Wang and J. Shan. Space filling curve based point clouds index. In Proceedings of the 8th International Conference on GeoComputation, pages 551–562, 2005.Google Scholar
  263. 263.
    J. Warnock. A hidden surface algorithm for computer generated halftone pictures. Technical Report TR 4-15,, Computer Science Department, University of Utah, 1969.Google Scholar
  264. 264.
    M. S. Warren and J. K. Salmon. A parallel hashed oct-tree n-body algorithm. In Conference on High Performance Networking and Computing, Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 12–21. ACM, 1993.Google Scholar
  265. 265.
    T. Weinzierl. A Framework for Parallel PDE Solvers on Multiscale Adaptive Cartesian Grids. Dissertation, Institut für Informatik, Technische Universität München, 2009.Google Scholar
  266. 266.
    T. Weinzierl and M. Mehl. Peano – a traversal and storage scheme for octree-like adaptive Cartesian multiscale grids. SIAM Journal on Scientific Computing, 33(5):2732–2760, 2011.Google Scholar
  267. 267.
    R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.Google Scholar
  268. 268.
    J.-M. Wierum. Definition of a new circular space-filling curve – βΩ-indexing. Technical Report TR-001-02, Paderborn Center for Parallel Computing, PC2, 2002.Google Scholar
  269. 269.
    N. Wirth. Algorithmen und Datenstrukturen. Teubner, 1975.Google Scholar
  270. 270.
    N. Wirth. Algorithms + Data Structures = Programs. Prentice Hall, 1976.Google Scholar
  271. 271.
    D. S. Wise. Representing matrices as quadtrees for parallel processors. Information Processing Letters, 20(4):195–199, 1985.Google Scholar
  272. 272.
    D. S. Wise and S. Franco. Costs of quadtree representation of nondense matrices. Journal of Parallel and Distributed Computing, 9(3):282–296, 1990.Google Scholar
  273. 273.
    I. H. Witten and R. M. Neal. Using Peano curves for bilevel display of continuous tone images. IEEE Computer Graphics and Applications, pages 47–52, May 1982.Google Scholar
  274. 274.
    I. H. Witten and B. Wyvill. On the generation and use of space-filling curves. Software: Practice and Experience, 13:519–525, 1983.Google Scholar
  275. 275.
    Dawes W. N., S. A. Harvey, S. Fellows, N. Eccles, D. Jaeggi, and W. P Kellar. A practical demonstration of scalable, parallel mesh generation. In 47th AIAA Aerospace Sciences Meeting & Exhibit, 2009. AIAA-2009-0981.Google Scholar
  276. 276.
    W. Wunderlich. Irregular curves and functional equations. Ganita, 5:215–230, 1954.Google Scholar
  277. 277.
    W. Wunderlich. Über Peano-Kurven. Elemente der Mathematik, 28(1):1–24, 1973.Google Scholar
  278. 278.
    K. Yang and M. Mills. Fractal based image coding scheme using Peano scan. In Proceedings of ISCAS ’88, volume 1470, pages 2301–2304, 1988.Google Scholar
  279. 279.
    M.-M. Yau and S. N. Srihari. A hierarchical data structure for multidimensional digital images. Communications of the ACM, 26(7):504–515, 1983.Google Scholar
  280. 280.
    L. Ying, G. Biros, D. Zorin, and H. Langston. A new parallel kernel-independent fast multipole method. In SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2003.Google Scholar
  281. 281.
    K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson. An experimental comparison of cache-oblivious and cache-conscious programs. In Proceedings of the 19th annual ACM symposium on Parallel algorithms and architectures, pages 93–104, 2007.Google Scholar
  282. 282.
    Y. Zhang and R. E. Webber. Space diffusion: an improved parallel halftoning technique using space-filling curves. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pages 305–312. ACM New York, 1993.Google Scholar
  283. 283.
    S. Zhou and C. B. Jones. HCPO: an efficient insertion order for incremental delaunay triangulation. Information Processing Letters, 93:37–42, 2005.Google Scholar
  284. 284.
    U. Ziegler. The NIRVANA code: Parallel computational MHD with adaptive mesh refinement. Computer Physics Communications, 179(4):227–244, 2008.Google Scholar
  285. 285.
    G. Zumbusch. On the quality of space-filling curve induced partitions. Zeitschrift für Angewandte Mathematik und Mechanik, 81, Suppl. 1:25–28, 2001.Google Scholar
  286. 286.
    G. Zumbusch. Load balancing for adaptively refined grids. Proceedings in Applied Mathematics and Mechanics, 1:534–537, 2002.Google Scholar
  287. 287.
    G. Zumbusch. Parallel Multilevel Methods: Adaptive Mesh Refinement and Loadbalancing. Vieweg+Teubner, 2003.Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Michael Bader
  1. Department of Informatics, Technische Universität München, Munich, Germany
