Space-Filling Curves, pp. 195–214

# Case Study: Cache Efficient Algorithms for Matrix Operations

## Abstract

In Chaps. 10 and 11, we discussed applications of space-filling curves for parallelisation, which were motivated by their locality properties. In the following two chapters, we will discuss further applications, which again exploit the intrinsic locality properties of space-filling curves. As both applications will focus on inherently cache-efficient algorithms, we will start with a short introduction to cache-memory architectures, and discuss the resulting requirements on cache-efficient algorithms.

## 13.1 Cache Efficient Algorithms and Locality Properties

In computer architecture, a so-called *cache* (or cache memory) denotes a fast memory component that replicates a certain part of the main memory to allow faster access to the cached (replicated) data. Such caches are necessary, because standard memory hardware is nowadays much slower than the CPUs. The access latency, i.e. the time between a data request and the arrival of the first requested piece of data, is currently^{1} about 60–70 ns. During that time, a fast CPU core can perform more than 100 floating point operations. This so-called “memory-gap” between CPU speed and main memory is constantly getting worse, because CPU speed is improving much faster (esp. due to using multicore processors) than memory latency. For memory bandwidth (i.e. the maximum rate of data transfer from memory to CPU), the situation is a bit better, but there are already many applications in scientific computing whose performance is limited by memory bandwidth instead of CPU speed.

*Memory-Bound Performance and Cache Memory*

The size and speed of the individual cache levels will have an interesting effect on the runtime of software. In the classical analysis of the runtime complexity of algorithms, a typical job is to determine how the runtime depends on the input size, which leads to the well-known \(\mathcal{O}(N)\) considerations. There, we often assume that all operations are executed with the same speed, which is no longer true on cache-based systems.

*How Cache Memory Works: Cache Lines and Replacement Strategies*

For technical reasons, caches do not store individual bytes or words, but small contiguous chunks of memory, so-called *cache lines*, which are always transferred to and from memory as one block. Hence, cache lines are mapped to corresponding lines in main memory. To simplify (and, thus, speed up) this mapping, lines of memory can often be transferred only to a small subset of cache lines: we speak of an *n-associative* cache, if a memory line can be kept in *n* different cache lines. The simplest case, a 1-associative cache, is called a *direct-mapped cache*. A fully associative cache, where memory lines may be kept in any cache line, is powerful, but much more expensive to build (if the cache access needs to stay fast).

If we want to load a specific cache line from memory, but already have a full cache, we naturally have to remove one cache line from the cache. In an ideal case, we would only remove cache lines that are no longer accessed. As the cache hardware can only guess the future access pattern, certain heuristics are used to replace cache lines. Typical strategies are to remove the cache line that was not used for the longest time (*least recently used*), or that had the fewest accesses in the recent history (*least frequently used*). A programmer typically has almost no influence on what data is kept in the cache. Only for loading data into the cache, so-called prefetching commands are sometimes available.

Associativity, replacement strategy, and other hardware properties may, of course, vary between the different levels of caches. And it is also clear that these restrictions have an influence on the efficiency of using caches.
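The interplay of cache lines and a *least recently used* policy can be made concrete with a toy simulator. The following sketch is a hypothetical, fully associative model (real caches have limited associativity, as noted above); it counts hits and misses for two streaming access patterns:

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of a fully associative cache with LRU replacement.
    Tracks which memory lines are cached and counts hits and misses."""

    def __init__(self, num_lines, line_words):
        self.num_lines = num_lines
        self.line_words = line_words
        self.lines = OrderedDict()      # cached line ids, in LRU order
        self.hits = self.misses = 0

    def access(self, address):
        line = address // self.line_words   # memory line holding this word
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)    # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) == self.num_lines:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[line] = True

# Streaming twice over an array that fits into the cache:
# the second pass hits on every access.
small = LRUCache(num_lines=8, line_words=4)
for _ in range(2):
    for addr in range(32):          # 32 words = 8 lines, fits exactly
        small.access(addr)

# Streaming twice over an array twice the cache size: LRU always evicts
# the line that will be needed next, so every line misses again.
large = LRUCache(num_lines=8, line_words=4)
for _ in range(2):
    for addr in range(64):          # 64 words = 16 lines
        large.access(addr)
```

The second experiment illustrates why pure streaming over large data sets cannot profit from temporal locality: by the time a line is re-used, it has long been evicted.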

*Cache Memory and Locality*

The efficiency of using cache memory therefore depends on whether the data accesses of a program exhibit *temporal* or *spatial* locality properties:

- *Temporal locality* means that a single piece of data will be repeatedly accessed during a short period of time. Replacement strategies such as *least recently used* or *least frequently used* will then reduce the probability of removing this data from the cache to a minimum.
- *Spatial locality* means that after an access to a data item, the next access(es) will be to items that are stored at a neighbouring memory address. If such an item belongs to the same cache line as the previously accessed item, it has been loaded into the cache as well, and can be accessed efficiently.

Hence, the cache efficiency of an algorithm depends on its temporal and spatial locality properties.

*Cache-Aware* and *Cache-Oblivious* Algorithms

*Cache-aware* algorithms or implementations use detailed information about the cache architecture. They try to increase the temporal or spatial locality by adapting the access pattern to exactly fit the number and size of the cache levels, the length of the cache lines, etc. Hence, such a cache-aware implementation is specific to a particular platform, and at least certain parameters need to be tuned if the architecture changes. *Cache-oblivious* algorithms, in contrast, do not use explicit knowledge of a specific architecture. Instead, the algorithms are designed to be inherently cache efficient, and profit from any presence of caches, regardless of their size and the number of cache levels. Hence, their data access patterns need to have excellent temporal and spatial locality properties.

We have seen that space-filling curves are an excellent tool to create data structures with good locality properties. Hence, we will discuss some examples of how these properties can be exploited to obtain *cache-oblivious* algorithms.

## 13.2 Cache Oblivious Matrix-Vector Multiplication

Given an *n* × *n*-matrix *A* with elements *a*_{ij}, the elements of the matrix-vector product *y* = *Ax* are given as \({y}_{i} =\sum\nolimits_{j=1}^{n}{a}_{ij}{x}_{j}\). Algorithm 13.1 computes these sums via two nested loops over the indices *i* and *j* (accumulating the partial sums directly in the elements of the result vector *y*).

To determine the cache efficiency of this algorithm, let’s now examine the temporal and spatial locality properties of this implementation. Regarding the matrix *A*, we notice that each element is accessed exactly once. Hence, temporal locality of the access will not be an issue, as no element will be reused. The spatial locality depends on the memory layout of the matrix. If the elements are stored in the same sequence as they are traversed by the two nested loops, the spatial locality will be optimal. Hence, for Algorithm 13.1, the matrix elements should be stored row-by-row (so-called *row-major* layout). However, if our programming language uses *column-major* layout (as in FORTRAN, for example), we should change Algorithm 13.1: the spatial locality would then be optimal, if we exchange the *i*- and *j*-loop. In general, spatial locality is perfect as long as we traverse the matrix elements in the same order as they are stored in memory.

Changing the traversal scheme of the matrix elements will, however, also change the access to the two vectors *x* and *y*. As both vectors are accessed or updated *n* times throughout the execution of the algorithm, both temporal and spatial locality are important. For the loop-based access given by Algorithm 13.1, the access to both vectors is spatially local, even if we exchange the two loops. However, the temporal locality is different for the two vectors. The access to vector *y* is optimal: here, all *n* accesses to a single element are executed, before we move on to the next element. For the access to *x*, we have exactly the opposite situation: before an element of *x* is re-used, we first access all other elements of the vector. Hence, the temporal locality for this pattern is the worst we can find. A first interesting question is whether exchanging the loops will improve the situation. Then, *x* and *y* will change roles, and which option is faster depends on whether temporal locality is more important for the read access to *x* or for the write access to *y*. Even more interesting is the question whether we can derive a traversal of the matrix elements that leads to more local access patterns to both vectors.
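The two loop orders discussed above can be sketched as follows (a minimal Python rendering in the spirit of Algorithm 13.1, not the book's original pseudocode):

```python
def matvec_rowwise(A, x):
    """y = A*x traversing A row by row (i outer, j inner).

    With row-major storage this visits A in memory order: optimal
    spatial locality for A, optimal temporal locality for y (each y[i]
    is finished before moving on), worst temporal locality for x."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y

def matvec_columnwise(A, x):
    """The same product with the loops exchanged (j outer, i inner):
    x and y swap their roles with respect to temporal locality."""
    n = len(A)
    y = [0.0] * n
    for j in range(n):
        for i in range(n):
            y[i] += A[i][j] * x[j]
    return y
```

Both variants compute the same result; only the access pattern (and hence the cache behaviour) differs.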

*Matrix Traversals Using Space-Filling Curves*

In Algorithm 13.1, we easily see that the two loops may be exchanged. Actually, there is no restriction at all concerning the order in which we execute the element operations. We therefore restate Algorithm 13.1 in the following form:

From Algorithm 13.2, we can now consider execution orders without being influenced by loop constructions. We just need to make sure that each matrix element is processed exactly once, i.e. that we perform a *traversal*, and can then try to optimise the temporal and spatial locality properties. Our previous discussions of the locality properties of space-filling curves therefore provide us with an obvious candidate.

Assuming that the matrix dimension *n* is a power of two, we can use a Hilbert iteration, for example, to traverse the matrix. Hence, we modify the traversal algorithm of Chap. 3.2 to obtain an algorithm for matrix-vector multiplication. For that purpose, it is sufficient to interpret our direction operations up, down, left, and right as index operations on the matrix and the two vectors. Operators up and down will increase or decrease *i*, i.e., the current row index in *A* and vector index in *y*. Operators left and right will change *j*, which is the column index of *A* and the vector index of *x*, respectively. Algorithm 13.3 outlines the full algorithm for matrix-vector multiplication.
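Algorithm 13.3 itself is grammar-based; as a rough illustration of the same idea, the following sketch computes the index pair for each step of the Hilbert traversal directly, using the well-known iterative index conversion. The helper `d2xy` and the conventional indexing of *A* are simplifications of my own: for full spatial locality, the matrix elements would additionally be stored in Hilbert order, as the text explains.

```python
def d2xy(n, d):
    """Map index d (0 <= d < n*n) along an n-by-n Hilbert curve (n a
    power of two) to grid coordinates (i, j). Iterative conversion,
    adapted from the classic bit-manipulation formulation."""
    i = j = 0
    s, t = 1, d
    while s < n:
        ri = 1 & (t // 2)
        rj = 1 & (t ^ ri)
        if rj == 0:                 # rotate the current quadrant
            if ri == 1:
                i = s - 1 - i
                j = s - 1 - j
            i, j = j, i
        i += s * ri                 # move into the selected quadrant
        j += s * rj
        t //= 4
        s *= 2
    return i, j

def hilbert_matvec(A, x):
    """y = A*x with the element updates executed in Hilbert order.
    (A is indexed conventionally here for clarity; a cache-oblivious
    implementation would store A's elements in Hilbert order, too.)"""
    n = len(A)
    y = [0.0] * n
    for d in range(n * n):
        i, j = d2xy(n, d)
        y[i] += A[i][j] * x[j]
    return y
```

Since every index pair is visited exactly once, the result is identical to the loop-based product; only the order of the updates, and hence the locality of the accesses to *x* and *y*, changes.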

To retain the optimal locality of the access to the matrix elements, the matrix elements need to be stored according to the Hilbert order, as well. For the accesses to the vectors *x* and *y*, we obtain an overall improved temporal locality. During the matrix traversal, all elements of a \({2}^{k} \times {2}^{k}\)-block will be processed before the Hilbert order moves on to the next \({2}^{k} \times {2}^{k}\)-block. During the corresponding \({({2}^{k})}^{2}\) element operations, 2^{k} elements of vector *x* and 2^{k} elements of vector *y* will be accessed – each of them 2^{k} times. On average, we will therefore execute *m*^{2} operations on a working set of only 2*m* elements – for any *m* ≤ *n*. Hence, even if only a small number of elements fits into a given cache, we can guarantee re-use of the elements of both *x* and *y*. Our “naive” implementation in Algorithm 13.1 guarantees this only for one of the two vectors.

## 13.3 Matrix Multiplication Using Peano Curves

Given two *n* × *n*-matrices *A* and *B*, the elements of the matrix product \(C = \mathit{AB}\) are given as \({C}_{ik} =\sum\nolimits_{j=1}^{n}{A}_{ij}{B}_{jk}\). The classical implementation computes the elements *C*_{ik} via three nested loops – one over *j* to compute an individual element *C*_{ik}, and two loops over *i* and *k*, respectively – compare Algorithm 13.4.

Again, we can exchange the three loops arbitrarily. However, for large matrices, the cache efficiency will be less than optimal for any order. In library implementations, multiple optimisation methods, such as loop unrolling or blocking and tiling, are applied to improve the performance of matrix multiplication. The respective methods carefully change the execution order to match the cache levels – in particular the sizes of the individual caches (see the references section at the end of this chapter). In the following, we will discuss an approach based on Peano curves, which leads to a cache oblivious algorithm, instead.

As for matrix-vector multiplication, we start with Algorithm 13.5, where we stress that we simply have to execute *n*^{3} updates \({C}_{ik} = {C}_{ik} + {A}_{ij}{B}_{jk}\) for all triples (*i*, *j*, *k*). Thus, matrix multiplication is again an instance of a traversal problem.
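This traversal view is easy to check experimentally. The sketch below (an illustrative Python fragment, not from the book) executes the *n*^{3} element updates once in classical loop order and once in a random order, with identical results:

```python
import random

def matmul_triples(A, B, order):
    """C = A*B executed as a sequence of element updates
    C[i][k] += A[i][j] * B[j][k], one per triple (i, j, k) in `order`.
    Any order containing every triple exactly once yields the same
    result: matrix multiplication is a traversal problem."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i, j, k in order:
        C[i][k] += A[i][j] * B[j][k]
    return C

n = 3
triples = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % n for j in range(n)] for i in range(n)]

C_loop = matmul_triples(A, B, triples)   # classical nested-loop order
random.shuffle(triples)
C_rand = matmul_triples(A, B, triples)   # arbitrary traversal order
```

The freedom to reorder is exactly what the Peano scheme exploits: among all valid traversals, it picks one with good locality for all three matrices.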

The *n*^{3} element operations correspond to a 3D traversal of the triple space. For the access to the matrix elements, in contrast, only two out of the three indices are needed for each of the involved matrices. Hence, the indices are obtained via respective projections. We will therefore use a Peano curve for the traversal, because the projections of the classical 3D Peano curve to 2D will again lead to 2D Peano curves. In Fig. 13.4, this property is illustrated for the vertical direction; it holds for the other two coordinate directions, as well. Figure 13.4 also tells us that if we execute the matrix operations \({C}_{ik} = {C}_{ik} + {A}_{ij}{B}_{jk}\) according to a 3D Peano order, the matrix elements will be accessed according to 2D Peano orders. As a consequence, we should store the elements in that order, if possible. We will first try this for the simple case of a 3 × 3 matrix multiplication:

- A 2D Peano order that defines the data structure for the matrix elements;
- A recursive extension of the 3 × 3-scheme in Eq. 13.2, which is basically obtained by using matrix blocks instead of elements;
- A concept for matrices of arbitrary size, as the standard Peano order will only work for matrices of size \({3}^{k} \times {3}^{k}\).

The letters *P*, *Q*, *R*, and *S* now denote a numbering scheme for the corresponding subblock in the matrix.

### 13.3.1 Block-Recursive Peano Matrix Multiplication

### 13.3.2 Memory Access Patterns During the Peano Matrix Multiplication

Execution orders for the eight different block multiplication schemes. ‘ + ’ indicates that the access pattern for the respective matrix *A*, *B*, or *C* is executed in forward direction (from element 0 to 8). ‘ − ’ indicates backward direction (starting with element 8)

Block scheme | \(P\,+=\,\mathit{PP}\) | \(P\,+=\,\mathit{RQ}\) | \(Q\,+=\,\mathit{QP}\) | \(Q\,+=\,\mathit{SQ}\) | \(R\,+=\,\mathit{PR}\) | \(R\,+=\,\mathit{RS}\) | \(S\,+=\,\mathit{QR}\) | \(S\,+=\,\mathit{SS}\)
---|---|---|---|---|---|---|---|---
Access to *A* | + | + | + | + | − | − | − | −
Access to *B* | + | + | − | − | + | + | − | −
Access to *C* | + | − | + | − | + | − | + | −

- On matrix *A*, we have traversed the *P*-ordered block 0, which means that our last access was to the last element in this block. The \(Q\,+=\,\mathit{QP}\) scheme runs in ‘ + ’ direction on *A* (see Table 13.1), such that the first access to the *Q*-ordered block 1 in *A* will be to the first element, which is of course a direct neighbour of the last element of block 0.
- In matrix *C*, we have the identical situation: the *P*-ordered block 0 has been traversed in ‘ + ’ direction, and the next access will be to the first element in the *Q*-ordered block 1.
- In matrix *B*, both execution sequences work on the *P*-ordered block 0. However, while the \(P\,+=\,\mathit{PP}\) scheme accesses *B* in ‘ + ’ direction, the \(Q\,+=\,\mathit{QP}\) scheme will run in ‘ − ’ direction on *B*. Hence, the last access of the \(P\,+=\,\mathit{PP}\) scheme is to the last element of the block, which will also be the start of the \(Q\,+=\,\mathit{QP}\) scheme.

Hence, the increment/decrement property stays valid between the two block operations. A careful analysis (which is also too tedious to be discussed here) reveals that this is true for all recursive block operations occurring in our multiplication scheme.

As a result, the Peano matrix multiplication can be implemented as illustrated in Algorithm 13.6. There, the schemes \(P\,+=\,\mathit{PP}\), \(Q\,+=\,\mathit{QP}\), etc., are coded via the ‘ + ’- or ‘ − ’-directions of the execution orders for the three matrices, as listed in Table 13.1. The change of direction throughout the recursive calls is coded in the parameters phsA, phsB, and phsC.

- A given range of contiguous operations (highlighted in the left image) will access only a certain contiguous subset of matrix elements.
- Vice versa, a contiguous subset of matrix elements will be accessed by a set of operations that consists of only a few contiguous sequences.

Both the operator subsets and the ranges of elements define partitions that can be used to parallelise the Peano matrix multiplication (following a work-oriented or an owner-computes partitioning, respectively). However, the underlying locality properties also make the algorithm inherently cache efficient, which we will examine in the following section.

### 13.3.3 Locality Properties and Cache Efficiency

Figure 13.6 illustrates the spatial locality properties of the new matrix multiplication. From the respective chart, we can infer a so-called *access locality function*, *L*_{M}(*n*), which we can define as the maximal possible distance (in memory) between two elements of a matrix *M* that are accessed within *n* contiguous operations.

A loop-based implementation will, at best, access *k* successive elements during *k* operations – which already requires some optimisations in order to avoid stride-*n* accesses. For the naive implementation given in Algorithm 13.4, we thus have \({L}_{M}(n) \geq n\) for all involved matrices. For an improved algorithm that operates on matrix blocks of size *k* × *k*, we will typically obtain loops over *k* contiguous elements, so the worst case will reduce to \({L}_{M}(k) \geq k\) with *k* ≪ *n*. However, as long as we stay within a *k* × *k* block, we will perform *k*^{3} operations on only *k*^{2} elements of a matrix. Hence, if our storage scheme for the matrices uses *k* × *k*-blocks, as well, and stores their elements contiguously, we perform *k*^{3} operations on *k*^{2} contiguous elements. The best case that could be achieved for the access locality function *L*_{M} should therefore be \({L}_{M}(k) \approx {k}^{2/3}\). Thus, even a blocked approach has the following limitations on *L*_{M}:

- We have only one block size, *k*_{0}, that can lead to the optimal locality – this could be cured, if we change to a recursive blocking scheme (a first step towards space-filling curves).
- The locality will still be \({L}_{M}(k) \geq k\), if we are within the smallest block.
- Between two blocks, the locality will only be obtained, if two successively accessed blocks are stored contiguously in memory.

The recursive blocking and the last property (contiguity) is exactly what is achieved by the Peano multiplication scheme. We can therefore obtain \(L(k) \in \mathcal{O}({k}^{2/3})\) as an upper bound of the extent of the index range for any value of *k*.

For matrix *B*, which is always reused three times, the longest streak is two such block multiplications: of three successive block multiplications, either the first two or the last two work on the same *B*-block. In the scheme for matrix *A*, up to nine consecutive block multiplications can occur until a block is re-used. For matrix *C*, three contiguous block operations occur. During recursion, though, two such streaks can occur right after each other. For matrix *A*, we therefore obtain 18 as the length of the longest streak. In the worst case, we can therefore perform 18 *n*^{3} block operations on matrix *A* that work on 18 *n*^{2} contiguous elements of *A*. Thus, for matrix *A*, we get that \({L}_{A}(k) \in \mathcal{O}({k}^{2/3})\). For matrices *B* and *C*, we get a maximum streak length of 6, and therefore the corresponding bounds for *L*_{B} and *L*_{C}, in addition, involve very low constants.

### 13.3.4 Cache Misses on an Ideal Cache

To quantify the cache efficiency, we estimate the number of cache line transfers on an *ideal cache*, which obeys the following assumptions:

- The cache consists of *M* words that are organized as cache lines of *L* words each. Hence, we have *M*∕*L* cache lines. The external memory can be of arbitrary size and is structured into memory lines of *L* words.
- The cache is fully associative, i.e., it can load any memory line into any cache line.
- If lines need to be evicted from the cache, we assume that the cache can “foresee the future”, and will always evict a line that is no longer needed, or whose next access is farthest in the future.

Consider a *k* × *k* block multiplication, where *k* is the largest power of 3, such that three *k* × *k* matrix blocks fit into the cache. Hence, \(3 \cdot {k}^{2} < M\), but \(3 \cdot {(3k)}^{2} > M\), i.e. \(\sqrt{M/27} < k < \sqrt{M/3}\). Each *k* × *k* block then occupies \(\left\lceil {k}^{2}/L \right\rceil\) cache lines – to simplify the following computation, we assume that \(\left \lceil \frac{{k}^{2}} {L} \right \rceil = \frac{{k}^{2}} {L}\).

A single block multiplication then causes at most \(3\,{k}^{2}/L\) cache line transfers: the three involved blocks fit into the cache and need to be loaded at most once. As we need \({(n/k)}^{3}\) such block operations, the total number of cache line transfers throughout an *n* × *n* multiplication will be at most

\({\left(\frac{n}{k}\right)}^{3} \cdot 3\,\frac{{k}^{2}}{L} = \frac{3\,{n}^{3}}{k\,L} \in \mathcal{O}\!\left(\frac{{n}^{3}}{L\sqrt{M}}\right).\)

Note that this estimate holds for every cache level at once, i.e., for the respective sizes *M* of the different caches. For realistic caches, we might obtain more cache transfers because of a bad replacement strategy. However, a cache that always evicts the cache line that was used longest ago (“least recently used” strategy) will expel those cache lines first that contain matrix elements that are farthest away in terms of location in memory, because the access pattern of the multiplication will make sure that all matrix elements that are “closer” in memory have been accessed at a later time. What we cannot consider at this point are effects due to limited associativity of the cache, i.e., if a cache can place memory lines only into a specific set of cache lines.
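The effect predicted by this estimate can be observed in a toy model. The following sketch (hypothetical parameters; a fully associative LRU cache rather than the clairvoyant ideal cache) counts cache line transfers for the naive triple loop and for a blocked traversal whose three blocks fit into the simulated cache:

```python
from collections import OrderedDict

def count_misses(accesses, cache_words, line_words):
    """Misses of a word-address trace on a fully associative LRU cache."""
    capacity = cache_words // line_words       # number of cache lines
    lines, misses = OrderedDict(), 0
    for addr in accesses:
        line = addr // line_words
        if line in lines:
            lines.move_to_end(line)
        else:
            misses += 1
            if len(lines) == capacity:
                lines.popitem(last=False)      # evict least recently used
            lines[line] = True
    return misses

def trace_naive(n):
    """Word addresses touched by the naive triple loop (row-major
    layout; A at offset 0, B at n*n, C at 2*n*n)."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield i * n + j                 # A[i][j]
                yield n * n + j * n + k         # B[j][k]
                yield 2 * n * n + i * n + k     # C[i][k]

def trace_blocked(n, b):
    """The same element operations, reordered into b-by-b blocks
    (b divides n), so that three blocks fit into the cache at once."""
    for I in range(0, n, b):
        for J in range(0, n, b):
            for K in range(0, n, b):
                for i in range(I, I + b):
                    for j in range(J, J + b):
                        for k in range(K, K + b):
                            yield i * n + j
                            yield n * n + j * n + k
                            yield 2 * n * n + i * n + k

n, L, M = 18, 6, 144        # matrix size, words per line, cache words
b = 6                       # block size: 3 * b*b = 108 <= M words
naive = count_misses(trace_naive(n), M, L)
blocked = count_misses(trace_blocked(n, b), M, L)
```

With these parameters the blocked traversal causes several times fewer line transfers than the naive loop, in line with the \(\mathcal{O}({n}^{3}/(L\sqrt{M}))\) estimate.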

### 13.3.5 Multiplying Matrices of Arbitrary Size

In practice, we will often stop the recursion at matrix blocks of size *k* × *k*, where *k* is chosen to respect hardware properties such as cache or register sizes. To extend our algorithm to matrices of arbitrary size, we then have three options:

1. We can embed the given matrices into larger matrices of size \({3}^{p} \times {3}^{p}\) (or, with blocks as leaves: \({3}^{p}k \times {3}^{p}k\)). The additional zeros should, of course, not be stored. A respective approach is described in Sect. 13.4, where such a *padding* approach is given for band matrices or sparse matrices in general. In such an approach, we will typically stop the recursion on matrix blocks as soon as these fit into the innermost cache of a CPU.
2. In Sect. 8.2.3, we introduced Peano iterations on 3D grids of size \(k \times l \times m\), where *k*, *l*, and *m* may be arbitrary odd numbers. The Peano matrix multiplication also works on such Peano iterations. For the leaf-block operations, we have to introduce schemes for matrix multiplication on \({n}_{1} \times {n}_{2}\) blocks, where *n*_{1} and *n*_{2} may be 3, 5, or 7, respectively. Actually, the scheme will work for leaf blocks of any odd size.
3. We can stick to the classical 3 × 3 recursion, if we stop the recursion on larger block sizes. If these are stored in regular row- or column-major order (compare the approach in Sect. 13.4), we just need to specify an upper limit for the size of these blocks, and can then use an implementation that is optimised for small (but arbitrary) matrix sizes.

Which approach performs best will depend on the specific scenario. The first approach is interesting, for example, if we can choose the size of the leaf-level matrix block such that three such blocks fit exactly into the innermost cache. The constant factor for the number of cache line transfers, as estimated in Eq. (13.8), will then be further reduced. The third approach is especially interesting for coarse-grain parallel computations. There, the leaf-level blocks will be implemented by a call to a sequential library, where the size of the leaf blocks is of reduced influence. See the works indicated in the references section for more details.

## 13.4 Sparse Matrices and Space-Filling Curves

Considering our previous experience with space-filling-curve data structures for matrices and arrays, but also for adaptively refined grids, we can try to use a quadtree-type structure to store matrices that contain a lot of zeros. As illustrated in Fig. 13.7, we recursively subdivide a matrix into smaller and smaller blocks. The substructuring scheme can be a quadtree scheme, but due to our previously introduced Peano algorithm, we again use a 3 × 3 refinement, i.e., a 3^{2}-spacetree. Once a matrix block consists entirely of zero elements, we can stop the refinement, mark the respective node as a zero-block leaf, and thus avoid storing the individual elements. For blocks that contain non-zero elements, we stop the recursion at blocks of a certain minimal size. On such blocks, we either store a dense matrix (possibly even in row- or column-major order) or a small sparse matrix (using a respective simple storage scheme). The tree structure for such a storage scheme is also illustrated in Fig. 13.7.

The matrix blocks, either sparse blocks or dense blocks, are stored in the sequential order given by the Peano curve. However, in contrast to the Peano scheme for dense matrices, the matrix blocks will now have varying size – because of the different sparsity patterns of the individual blocks, but also because the zero blocks are not stored at all. Hence, to access a specified block, it will usually be necessary to traverse the sparsity tree from its root. To issue recursive calls on one or several child matrix blocks, a parent node needs information on the exact position of all child blocks in the data structure. Thus, in a node, we will store the start addresses of all nine child blocks of the matrix. Zero blocks are automatically accommodated by this scheme by just storing two consecutive identical start addresses. As we typically need both start and end addresses for the children, we also need to store the end address of the last block. Hence, for every parent node, we will store a sequence of ten integers, as indicated by the data stream in Fig. 13.7. We thus obtain a data structure that follows the same idea as the modified depth-first traversal that was introduced for the refinement trees of adaptive grids in Sect. 10.5.
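A minimal sketch of this serialisation (illustrative Python; a serpentine child order stands in for the 2D Peano numbering, and leaf blocks are stored densely) shows how the ten offsets per interior node and the implicitly encoded zero blocks fit together:

```python
def build_spacetree(M, top, left, size, min_size, stream):
    """Serialise the matrix block M[top:top+size, left:left+size] into
    `stream`. Interior nodes contribute ten integers: the start offsets
    of their nine children plus the end offset of the last child. Zero
    blocks store nothing, so they appear as two equal consecutive
    offsets. Leaf blocks of size <= min_size store their elements
    densely. The serpentine child order stands in for the Peano order."""
    block = [M[top + i][left + j] for i in range(size) for j in range(size)]
    if all(v == 0 for v in block):
        return stream                      # zero block: store nothing
    if size <= min_size:
        stream.extend(block)               # dense leaf, row-major
        return stream
    s = size // 3
    header_pos = len(stream)
    stream.extend([0] * 10)                # reserve the ten offsets
    children = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1),
                (0, 2), (1, 2), (2, 2)]    # serpentine 3x3 order
    for c, (bi, bj) in enumerate(children):
        stream[header_pos + c] = len(stream)        # child start offset
        build_spacetree(M, top + bi * s, left + bj * s, s, min_size, stream)
    stream[header_pos + 9] = len(stream)            # end of last child
    return stream
```

For a 9 × 9 matrix with non-zeros only on the diagonal, the root node stores ten offsets, followed by the three non-zero 3 × 3 leaf blocks; the six zero blocks cost no storage beyond the repeated offsets.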

The sparsity structure information and the matrix elements of the leaf blocks can either be stored in a single data stream or in two separate streams. If we choose to store the elements together with the structure, we will only need to change the data structure for the leaf blocks. These blocks will then require information on the block size and type (dense or sparse), as well as on the extent of the block in terms of bytes in memory. If matrix elements are stored in a separate stream, the pre-leaf-level nodes need to store respective pointers to the start of the respective blocks in the elements stream.

As already indicated in Sect. 13.3.5, we can also use the presented sparse-matrix data structure for matrices that are only dense (or comparably dense) in certain parts of the matrix – one simple example would be band matrices. In that case, we can further simplify the data structure by allowing only dense matrix blocks in the leaves.

## 13.5 References and Further Readings

The Peano-curve algorithm for matrix multiplication was introduced in [24], where we particularly discussed the cache properties of the algorithm. The locality properties, as given in Eqs. 13.5 and 13.8 are asymptotically optimal – respective lower bounds were proven by Hong and Kung [132]. Hardware-oriented, efficient implementations of the algorithms, including the parallelisation on shared-memory multicore platforms, were presented in [21, 127]. In [19], we discussed the parallelisation of the algorithm using message passing on distributed-memory. There, the result for cache line transfers in the ideal-cache model can be used to estimate the number of transferred matrix blocks in a distributed-memory setting. The extension for sparse matrices was presented in [22], which also includes a discussion of LU-decomposition based on the Peano matrix multiplication.

Since the advent of cache-based architectures, improving the cache efficiency of linear algebra operations has been an active area of research. Blocking approaches to improve the cache-efficiency were already applied in the first level-3 BLAS routines, when introduced by Dongarra et al. [78]. Blocking approaches and the block-oriented matrix operations were also a driving factor in the development of LAPACK [13], whose implementation was consequently based on exploiting the higher performance of BLAS 3 routines. Since then, blocking and tiling of matrix operations has become a standard technique for high performance libraries. The ATLAS project [267] uses automatic tuning of blocking sizes to the available memory hierarchy, and GotoBLAS [104] explicitly considers the *translation lookaside buffer* (TLB) and even virtual memory as further cache levels [103].

Block matrix layouts to improve cache efficiency were introduced by Chatterjee et al. [65], who studied Morton order and 4D tiled arrays (i.e., 2D arrays of matrix blocks), and by Gustavson [116], who demonstrated that recursive blocking automatically leads to cache-efficient algorithms. Frens and Wise [90] used quadtree decompositions of dense matrices and respective recursive implementations of matrix multiplication and of QR decomposition [91]. The term *cache-oblivious* for such inherently cache-efficient algorithms was introduced by Frigo et al. [92]. For a review of both cache-oblivious and cache-aware algorithms in linear algebra, see Elmroth et al. [81]. For further recent work, see, e.g., [106, 255]. Yotov et al. [112, 281] compared the performance of cache-oblivious algorithms with carefully tuned cache-aware approaches, and identified efficient prefetching of matrix blocks as a crucial question for recursive algorithms. For the Peano algorithm, the increment/decrement access to blocks apparently solves this problem. In any case, block-oriented data structures and algorithms are increasingly considered a necessity in the design of linear algebra routines. The designated LAPACK successor PLASMA [56, 57], for example, aims for a stronger block orientation, and similar considerations drive the FLAME project [113]. As libraries will often depend on row-major or column-major storage, changing the matrix storage to blocked layouts on-the-fly becomes necessary; respective in-place format conversions were studied in [115].

For sparse matrices, quadtree data structures were, for example, examined by Wise et al. [271, 272]. Hybrid “hypermatrices” that mix dense and sparse blocks in recursively structured matrices were discussed by Herrero et al. [130]. Haase [117] used a Hilbert-order data structure for sparse-matrix-vector multiplication to improve cache efficiency.

Chapter 14 will present a cache-oblivious approach to solve partial differential equations on adaptive discretisation grids – hence, we save all further references to work on cache oblivious algorithms, be it grid-based or other simulation approaches, for the references section of Chap. 14.

## 13.6 Exercises

Try to determine the number of cache misses caused by Algorithm 13.3, following the same cache model as in Sect. 13.3.4. Focus, in particular, on the misses caused by the access to the vectors *x* and *y*.

Use the graph-based illustration of Eq. (13.2) to derive the execution orders for the matrix multiplication schemes \(Q\,+=\,\mathit{QP}\), \(R\,+=\,\mathit{PR}\), \(S\,+=\,\mathit{QR}\), etc.

Consider an algorithm that uses 2D Morton order to store the matrix elements, and 3D Morton order to execute the individual element operations of matrix multiplication. Analyse the number of cache misses caused by this algorithm, using a cache model and computation as in Sect. 13.3.4.

As a data structure for sparse matrices, we could also refine the sparsity tree up to single elements, i.e., not stop the recursion on larger blocks. Give an estimate on how much memory we will require to store a sparse matrix (make suitable assumptions on the sparsity pattern or sparsity tree, if necessary).


### References

- 1. D. J. Abel and D. M. Mark. A comparative analysis of some two-dimensional orderings. *International Journal of Geographical Information Systems*, 4(1):21–31, 1990.
- 2. D. J. Abel and J. L. Smith. A data structure and algorithm based on a linear key for a rectangle retrieval problem. *Computer Vision, Graphics, and Image Processing*, 24:1–13, 1983.
- 3. K. Abend, T. J. Harley, and L. N. Kanal. Classification of binary random patterns. *IEEE Transactions on Information Theory*, IT-11(4):538–544, 1965.
- 4. M. Aftosmis, M. Berger, and J. Melton. Robust and efficient Cartesian mesh generation for component-based geometry. In *35th Aerospace Sciences Meeting and Exhibit*, 1997. AIAA 1997-0196.
- 5. M. J. Aftosmis, M. J. Berger, and S. M. Murman. Applications of space-filling curves to Cartesian methods for CFD. AIAA Paper 2004-1232, 2004.
- 6. J. Akiyama, H. Fukuda, H. Ito, and G. Nakamura. Infinite series of generalized Gosper space filling curves. In *CJCDGCGT 2005*, volume 4381 of *Lecture Notes in Computer Science*, pages 1–9, 2007.
- 7. I. Al-Furaih and S. Ranka. Memory hierarchy management for iterative graph structures. In *Parallel Processing Symposium, IPPS/SPDP 1998*, pages 298–302. IEEE Computer Society, 1998.
- 8. F. Alauzet and A. Loseille. On the use of space filling curves for parallel anisotropic mesh adaptation. In *Proceedings of the 18th International Meshing Roundtable*, pages 337–357. Springer, 2009.
- 9. J. Alber and R. Niedermeier. On multidimensional curves with Hilbert property. *Theory of Computing Systems*, 33:295–312, 2000.
- 10. S. Aluru and F. E. Sevilgen. Parallel domain decomposition and load balancing using space-filling curves. In *Proceedings of the Fourth International Conference on High-Performance Computing*, pages 230–235. IEEE Computer Society, 1997.
- 11. N. Amenta, S. Choi, and G. Rote. Incremental constructions con BRIO. In *Proceedings of the Nineteenth ACM Symposium on Computational Geometry*, pages 211–219, 2003.
- 12. M. Amor, F. Arguello, J. López, O. Plata, and E. L. Zapata. A data-parallel formulation for divide and conquer algorithms. *The Computer Journal*, 44(4):303–320, 2001.
- 13. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: a portable linear algebra library for high-performance computers. Technical Report CS-90-105, LAPACK Working Note #20, University of Tennessee, Knoxville, TN, 1990.
- 14. J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units.
*Journal of Computational Physics*, 227(10):5342–5359, 2008.Google Scholar - 15.A. Ansari and A. Fineberg. Image data ordering and compression using Peano scan and LOT.
*IEEE Transactions on Consumer Electronics*, 38(3):436–445, 1992.Google Scholar - 16.L. Arge, M. De Berg, H. Haverkort, and K. Yi. The priority R-tree: A practically efficient and worst-case optimal R-tree.
*ACM Transactions on Algorithms*, 4(1):9:1–9:29, 2008.Google Scholar - 17.D. N. Arnold, A. Mukherjee, and L. Pouly. Locally adapted tetrahedral meshes using bisection.
*SIAM Journal on Scientific Computing*, 22(2):431–448, 2000.Google Scholar - 18.T. Asano, D. Ranjan, T. Roos, E. Welzl, and P. Widmayer. Space-filling curves and their use in the design of geometric data structures.
*Theoretical Computer Science*, 181(1):3–15, 1997.Google Scholar - 19.M. Bader. Exploiting the locality properties of Peano curves for parallel matrix multiplication. In
*Proceedings of the Euro-Par 2008*, volume 5168 of*Lecture Notes in Computer Science*, pages 801–810, 2008.Google Scholar - 20.M. Bader, C. Böck, J. Schwaiger, and C. Vigh. Dynamically adaptive simulations with minimal memory requirement – solving the shallow water equations using Sierpinksi curves.
*SIAM Journal on Scientific Computing*, 32(1):212–228, 2010.Google Scholar - 21.M. Bader, R. Franz, S. Guenther, and A. Heinecke. Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves. In
*Parallel Processing and Applied Mathematics, 7th International Conference, PPAM 2007*, volume 4967 of*Lecture Notes in Computer Science*, pages 628–638, 2008.Google Scholar - 22.M. Bader and A. Heinecke. Cache oblivious dense and sparse matrix multiplication based on Peano curves. In
*Proceedings of the PARA ’08, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing*, volume 6126/6127 of*Lecture Notes in Computer Science*, 2010. In print.Google Scholar - 23.M. Bader, S. Schraufstetter, C. A Vigh, and J. Behrens. Memory efficient adaptive mesh generation and implementation of multigrid algorithms using Sierpinski curves.
*International Journal of Computational Science and Engineering*, 4(1):12–21, 2008.Google Scholar - 24.M. Bader and C. Zenger. Cache oblivious matrix multiplication using an element ordering based on a Peano curve.
*Linear Algebra and Its Applications*, 417(2–3):301–313, 2006.Google Scholar - 25.M. Bader and C. Zenger. Efficient storage and processing of adaptive triangular grids using Sierpinski curves. In
*Computational Science – ICCS 2006*, volume 3991 of*Lecture Notes in Computer Science*, pages 673–680, 2006.Google Scholar - 26.Y. Bandou and S.-I. Kamata. An address generator for an n-dimensional pseudo-Hilbert scan in a hyper-rectangular parallelepiped region. In
*International Conference on Image Processing, ICIP 2000*, pages 707–714, 2000.Google Scholar - 27.R. Bar-Yehuda and C. Gotsman. Time/space tradeoffs for polygon mesh rendering.
*ACM Transactions on Graphics*, 15(2):141–152, 1996.Google Scholar - 28.J. Barnes and P. Hut. A hierarchical
*O*(*N*log*N*) force-calculation algorithm.*Nature*, 324:446–449, 1986.Google Scholar - 29.J.J. Bartholdi III and P. Goldsman. Vertex-labeling algorithms for the Hilbert spacefilling curve.
*Software: Practice and Experience*, 31(5):395–408, 2001.Google Scholar - 30.J. J. Bartholdi III and P. Goldsman. Multiresolution indexing of triangulated irregular networks.
*IEEE Transactions on Visualization and Computer Graphics*, 10(3):1–12, 2004.Google Scholar - 31.J. J. Bartholdi III and L. K. Platzman. An
*O*(*N*log*N*) planar travelling salesman heuristic based on spacefilling curves.*Operations Research Letters*, 1(4):121–125, 1982.Google Scholar - 32.J. J. Bartholdi III and L. K. Platzman. Heuristics based on spacefilling curves for combinatorial problems in the Euclidian space.
*Management Science*, 34(3):291–305, 1988.Google Scholar - 33.A. C. Bauer and A. K. Patra. Robust and efficient domain decomposition preconditioners for adaptive hp finite element approximations of linear elasticity with and without discontinuous coefficients.
*International Journal for Numerical Methods in Engineering*, 59:337–364, 2004.Google Scholar - 34.K.E. Bauman. The dilation factor of the Peano–Hilbert curve.
*Mathematical Notes*, 80(5):609–620, 2006. Translated from Matematicheskie Zametki, vol. 80, no. 5, 2006, pp. 643–656. Original Russian Text Copyright 2006 by K. E. Bauman.Google Scholar - 35.R. Bayer. The universal B-tree for multidimensional indexing. Technical Report TUM-I9637, Institut für Informatik, Technische Universität München, 1996.Google Scholar
- 36.J. Behrens. Multilevel optimization by space-filling curves in adaptive atmospheric modeling. In F. Hülsemann, M. Kowarschik, and U. Rüde, editors,
*Frontiers in Simulation - 18th Symposium on Simulation Techniques*, pages 186–196. SCS Publishing House, Erlangen, 2005.Google Scholar - 37.J. Behrens.
*Adaptive atmospheric modeling: key techniques in grid generation, data structures, and numerical operations with applications*, volume 54 of*Lecture Notes in Computational Science and Engineering*. Springer, 2006.Google Scholar - 38.J. Behrens and M. Bader. Efficiency considerations in triangular adaptive mesh refinement.
*Philosophical Transactions of the Royal Society A*, 367:4577–4589, 2009. Theme Issue ’Mesh generation and mesh adaptation for large-scale Earth-system modelling’.Google Scholar - 39.J. Behrens, N. Rakowsky, W. Hiller, D. Handorf, M. Läuter, J. Päpke, and K. Dethloff. Parallel adaptive mesh generator for atmospheric and oceanic simulation.
*Ocean Modelling*, 10:171–183, 2005.Google Scholar - 40.J. Behrens and J. Zimmermann. Parallelizing an unstructured grid generator with a space-filling curve approach. In
*Euro-Par 2000 Parallel Processing*, volume 1900 of*Lecture Notes in Computer Science*, pages 815–823, 2000.Google Scholar - 41.D. Bertsimas and M. Grigni. Worst-case example for the spacefilling curve heuristics for the Euclidian traveling salesman problem.
*Operations Research Letters*, 8:241–244, 1989.Google Scholar - 42.T. Bially. Space-filling curves: Their generation and their application to bandwidth reduction.
*IEEE Transactions on Information Theory*, IT-15(6):658–664, 1969.Google Scholar - 43.E. Bänsch. Local mesh refinement in 2 and 3 dimensions.
*IMPACT of Computing in Science and Engineering*, 3(3):181–191, 1991.Google Scholar - 44.A. Bogomjakov and C. Gotsman. Universal rendering sequences for transparent vertex caching of progressive meshes.
*Computer Graphics forum*, 21(2):137–148, 2002.Google Scholar - 45.E. Borel.
*Elements de la Theorie des Ensembles.*Editions Albin Michel, Paris, 1949. Note IV: La courbe de Péano.Google Scholar - 46.V. Brázdová and D. R. Bowler. Automatic data distribution and load balancing with space-filling curves: implementation in CONQUEST.
*Journal of Physics: Condensed Matter*, 20, 2008.Google Scholar - 47.G. Breinholt and Ch. Schierz. Algorithm 781: Generating Hilbert’s space-filling curve by recursion.
*ACM Transactions on Mathematical Software*, 24(2):184–189, 1998.Google Scholar - 48.M. Brenk, H.-J. Bungartz, M. Mehl, I. L. Muntean, T. Neckel, and T. Weinzierl. Numerical simulation of particle transport in a drift ratchet.
*SIAM Journal on Scientific Computing*, 30(6):2777–2798, 2008.Google Scholar - 49.K. Brix, S. Sorana Melian, S. Müller, and G. Schieffer. Parallelisation of multiscale-based grid adaption using space-filling curves.
*ESAIM: Proceedings*, 29:108–129, 2009.Google Scholar - 50.K. Buchin. Constructing Delauney triangulations along space-filling curves. In
*ESA 2009*, volume 5757, pages 119–130, 2009.Google Scholar - 51.E. Bugnion, T. Roos, R. Wattenhofer, and P. Widmayer. Space filling curves versus random walks. In
*Algorithmic Foundations of Geographic Information Systems*, volume 1340, pages 199–211, 1997.Google Scholar - 52.H.-J. Bungartz, M. Mehl, T. Neckel, and T. Weinzierl. The PDE framework Peano applied to fluid dynamics: an efficient implementation of a parallel multiscale fluid dynamics solver on octree-like adaptive Cartesian grids.
*Computational Mechanics*, 46(1):103–114, 2010.Google Scholar - 53.H.-J. Bungartz, M. Mehl, and T. Weinzierl. A parallel adaptive Cartesian PDE solver using space-filling curves. In E.W. Nagel, V.W. Walter, and W. Lehner, editors,
*Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference*, volume 4128 of*Lecture Notes in Computer Science*, pages 1064–1074, 2006.Google Scholar - 54.C. Burstedde, O. Ghattas, M. Gurnis, G. Stadler, E. Tan, T. Tu, L. C. Wilcox, and S. Zhong. Scalable adaptive mantle convection simulation on petascale supercomputers. In
*SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing*, pages 1–15. IEEE Press, 2008.Google Scholar - 55.C. Burstedde, L. C. Wilcox, and O. Ghattas. p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees.
*SIAM Journal on Scientific Computing*, 33(3):1103–1133, 2011.Google Scholar - 56.A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The impact of multicore on math software. In
*Applied Parallel Computing, State of the Art in Scientific Computing*, volume 4699 of*Lecture Notes in Computer Science*, pages 1–10, 2007.Google Scholar - 57.A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS-07-600, LAPACK Working Note #191, ICL, University Tennessee, 2007.Google Scholar
- 58.A. R. Butz. Convergence with Hilbert’s space filling curve.
*Journal of Computer and System Sciences*, 3:128–146, 1969.Google Scholar - 59.A. R. Butz. Alternative algorithm for Hilbert’s space-filling curve.
*IEEE Transactions on Computers*, C-20(4):424–426, 1971.Google Scholar - 60.A. C. Calder, B. C. Curtis, L. J. Dursi, B. Fryxell, P. MacNeice, K. Olson, P. Ricker, R. Rosner, F. X. Timmes, H. M. Tufo, J. W. Turan, M. Zingale, and G. Henry. High performance reactive fluid flow simulations using adaptive mesh refinement on thousands of processors. In
*Supercomputing ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing*, page 56. IEEE Computer Society, 2000.Google Scholar - 61.P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Terescoy. Dynamic octree load balancing using space-filling curves. Technical Report CS-03-01, Williams College Department of Computer Science, 2003.Google Scholar
- 62.X. Cao and Z. Mo. A new scalable parallel method for molecular dynamics based on cell-block data structure. In
*Parallel and Distributed Processing and Applications*, 3358, pages 757–764, 2005.Google Scholar - 63.J. Castro, M. Georgiopoulos, R. Demara, and A. Gonzalez. Data-partitioning using the hilbert space filling curves: Effect on the speed of convergence of Fuzzy ARTMAP for large database problems.
*Neural Networks*, 18:967–984, 2005.Google Scholar - 64.E. Cesaro. Remarques sur la courbe de von Koch.
*Atti della R. Accad. della Scienze fisiche e matem. Napoli*, 12(15):1–12, 1905. Reprinted in: Opere scelte, a cura dell’Unione matematica italiana e col contributo del Consiglio nazionale delle ricerche, Vol. 2: Geometria, analisi, fisica matematica. Rome: Edizioni Cremonese, pp. 464-479, 1964.Google Scholar - 65.S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In
*International Conference on Supercomputing (ICS’99)*. ACM, New York, 1999.Google Scholar - 66.H.-L. Chen and Y.-I. Chang. Neighbor-finding based on space-filling curves.
*Information Systems*, 30(3):205–226, 2005.Google Scholar - 67.G. Chochia, M. Cole, and T. Heywood. Implementing the hierarchical PRAM on the 2d mesh: analyses and experiments.
*IEEE Symposium on Parallel and Distributed Processing*, 0:587–595, 1995. Preprint on http://homepages.inf.ed.ac.uk/mic/Pubs/ECS-CSG-10-95.ps.gz. - 68.W. J. Coirier and K. G. Powell. Solution-adaptive Cartesian cell approach for viscous and inviscid fluid flows.
*AIAA Journal*, 34(5):938–945, 1996.Google Scholar - 69.A. J. Cole. A note on space-filling curves.
*Software: Practice and Experience*, 13:1181–1189, 1983.Google Scholar - 70.S. Craver, B.-L. Yeo, and M. Yeung. Multilinearization data structure for image browsing. In
*Storage and Retrieval for Image and Video Databases VII*, volume 3656, pages 155–166, 1998.Google Scholar - 71.R. Dafner, D. Cohen-Or, and Y. Matias. Context-based space filling curves.
*Computer Graphics Forum*, 19(3):209–218, 2000.Google Scholar - 72.K. Daubner. Geometrische Modellierung mittels Oktalbäumen und Visualisierung von Simulationsdaten aus der Strömungsmechanik. Institut für Informatik, Technische Universität München, 2005.Google Scholar
- 73.L. De Floriani and E. Puppo. Hierarchical triangulation for multiresolution surface description.
*ACM Transactions on Graphics*, 14(4):363–411, 1995.Google Scholar - 74.J. M. Dennis. Partitioning with space-filling curves on the cubed-sphere. In
*Proceedings of the 17th International Symposium on Parallel and Distributed Processing*, pages 269–275. IEEE Computer Society, 2003.Google Scholar - 75.J. M. Dennis. Inverse space-filling curve partitioning of a global ocean model. In
*Parallel and Distributed Processing Symposium, IPDPS 2007*, pages 1–10. IEEE International, 2007.Google Scholar - 76.K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Terescoy, J. Falk, J. E. Flaherty, and L. G. Gervasio. New challenges in dynamic load balancing.
*Applied Numerical Mathematics*, 52:133–152, 2005.Google Scholar - 77.R. Diekmann, J. Hungershöfer, M. Lux, M. Taenzer, and J.-M. Wierum. Using space filling curves for efficient contact searching. In
*16th IMACS World Congress*, 2000.Google Scholar - 78.J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic linear algebra subprograms.
*ACM Transactions on Mathematical Software*, 16(1):1–28, 1990.Google Scholar - 79.J. Dreher and R. Grauer. Racoon: A parallel mesh-adaptive framework for hyperbolic conservation laws.
*Parallel Computing*, 31(8–9):913–932, 2005.Google Scholar - 80.M. Duchaineau, M. Wolinsky, D. E. Sigeti, M. C. Miller, C. Aldrich, and M. B. Mineev-Weinstein. ROAMing terrain: real-time optimally adapting meshes. In
*VIS ’97: Proceedings of the 8th Conference on Visualization ’97*, pages 81–88. IEEE Computer Society Press, 1997.Google Scholar - 81.E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software.
*SIAM Review*, 46(1):3–45, 2004.Google Scholar - 82.M. Elshafei and M. S. Ahmed. Fuzzification using space-filling curves.
*Intelligent Automation and Soft Computing*, 7(2):145–157, 2001.Google Scholar - 83.M. Elshafei-Ahmed. Fast methods for split codebooks.
*Signal Processing*, 80:2553–2565, 2000.Google Scholar - 84.W. Evans, D. Kirkpatrick, and G. Townsend. Right-triangulated irregular networks.
*Algorithmica*, 30(2):264–286, 2001.Google Scholar - 85.C. Faloutsos. Analytical results on the quadtree decomposition of arbitrary rectangles.
*Pattern Recognition Letters*, 13:31–40, 1992.Google Scholar - 86.C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In
*Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems*, pages 247–252, 1989.Google Scholar - 87.R. Finkel and Bentley J. L. Quad trees: a data structure for retrieval on composite keys.
*Acta Informatica*, 4(1):1–9, 1974.Google Scholar - 88.J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Terescoy, and L. H. Ziantz. Adaptive local refinement with octree load balancing for the parallel solution of three-dimensional conservation laws.
*Journal of Parallel and Distributed Computing*, 47:139–152, 1997.Google Scholar - 89.A. C. Frank.
*Organisationsprinzipien zur Integration von geometrischer Modellierung, numerischer Simulation und Visualisierung*. Herbert Utz Verlag, Dissertation, Institut für Informatik, Technische Universität München, 2000.Google Scholar - 90.J. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In
*Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming*, pages 206–216, 1997.Google Scholar - 91.J. Frens and D. S. Wise. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. In
*Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming*, pages 144–154, 2003.Google Scholar - 92.M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In
*Proceedings of the 40th Annual Symposium on Foundations of Computer Science*, pages 285–297, 1999.Google Scholar - 93.H. Fukuda, M. Shimizu, and G. Nakamura. New Gosper space filling curves. In
*Proceedings of the International Conference on Computer Graphics and Imaging (CGIM2001)*, pages 34–38, 2001.Google Scholar - 94.J. Gao and J. M. Steele. General spacefilling curve heuristics and limit theory for the traveling salesman problem.
*Journal of Complexity*, 10:230–245, 1994.Google Scholar - 95.M. Gardner. Mathematical games – in which “monster” curves force redefinition of the word “curve”.
*Scientific American*, 235:124–133, Dec. 1976.Google Scholar - 96.I. Gargantini. An effective way to represent quadtrees.
*Communications of the ACM*, 25(12):905–910, 1982.Google Scholar - 97.T. Gerstner. Multiresolution Compression and Visualization of Global Topographic Data.
*GeoInformatica*, 7(1):7–32, 2003. (shortened version in Proc. Spatial Data Handling 2000, P. Forer, A.G.O. Yeh, J. He (eds.), pp. 14–27, IGU/GISc, 2000, also as SFB 256 report 29, Univ. Bonn, 1999).Google Scholar - 98.P. Gibbon, W. Frings, S. Dominiczak, and B. Mohr. Performance analysis and visualization of the n-body tree code PEPC on massively parallel computers. In
*Parallel Computing: Current & Future Issues of High-End Computing, Proceedings of the International Conference ParCo 2005*, volume 33 of*NIC Series*, pages 367–374, 2006.Google Scholar - 99.W. Gilbert. A cube-filling Hilbert curve.
*The Mathematical Intelligencer*, 6(3):78–79, 1984.Google Scholar - 100.J. Gips.
*Shape Grammars and their Uses*. Interdisciplinary Systems Research. Birkhäuser Verlag, 1975.Google Scholar - 101.L. M. Goldschlager. Short algorithms for space-filling curves.
*Software: Practice and Experience*, 11:99–100, 1981.Google Scholar - 102.M. F. Goodchild and D. M. Mark. The fractal nature of geographic phenomena.
*Annals of the Association of American Geographers*, 77(2):265–278, 1987.Google Scholar - 103.K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. FLAME working note #9. Technical Report TR-2002-55, The University of Texas at Austin, Department of Computer Sciences, 2002.Google Scholar
- 104.K. Goto and R. A. van de Geijn. Anatomy of a high-performance matrix multiplication.
*ACM Transactions on Mathematical Software*, 34(3):12:1–12:25, 2008.Google Scholar - 105.C. Gotsman and M. Lindenbaum. On the metric properties of space-filling curves.
*IEEE Transactions on Image Processing*, 5(5):794–797, 1996.Google Scholar - 106.P. Gottschling, D. S. Wise, and A. Joshi. Generic support of algorithmic and structural recursion for scientific computing.
*International Journal of Parallel, Emergent and Distributed Systems*, 24(6):479–503, 2009.Google Scholar - 107.M. Griebel and M. A. Schweitzer. A particle-partition of unity method—part IV: Parallelization. In
*Meshfree Methods for Partial Differential Equations*, volume 26 of*Lecture Notes in Computational Science and Engineering*, pages 161–192, 2002.Google Scholar - 108.M. Griebel and G. Zumbusch. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves.
*Parallel Computing*, 27(7):827–843, 1999.Google Scholar - 109.M. Griebel and G. Zumbusch. Hash based adaptive parallel multilevel methods with space-filling curves. In H. Rollnik and D. Wolf, editors,
*NIC Symposium 2001*, volume 9 of*NIC Series*, pages 479–492. Forschungszentrum Jülich, 2002.Google Scholar - 110.J. G. Griffiths. Table-driven algorithm for generating space-filling curves.
*Computer Aided Design*, 17(1):37–41, 1985.Google Scholar - 111.J. G. Griffiths. An algorithm for displaying a class of space-filling curves.
*Software: Practice and Experience*, 16(5):403–411, 1986.Google Scholar - 112.J. Gunnels, F. Gustavson, K. Pingali, and K. Yotov. Is cache-oblivious DGEMM viable? In
*Applied Parallel Computing. State of the Art in Scientific Computing*, volume 4699 of*Lecture Notes in Computer Science*, pages 919–928, 2007.Google Scholar - 113.J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: formal linear algebra methods environment.
*ACM Transactions on Mathematical Software*, 27(4):422–455, 2001.Google Scholar - 114.F. Günther, M. Mehl, M. Pögl, and C. Zenger. A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves.
*SIAM Journal on Scientific Computing*, 28(5):1634–1650, 2006.Google Scholar - 115.F. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion.
*Transactions on Mathematical Software*, 38(3):17:1–17:32, 2012.Google Scholar - 116.F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms.
*IBM Journal of Research and Development*, 41(6), 1999.Google Scholar - 117.G. Haase, M. Liebmann, and G. Plank. A Hilbert-order multiplication scheme for unstructured sparse matrices.
*International Journal of Parallel, Emergent and Distributed Systems*, 22(4):213–220, 2007.Google Scholar - 118.C. H. Hamilton and A. Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths.
*Information Processing Letters*, 105:155–163, 2008.Google Scholar - 119.H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes.
*IEEE Transactions on Parallel and Distributed Systems*, 17(7):606–618, 2006.Google Scholar - 120.J. Hartmann, A. Krahnke, and C. Zenger. Cache efficient data structures and algorithms for adaptive multidimensional multilevel finite element solvers.
*Applied Numerical Mathematics*, 58(4):435–448, 2008.Google Scholar - 121.A. Haug.
*Sierpinski-Kurven zur speichereffizienten numerischen Simulation auf adaptiven Tetraedergittern*. Diplomarbeit, Fakultät für Informatik, Technische Universität München, 2006.Google Scholar - 122.H. Haverkort and F. van Walderveen. Four-dimensional Hilbert curves for R-trees. In
*Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX)*, pages 63–73, 2009.Google Scholar - 123.H. Haverkort and F. van Walderveen. Locality and bounding-box quality of two-dimensional space-filling curves.
*Computational Geometry: Theory and Applications*, 43(2):131–147, 2010.Google Scholar - 124.G. Heber, R. Biswas, and G. R. Gao. Self-avoiding walks over adaptive unstructured grids.
*Concurrency: Practice and Experience*, 12:85–109, 2000.Google Scholar - 125.D. J. Hebert. Symbolic local refinement of tetrahedral grids.
*Journal of Symbolic Computation*, 17(5):457–472, 1994.Google Scholar - 126.D. J. Hebert. Cyclic interlaced quadtree algorithms for quincunx multiresolution.
*Journal of Algorithms*, 27:97–128, 1998.Google Scholar - 127.A. Heinecke and M. Bader. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. In
*Proceedings of the 2008 Computing Frontiers Conference and co-located workshops: MAW’08 and WRFT’08*, pages 385–392. ACM, 2008.Google Scholar - 128.B. Hendrickson. Load balancing fictions, falsehoods and fallacies.
*Applied Mathematical Modelling*, 25:99–108, 2000.Google Scholar - 129.B. Hendrickson and K. Devine. Dynamic load balancing in computational mechanics.
*Computer Methods in Applied Mechanical Engineering*, 184:485–500, 2000.Google Scholar - 130.J. R. Herrero and J. J. Navarro. Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking.
*Applicable Algebra in Engineering, Communication and Computing*, 18(3):279–295, 2007.Google Scholar - 131.D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück.
*Mathematische Annalen*, 38:459–460, 1891. Available online on the Göttinger Digitalisierungszentrum.Google Scholar - 132.J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebble game. In
*Proceedings of ACM Symposium on Theory of Computing*, pages 326–333, 1981.Google Scholar - 133.H. Hoppe. Optimization of mesh locality for transparent vertex caching. In
*SIGGRAPH ’99: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques*, pages 269–276. ACM Press/Addison-Wesley Publishing Co., 1999.Google Scholar - 134.Y. C. Hu, A. Cox, and W. Zwaenepoel. Improving fine-grained irregular shared-memory benchmarks by data reordering. In
*Proceedings of the 2000 ACM/IEEE Conference on Supercomputing*, pages # 33. IEEE Computer Society, 2000.Google Scholar - 135.J. Hungershöfer and J.-M. Wierum. On the quality of partitions based on space-filling curves. In
*ICCS 2002*, volume 2331 of*Lecture Notes in Computer Science*, pages 36–45, 2002.Google Scholar - 136.G. M. Hunter and K. Steiglitz. Operations on images using quad trees.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, PAMI-1(2):145–154, 1979.Google Scholar - 137.L. M. Hwa, M. A. Duchaineau, and K. i. Joy. Adaptive 4-8 texture hierarchies. In
*VIS ’04: Proceedings of the Conference on Visualization ’04*, pages 219–226. IEEE Computer Society, 2004.Google Scholar - 138.C. Jackings and S. L. Tanimoto. Octrees and their use in representing three-dimensional objects.
*Computer Graphics and Image Processing*, 14(31):249–270, 1980.Google Scholar - 139.H. V. Jagadish. Linear clustering of objects with multiple attributes.
*ACM SIGMOD Record*, 19(2):332–342, 1990.Google Scholar - 140.H. V. Jagadish. Analysis of the Hilbert curve for representing two-dimensional space.
*Information Processing Letters*, 62(1):17–22, 1997.Google Scholar - 141.G. Jin and J. Mellor-Crummey. SFCGen: a framework for efficient generation of multi-dimensional space-filling curves by recursion.
*ACM Transactions on Mathematical Software*, 31(1):120–148, 2005.Google Scholar - 142.Bentley J. L. and D. F. Stanat. Analysis of range searches in quad trees.
*Information Processing Letters*, 3(6):170–173, 1975.Google Scholar - 143.M. Kaddoura, C.-W. Ou, and S. Ranka. Partitioning unstructured computational graphs for nonuniform and adaptive environments.
*IEEE Concurrency*, 3(3):63–69, 1995.Google Scholar - 144.C. Kaklamanis and G. Persiano. Branch-and-bound and backtrack search on mesh-connected arrays of processors.
*Mathematical Systems Theory*, 27:471–489, 1994.Google Scholar - 145.S. Kamata, R. O. Eason, and Y. Bandou. A new algorithm for
*n*-dimensional Hilbert scanning.*IEEE Transactions on Image Processing*, 8(7):964–973, 1999.Google Scholar - 146.I. Kamel and C. Faloutsos. On packing R-trees. In
*Proceedings of the Second International ACM Conference on Information and Knowledge Management*, pages 490–499. ACM New York, 1993.Google Scholar - 147.E. Kawaguchi and T. Endo. On a method of binary-picture representation and its application to data compression.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, PAMI-2(1):27–35, 1980.Google Scholar - 148.A. Klinger. Data structures and pattern recognition. In
*Proceedings of the First International Joint Conference on Pattern Recognition*, pages 497–498. IEEE, 1973.Google Scholar - 149.A. Klinger and C. R. Dyer. Experiments on picture representation using regular decomposition.
*Computer Graphics and Image Processing*, 5:68–106, 1976.Google Scholar - 150.K. Knopp. Einheitliche Erzeugung und Darstellung der Kurven von Peano, Osgood und von Koch.
*Archiv der Mathematik und Physik*, 26:103–115, 1917.
- 151. I. Kossaczky. A recursive approach to local mesh refinement in two and three dimensions. *Journal of Computational and Applied Mathematics*, 55:275–288, 1994.
- 152. R. Kriemann. Parallel *ℋ*-matrix arithmetics on shared memory systems. *Computing*, 74:273–297, 2005.
- 153. J. P. Lauzon, D. M. Mark, L. Kikuchi, and J. A. Guevara. Two-dimensional run-encoding for quadtree representation. *Computer Vision, Graphics, and Image Processing*, 30(1):56–69, 1985.
- 154. J. K. Lawder and P. J. H. King. Querying multi-dimensional data indexed using the Hilbert space-filling curve. *ACM SIGMOD Record*, 30(1):19–24, 2001.
- 155. J. K. Lawder and P. J. H. King. Using state diagrams for Hilbert curve mappings. *International Journal of Computer Mathematics*, 78:327–342, 2001.
- 156. D. Lea. Digital and Hilbert k-d-trees. *Information Processing Letters*, 27:35–41, 1988.
- 157. H. L. Lebesgue. *Leçons sur l’intégration et la recherche des fonctions primitives*. Gauthier-Villars, Paris, 1904. Available online on the University of Michigan Historical Math Collection.
- 158. J.-H. Lee and Y.-C. Hsueh. Texture classification method using multiple space filling curves. *Pattern Recognition Letters*, 15:1241–1244, 1994.
- 159. M. Lee and H. Samet. Navigating through triangle meshes implemented as linear quadtrees. *ACM Transactions on Graphics*, 19(2):79–121, 2000.
- 160. A. Lempel and J. Ziv. Compression of two-dimensional data. *IEEE Transactions on Information Theory*, IT-32(1):2–8, 1986.
- 161. S. Liao, M. A. Lopez, and S. T. Leutenegger. High dimensional similarity search with space filling curves. In *Proceedings of the 17th International Conference on Data Engineering*, pages 615–622. IEEE Computer Society, 2000.
- 162. A. Lindenmayer. Mathematical models for cellular interactions in development. *Journal of Theoretical Biology*, 18:280–299, 1968.
- 163. P. Lindstrom, D. Koller, W. Ribarsky, L. F. Hodges, N. Faust, and G. A. Turner. Real-time, continuous level of detail rendering of height fields. In *SIGGRAPH ’96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques*, pages 109–118. ACM, 1996.
- 164. P. Lindstrom and V. Pascucci. Terrain simplification simplified: A general framework for view-dependent out-of-core visualization. Technical Report UCRL-JC-147847, 2002.
- 165. A. Liu and B. Joe. On the shape of tetrahedra from bisection. *Mathematics of Computation*, 63:141–154, 1994.
- 166. A. Liu and B. Joe. Quality local refinement of tetrahedral meshes based on bisection. *SIAM Journal on Scientific Computing*, 16:1269–1291, 1995.
- 167. P. Liu and S. N. Bhatt. Experiences with parallel n-body simulation. *IEEE Transactions on Parallel and Distributed Systems*, 11(12):1306–1323, 2000.
- 168. X. Liu. Four alternative patterns of the Hilbert curve. *Applied Mathematics and Computation*, 147:741–752, 2004.
- 169. X. Liu and G. Schrack. Encoding and decoding the Hilbert order. *Software: Practice and Experience*, 26(12):1335–1346, 1996.
- 170. X. Liu and G. F. Schrack. A new ordering strategy applied to spatial data processing. *International Journal of Geographical Information Science*, 12(1):3–22, 1998.
- 171. Y. Liu and J. Snoeyink. A comparison of five implementations of 3d Delaunay tessellation. *Combinatorial and Computational Geometry*, 52:439–458, 2005.
- 172. J. Luitjens, M. Berzins, and T. Henderson. Parallel space-filling curve generation through sorting. *Concurrency and Computation: Practice and Experience*, 19:1387–1402, 2007.
- 173. G. Mainar-Ruiz and J.-C. Perez-Cortes. Approximate nearest neighbor search using a single space-filling curve and multiple representations of the data points. In *18th International Conference on Pattern Recognition, 2006 – ICPR 2006*, pages 502–505, 2006.
- 174. B. Mandelbrot. *The Fractal Geometry of Nature*. Freeman and Company, 1977, 1982, 1983.
- 175. Y. Matias and A. Shamir. A video scrambling technique based on space filling curves. In *Advances in Cryptology – CRYPTO ’87*, volume 293, pages 398–417, 1987.
- 176. J. M. Maubach. Local bisection refinement for *n*-simplicial grids generated by reflection. *SIAM Journal on Scientific Computing*, 16(1):210–227, 1995.
- 177. J. M. Maubach. Space-filling curves for 2-simplicial meshes created with bisections and reflections. *Applications of Mathematics*, 3:309–321, 2005.
- 178. D. Meagher. Geometric modelling using octree encoding. *Computer Graphics and Image Processing*, 19:129–147, 1980.
- 179. D. Meagher. Octree encoding: A new technique for the representation, manipulation and display of arbitrary 3d objects by computer. Technical report, Image Processing Laboratory, Rensselaer Polytechnic Institute, 1980.
- 180. M. Mehl, M. Brenk, H.-J. Bungartz, K. Daubner, I. L. Muntean, and T. Neckel. An Eulerian approach for partitioned fluid-structure simulations on Cartesian grids. *Computational Mechanics*, 43(1):115–124, 2008.
- 181. M. Mehl, T. Neckel, and P. Neumann. Navier-Stokes and lattice-Boltzmann on octree-like grids in the Peano framework. *International Journal for Numerical Methods in Fluids*, 65(1):67–86, 2010.
- 182. M. Mehl, T. Neckel, and T. Weinzierl. Concepts for the efficient implementation of domain decomposition approaches for fluid-structure interactions. In U. Langer, M. Discacciati, D. E. Keyes, O. B. Widlund, and W. Zulehner, editors, *Domain Decomposition Methods in Science and Engineering XVII*, volume 60 of *Lecture Notes in Computational Science and Engineering*, 2008.
- 183. M. Mehl, T. Weinzierl, and C. Zenger. A cache-oblivious self-adaptive full multigrid method. *Numerical Linear Algebra with Applications*, 13(2–3):275–291, 2006.
- 184. J. Mellor-Crummey, D. Whalley, and K. Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. *International Journal of Parallel Programming*, 29(3):217–247, 2001.
- 185. N. Memon, D. L. Neuhoff, and S. Shende. An analysis of some common scanning techniques for lossless image coding. *IEEE Transactions on Image Processing*, 9(11):1837–1848, 2000.
- 186. R. Miller and Q. F. Stout. Mesh computer algorithms for computational geometry. *IEEE Transactions on Computers*, 38(3):321–340, 1989.
- 187. W. F. Mitchell. Adaptive refinement for arbitrary finite-element spaces with hierarchical bases. *Journal of Computational and Applied Mathematics*, 36:65–78, 1991.
- 188. W. F. Mitchell. Hamiltonian paths through two- and three-dimensional grids. *Journal of Research of the National Institute of Standards and Technology*, 110(2):127–136, 2005.
- 189. W. F. Mitchell. A refinement-tree based partitioning method for dynamic load balancing with adaptively refined grids. *Journal of Parallel and Distributed Computing*, 67(4):417–429, 2007.
- 190. G. Mitchison and R. Durbin. Optimal numberings of an *N* × *N* array. *SIAM Journal on Algebraic and Discrete Methods*, 7(4):571–582, 1986.
- 191. B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. *IEEE Transactions on Knowledge and Data Engineering*, 13(1):124–141, 2001.
- 192. A. Mooney, J. G. Keating, and D. M. Heffernan. A detailed study of the generation of optically detectable watermarks using the logistic map. *Chaos, Solitons and Fractals*, 30(5):1088–1097, 2006.
- 193. E. H. Moore. On certain crinkly curves. *Transactions of the American Mathematical Society*, 1(1):72–90, 1900. Available online on JSTOR.
- 194. G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, 1966.
- 195. R. D. Nair, H.-W. Choi, and H. M. Tufo. Computational aspects of a scalable high-order discontinuous Galerkin atmospheric dynamical core. *Computers & Fluids*, 38(2):309–319, 2009.
- 196. R. Niedermeier, K. Reinhardt, and P. Sanders. Towards optimal locality in mesh-indexings. *Discrete Applied Mathematics*, 117:211–237, 2002.
- 197. M. G. Norman and P. Moscato. The Euclidean traveling salesman problem and a space-filling curve. *Chaos, Solitons & Fractals*, 6:389–397, 1995.
- 198. A. Null. Space-filling curves, or how to waste time with a plotter. *Software: Practice and Experience*, 1:403–410, 1971.
- 199. Y. Ohno and K. Ohyama. A catalog of non-symmetric self-similar space-filling curves. *Journal of Recreational Mathematics*, 23(4):247–254, 1991.
- 200. Y. Ohno and K. Ohyama. A catalog of symmetric self-similar space-filling curves. *Journal of Recreational Mathematics*, 23(3):161–174, 1991.
- 201. M. A. Oliver and N. E. Wiseman. Operations on quadtree encoded images. *The Computer Journal*, 26(1):83–91, 1983.
- 202. J. A. Orenstein. Spatial query processing in an object-oriented database system. *ACM SIGMOD Record*, 15(2):326–336, 1986.
- 203. J. A. Orenstein and F. A. Manola. PROBE spatial data modeling and query processing in an image database application. *IEEE Transactions on Software Engineering*, 14(5):611–629, 1988.
- 204. J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In *Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems*, pages 181–190. ACM, 1984.
- 205. C.-W. Ou, M. Gunwani, and S. Ranka. Architecture-independent locality-improving transformations of computational graphs embedded in k-dimensions. In *ICS ’95: Proceedings of the 9th International Conference on Supercomputing*, pages 289–298, 1995.
- 206. C.-W. Ou, S. Ranka, and G. Fox. Fast and parallel mapping algorithms for irregular problems. *Journal of Supercomputing*, 10:119–140, 1996.
- 207. R. Pajarola. Large scale terrain visualization using the restricted quadtree triangulation. In *VIS ’98: Proceedings of the Conference on Visualization ’98*, pages 19–26. IEEE Computer Society Press, 1998.
- 208. S. Papadomanolakis, A. Ailamaki, J. C. Lopez, T. Tu, D. R. O’Hallaron, and G. Heber. Efficient query processing on unstructured tetrahedral meshes. In *SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data*, pages 551–562, 2006.
- 209. M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowski. A common data management infrastructure for parallel adaptive algorithms for PDE solutions. In *Proceedings of the 1997 ACM/IEEE Conference on Supercomputing*, pages 1–22. ACM Press, 1997.
- 210. M. Parashar and J. C. Browne. On partitioning dynamic adaptive grid hierarchies. In *Proceedings of the 29th Annual Hawaii International Conference on System Sciences*, pages 604–613, 1996.
- 211. V. Pascucci. Isosurface computation made simple: Hardware acceleration, adaptive refinement and tetrahedral stripping. In *Joint Eurographics-IEEE TVCG Symposium on Visualization (VisSym)*, pages 293–300, 2004.
- 212. A. Patra and J. T. Oden. Problem decomposition for adaptive *hp* finite element methods. *Computing Systems in Engineering*, 6(2):97–109, 1995.
- 213. A. K. Patra, A. Laszloffy, and J. Long. Data structures and load balancing for parallel adaptive hp finite-element methods. *Computers & Mathematics with Applications*, 46(1):105–123, 2003.
- 214. G. Peano. Sur une courbe, qui remplit toute une aire plane. *Mathematische Annalen*, 36:157–160, 1890. Available online on the Göttinger Digitalisierungszentrum.
- 215. A. Pérez, S. Kamata, and E. Kawaguchi. Peano scanning of arbitrary size images. In *11th IAPR International Conference on Pattern Recognition, 1992. Vol. III. Conference C: Image, Speech and Signal Analysis*, pages 565–568. IEEE, 1992.
- 216. E. Perlman, R. Burns, Y. Li, and C. Meneveau. Data exploration of turbulence simulations using a database cluster. In *SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing*, pages 1–11, 2007.
- 217. J. R. Pilkington and S. B. Baden. Partitioning with spacefilling curves. CSE Technical Report Number CS94-349, University of California, San Diego, 1994.
- 218. J. R. Pilkington and S. B. Baden. Dynamic partitioning of non-uniform structured workloads with spacefilling curves. *IEEE Transactions on Parallel and Distributed Systems*, 7(3):288–300, 1996.
- 219. L. K. Platzman and J. J. Bartholdi III. Spacefilling curves and the planar travelling salesman problem. *Journal of the ACM*, 36(4):719–737, 1989.
- 220. G. Polya. Über eine Peanosche Kurve. *Bulletin de l’Académie des Sciences de Cracovie, Série A*, pages 1–9, 1913.
- 221. P. Prusinkiewicz. Graphical applications of L-systems. In *Proceedings of Graphics Interface ’86/Vision Interface ’86*, pages 247–253, 1986.
- 222. P. Prusinkiewicz and A. Lindenmayer. *The Algorithmic Beauty of Plants*. Springer, 1990.
- 223. P. Prusinkiewicz, A. Lindenmayer, and F. D. Fracchia. Synthesis of space-filling curves on the square grid. In *Fractals in the Fundamental and Applied Sciences*, pages 341–366. North Holland, Elsevier Science Publishers B.V., 1991.
- 224. J. Quinqueton and M. Berthod. A locally adaptive Peano scanning algorithm. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, PAMI-3(4):403–412, 1981.
- 225. A. Rahimian, I. Lashuk, S. K. Veerapaneni, A. Chandramowlishwaran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure, J. Vetter, R. Vuduc, D. Zorin, and G. Biros. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In *ACM/IEEE Conference on Supercomputing, 2010*, pages 1–11, 2010.
- 226. F. Ramsak, V. Markl, R. Fenk, K. Elhardt, and R. Bayer. Integrating the UB-tree into a database system kernel. In *Proceedings of the 26th International Conference on Very Large Databases*, pages 263–272, 2000.
- 227. M.-C. Rivara and Ch. Levin. A 3-d refinement algorithm suitable for adaptive and multi-grid techniques. *Communications in Applied Numerical Methods*, 8:281–290, 1992.
- 228. S. Roberts, S. Kalyanasundaram, M. Cardew-Hall, and W. Clarke. A key based parallel adaptive refinement technique for finite element methods. In *Computational Techniques and Applications: CTAC 97*, pages 577–584. World Scientific Press, 1998.
- 229. B. Roychoudhury and J. F. Muth. The solution of travelling salesman problems based on industrial data. *Journal of Complexity*, 46(3):347–353, 1995.
- 230. H. Sagan. Some reflections on the emergence of space-filling curves. *Journal of the Franklin Institute*, 328:419–430, 1991.
- 231. H. Sagan. On the geometrization of the Peano curve and the arithmetization of the Hilbert curve. *International Journal of Mathematical Education in Science and Technology*, 23(3):403–411, 1992.
- 232. H. Sagan. A three-dimensional Hilbert curve. *International Journal of Mathematical Education in Science and Technology*, 24(4):541–545, 1993.
- 233. H. Sagan. *Space-Filling Curves*. Universitext. Springer, 1994.
- 234. J. K. Salmon, M. S. Warren, and G. S. Winckelmans. Fast parallel tree codes for gravitational and fluid dynamical n-body problems. *International Journal of Supercomputer Applications*, 8(2):129–142, 1994.
- 235. H. Samet. The quadtree and related hierarchical data structures. *ACM Computing Surveys*, 16(2):187–260, 1984.
- 236. P. Sanders and T. Hansch. Efficient massively parallel quicksort. In *Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel*, volume 1253 of *Lecture Notes in Computer Science*, pages 13–24, 1997.
- 237. M. Saxena, P. M. Finnigan, C. M. Graichen, A. F. Hathaway, and V. N. Parthasarathy. Octree-based automatic mesh generation for non-manifold domains. *Engineering with Computers*, 11(1):1–14, 1995.
- 238. S. Schamberger and J.-M. Wierum. A locality preserving graph ordering approach for implicit partitioning: Graph-filling curves. In *Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems (PDCS’04)*, pages 51–57. ISCA, 2004.
- 239. S. Schamberger and J.-M. Wierum. Partitioning finite element meshes using space-filling curves. *Future Generation Computer Systems*, 21:759–766, 2005.
- 240. K. Schloegel, G. Karypis, and V. Kumar. *Graph partitioning for high-performance scientific simulations*, pages 491–541. Morgan Kaufmann Publishers Inc., 2003.
- 241. W. J. Schroeder and M. S. Shephard. A combined octree/Delaunay method for fully automatic 3-d mesh generation. *International Journal for Numerical Methods in Engineering*, 26(1):37–55, 1988.
- 242. E. G. Sewell. Automatic generation of triangulation for piecewise polynomial approximation. Technical Report CSD-TR83, Purdue University, 1972. PhD Thesis.
- 243. M. S. Shephard and M. K. Georges. Automatic three-dimensional mesh generation by the finite octree technique. *International Journal for Numerical Methods in Engineering*, 32(4):709–749, 1991.
- 244. W. Sierpinski. Sur une nouvelle courbe continue qui remplit toute une aire plane. *Bulletin de l’Académie des Sciences de Cracovie, Série A*, pages 462–478, 1912.
- 245. J. P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy. Load balancing and data locality in adaptive hierarchical *n*-body methods: Barnes–Hut, fast multipole, and radiosity. *Journal of Parallel and Distributed Computing*, 27:118–141, 1995.
- 246. R. Siromoney and K. G. Subramanian. Space-filling curves and infinite graphs. In *Graph-Grammars and Their Application to Computer Science*, volume 153 of *Lecture Notes in Computer Science*, pages 380–391, 1983.
- 247. B. Smith, P. Bjørstad, and W. Gropp. *Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations*. Cambridge University Press, 1996.
- 248. V. Springel. The cosmological simulation code GADGET-2. *Monthly Notices of the Royal Astronomical Society*, 364:1105–1134, 2005.
- 249. J. Steensland, S. Chandra, and M. Parashar. An application-centric characterization of domain-based SFC partitioners for parallel SAMR. *IEEE Transactions on Parallel and Distributed Systems*, 13(12):1275–1289, 2002.
- 250. R. J. Stevens, A. F. Lehar, and F. H. Preston. Manipulation and presentation of multidimensional image data using the Peano scan. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, PAMI-5(5):520–526, 1983.
- 251. Q. F. Stout. Topological matching. In *Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing*, pages 24–31, 1983.
- 252. H. Sundar, R. S. Sampath, S. S. Adavani, C. Davatzikos, and G. Biros. Low-constant parallel algorithms for finite element simulations using linear octrees. In *SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing*, pages 1–12. ACM, 2007.
- 253. H. Sundar, R. S. Sampath, and G. Biros. Bottom-up construction and 2:1 balance refinement of linear octrees in parallel. *SIAM Journal on Scientific Computing*, 30(5):2675–2708, 2008.
- 254. S. Tanimoto and T. Pavlidis. A hierarchical data structure for picture processing. *Computer Graphics and Image Processing*, 4:104–119, 1975.
- 255. J. Thiyagalingam, O. Beckmann, and P. H. J. Kelly. Is Morton layout competitive for large two-dimensional arrays yet? *Concurrency and Computation: Practice and Experience*, 18:1509–1539, 2006.
- 256. S. Tirthapura, S. Seal, and S. Aluru. A formal analysis of space filling curves for parallel domain decomposition. In *Proceedings of the 2006 International Conference on Parallel Processing (ICPP’06)*, pages 505–512. IEEE Computer Society, 2006.
- 257. H. Tropf and H. Herzog. Multidimensional range search in dynamically balanced trees. *Angewandte Informatik (Applied Informatics)*, 2:71–77, 1981.
- 258. T. Tu, D. R. O’Hallaron, and O. Ghattas. Scalable parallel octree meshing for terascale applications. In *SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing*, page 4. IEEE Computer Society, 2005.
- 259. L. Velho and J. de Miranda Gomes. Digital halftoning with space filling curves. *ACM SIGGRAPH Computer Graphics*, 25(4):81–90, 1991.
- 260. L. Velho and J. Gomes. Variable resolution 4-*k* meshes: Concepts and applications. *Computer Graphics Forum*, 19(4):195–212, 2000.
- 261. B. Von Herzen and A. H. Barr. Accurate triangulations of deformed, intersecting surfaces. *ACM SIGGRAPH Computer Graphics*, 21(4):103–110, 1987.
- 262. J. Wang and J. Shan. Space filling curve based point clouds index. In *Proceedings of the 8th International Conference on GeoComputation*, pages 551–562, 2005.
- 263. J. Warnock. A hidden surface algorithm for computer generated halftone pictures. Technical Report TR 4-15, Computer Science Department, University of Utah, 1969.
- 264. M. S. Warren and J. K. Salmon. A parallel hashed oct-tree n-body algorithm. In *Conference on High Performance Networking and Computing, Proceedings of the 1993 ACM/IEEE Conference on Supercomputing*, pages 12–21. ACM, 1993.
- 265. T. Weinzierl. *A Framework for Parallel PDE Solvers on Multiscale Adaptive Cartesian Grids*. Dissertation, Institut für Informatik, Technische Universität München, 2009.
- 266. T. Weinzierl and M. Mehl. Peano – a traversal and storage scheme for octree-like adaptive Cartesian multiscale grids. *SIAM Journal on Scientific Computing*, 33(5):2732–2760, 2011.
- 267. R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. *Parallel Computing*, 27(1–2):3–35, 2001.
- 268. J.-M. Wierum. Definition of a new circular space-filling curve – βΩ-indexing. Technical Report TR-001-02, Paderborn Center for Parallel Computing, PC^{2}, 2002.
- 269. N. Wirth. *Algorithmen und Datenstrukturen*. Teubner, 1975.
- 270. N. Wirth. *Algorithms + Data Structures = Programs*. Prentice Hall, 1976.
- 271. D. S. Wise. Representing matrices as quadtrees for parallel processors. *Information Processing Letters*, 20(4):195–199, 1985.
- 272. D. S. Wise and S. Franco. Costs of quadtree representation of nondense matrices. *Journal of Parallel and Distributed Computing*, 9(3):282–296, 1990.
- 273. I. H. Witten and R. M. Neal. Using Peano curves for bilevel display of continuous tone images. *IEEE Computer Graphics and Applications*, pages 47–52, May 1982.
- 274. I. H. Witten and B. Wyvill. On the generation and use of space-filling curves. *Software: Practice and Experience*, 13:519–525, 1983.
- 275. W. N. Dawes, S. A. Harvey, S. Fellows, N. Eccles, D. Jaeggi, and W. P. Kellar. A practical demonstration of scalable, parallel mesh generation. In *47th AIAA Aerospace Sciences Meeting & Exhibit*, 2009. AIAA-2009-0981.
- 276. W. Wunderlich. Irregular curves and functional equations. *Ganita*, 5:215–230, 1954.
- 277. W. Wunderlich. Über Peano-Kurven. *Elemente der Mathematik*, 28(1):1–24, 1973.
- 278. K. Yang and M. Mills. Fractal based image coding scheme using Peano scan. In *Proceedings of ISCAS ’88*, volume 1470, pages 2301–2304, 1988.
- 279. M.-M. Yau and S. N. Srihari. A hierarchical data structure for multidimensional digital images. *Communications of the ACM*, 26(7):504–515, 1983.
- 280. L. Ying, G. Biros, D. Zorin, and H. Langston. A new parallel kernel-independent fast multipole method. In *SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing*. IEEE Computer Society, 2003.
- 281. K. Yotov, T. Roeder, K. Pingali, J. Gunnels, and F. Gustavson. An experimental comparison of cache-oblivious and cache-conscious programs. In *Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures*, pages 93–104, 2007.
- 282. Y. Zhang and R. E. Webber. Space diffusion: An improved parallel halftoning technique using space-filling curves. In *Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques*, pages 305–312. ACM, 1993.
- 283. S. Zhou and C. B. Jones. HCPO: An efficient insertion order for incremental Delaunay triangulation. *Information Processing Letters*, 93:37–42, 2005.
- 284. U. Ziegler. The NIRVANA code: Parallel computational MHD with adaptive mesh refinement. *Computer Physics Communications*, 179(4):227–244, 2008.
- 285. G. Zumbusch. On the quality of space-filling curve induced partitions. *Zeitschrift für Angewandte Mathematik und Mechanik*, 81, Suppl. 1:25–28, 2001.
- 286. G. Zumbusch. Load balancing for adaptively refined grids. *Proceedings in Applied Mathematics and Mechanics*, 1:534–537, 2002.
- 287. G. Zumbusch. *Parallel Multilevel Methods: Adaptive Mesh Refinement and Loadbalancing*. Vieweg+Teubner, 2003.