Agglomerative Clustering of Growing Squares

We study an agglomerative clustering problem motivated by interactive glyphs in geo-visualization. Consider a set of disjoint square glyphs on an interactive map. When the user zooms out, the glyphs grow in size relative to the map, possibly with different speeds. When two glyphs intersect, we wish to replace them by a new glyph that captures the information of the intersecting glyphs. We present a fully dynamic kinetic data structure that maintains a set of $n$ disjoint growing squares. Our data structure uses $O(n \log n \log\log n)$ space, supports queries in worst case $O(\log^2 n)$ time, and updates in $O(\log^5 n)$ amortized time. This leads to an $O(n\,\alpha(n)\log^5 n)$ time algorithm to solve the agglomerative clustering problem. This is a significant improvement over the current best $O(n^2)$ time algorithms.


Introduction
We study an agglomerative clustering problem motivated by interactive glyphs in geovisualization. Our specific use case stems from the eHumanities, but similar visualizations are used in a variety of application areas. GlamMap [5] is a visual analytics tool which allows the user to interactively explore datasets which contain (at least) the following metadata of a book collection: author, title, publisher, year of publication, and location (city) of publisher. Each book is depicted by a square, color-coded by publication year, and placed on a map according to the location of its publisher. Overlapping squares (many books are published in Leipzig, for example) are recursively aggregated into a larger glyph until all glyphs are disjoint (see Fig. 1). As the user zooms out, the glyphs "grow" relative to the map to remain legible. As a result, glyphs start to overlap and need to be merged into larger glyphs to keep the map clear and uncluttered. It is straightforward to compute the resulting agglomerative clustering whenever a data set is loaded and to serve it to the user as needed by the current zoom level. However, GlamMap allows the user to filter by author, title, year of publication, or other applicable metadata. It is impossible to pre-compute the clustering for every conceivable combination of filter values. To allow the user to browse at interactive speeds, we hence need an efficient agglomerative clustering algorithm for growing squares (glyphs). Interesting bibliographic data sets (such as the catalogue of WorldCat, which contains more than 321 million library records at hundreds of thousands of distinct locations) are too large by a significant margin to be clustered fast enough with the current state-of-the-art $O(n^2)$ time algorithms (here $n$ is the number of squares or glyphs).
In this paper we formally analyze the problem and present a fully dynamic data structure that uses $O(n(\log n \log\log n)^2)$ space, supports updates in $O(\log^7 n)$ amortized time, and queries in $O(\log^3 n)$ time, which allows us to compute the agglomerative clustering for $n$ glyphs in $O(n\alpha(n)\log^7 n)$ time. Here, $\alpha$ is the extremely slowly growing inverse Ackermann function. To the best of our knowledge, this is the first fully dynamic clustering algorithm which beats the classic $O(n^2)$ time bound.
Formal problem statement. Let $P$ be a set of points in $\mathbb{R}^2$ (the locations of publishers from our example). Each point $p \in P$ has a positive weight $p_w$ (number of books published in this city). Given a "time" parameter $t$, we interpret the points in $P$ as squares. More specifically, let $\square_p(t)$ be the square centered at $p$ with width $t p_w$. For ease of exposition we assume all x- and y-coordinates to be unique. With some abuse of notation we may refer to $P$ as a set of squares rather than the set of center points of squares. Observe that initially, i.e. at $t = 0$, all squares in $P$ are disjoint. As $t$ increases, the squares in $P$ grow, and hence they may start to intersect. When two squares $\square_p(t)$ and $\square_q(t)$ intersect at time $t$, we remove both $p$ and $q$ and replace them by a new point $z = \alpha p + (1 - \alpha) q$, with $\alpha = p_w/(p_w + q_w)$, of weight $z_w = p_w + q_w$ (see Fig. 2).
Related Work. Funke, Krumpe, and Storandt [6] introduced so-called "ball tournaments", a related, but simpler, problem, which is motivated by map labeling. Their input is a set of balls in $\mathbb{R}^d$ with an associated set of priorities. The balls grow linearly and whenever two balls touch, the ball with the lower priority is eliminated. The goal is to compute the elimination sequence efficiently. Bahrdt et al. [4] and Funke and Storandt [7] improved upon the initial results and presented bounds which depend on the ratio $\Delta$ of the largest to the smallest radius. Specifically, Funke and Storandt [7] show how to compute an elimination sequence for $n$ balls in $O(n \log\Delta\,(\log n + \Delta^{d-1}))$ time in arbitrary dimensions and in $O(Cn\,\mathrm{polylog}\,n)$ time for $d = 2$, where $C$ denotes the number of different radii. In our setting eliminations are not sufficient, since merged glyphs need to be re-inserted. Furthermore, as opposed to typical map labeling problems where labels come in a fixed range of sizes, the sizes of our glyphs can vary by a factor of 10,000 or more (Amsterdam with its many well-established publishers vs. Kaldenkirchen with one obscure one). Ahn et al. [2] very recently and independently developed the first sub-quadratic algorithms to compute elimination orders for ball tournaments. Their results apply to balls and boxes in two or higher dimensions. Specifically, for squares in two dimensions they can compute an elimination order in $O(n \log^4 n)$ time. Their results critically depend on the fact that they know the elimination priorities at the start of their algorithm and that they only have to handle deletions. Hence they do not have to run an explicit simulation of the growth process and can achieve their results by the clever use of advanced data structures. In contrast, we are handling the fully dynamic setting with both insertions and deletions, and without a specified set of priorities.
Our clustering problem combines both dynamic and kinetic aspects: squares grow, which is a restricted form of movement, and squares are both inserted and deleted. There are comparatively few papers which tackle dynamic kinetic problems. Alexandron et al. [3] present a dynamic and kinetic data structure for maintaining the convex hull of points (or analogously, the lower envelope of lines) moving in $\mathbb{R}^2$. Their data structure processes (in expectation) $O(n^2 \beta_{s+2}(n) \log n)$ events in $O(\log^2 n)$ time each. Here, $\beta_s(n) = \lambda_s(n)/n$, and $\lambda_s(n)$ is the maximum length of a Davenport-Schinzel sequence on $n$ symbols of order $s$. Agarwal et al. [1] present dynamic and kinetic data structures for maintaining the closest pair and all nearest neighbors. The expected number of events processed is again roughly $O(n^2 \beta_{s+2}(n)\,\mathrm{polylog}\,n)$, each of which can be handled in $O(\mathrm{polylog}\,n)$ expected time. We use some ideas and constructions which are similar in flavor to the structures presented in their paper.
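To make the formal problem statement concrete, the following is a minimal Python sketch of the merge rule and of a deliberately naive simulation. It is not the algorithm of this paper, nor the quadratic algorithms mentioned above (it is cubic overall); the tuple layout `(x, y, w)` and the function names `contact_time`, `merge`, and `naive_clustering` are assumptions made for illustration only.

```python
from itertools import combinations

# A glyph is a tuple (x, y, w): center coordinates and positive weight.


def contact_time(p, q):
    """First time t at which the squares of p and q touch.

    The square of p has width t * p_w, so the two squares meet once the
    larger coordinate difference of the centers equals t * (p_w + q_w) / 2.
    """
    (px, py, pw), (qx, qy, qw) = p, q
    return 2.0 * max(abs(px - qx), abs(py - qy)) / (pw + qw)


def merge(p, q):
    """Replace p and q by z = a*p + (1 - a)*q with a = p_w / (p_w + q_w)."""
    (px, py, pw), (qx, qy, qw) = p, q
    a = pw / (pw + qw)
    return (a * px + (1 - a) * qx, a * py + (1 - a) * qy, pw + qw)


def naive_clustering(points):
    """Brute-force baseline; returns a list of (time, p, q, z) merge records."""
    glyphs = list(points)
    t = 0.0
    merges = []
    while len(glyphs) > 1:
        # Scan all pairs for the earliest contact (quadratic work per merge).
        i, j = min(combinations(range(len(glyphs)), 2),
                   key=lambda ij: contact_time(glyphs[ij[0]], glyphs[ij[1]]))
        p, q = glyphs[i], glyphs[j]
        # A freshly merged glyph may already overlap a neighbour; such a
        # merge is recorded at the current time rather than in the past.
        t = max(t, contact_time(p, q))
        z = merge(p, q)
        merges.append((t, p, q, z))
        glyphs = [g for k, g in enumerate(glyphs) if k not in (i, j)] + [z]
    return merges
```

For example, `naive_clustering([(0, 0, 1), (4, 0, 1), (100, 0, 1)])` first merges the two left glyphs at time 4 into a glyph of weight 2 centered at (2, 0), and only later merges the result with the third glyph.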

Results.
We present a fully dynamic data structure that can maintain a set $P$ of disjoint growing squares. Our data structure will produce an intersection event at every time $t$ when two squares $\square_p$ and $\square_q$, with $p, q \in P$, start to intersect (i.e. at any time before $t$, all squares in $P$ remain disjoint). At such a time, we then have to delete some of the squares, to make sure that the squares in $P$ are again disjoint. At any time, our data structure supports inserting a new square that is disjoint from the squares in $P$, or removing an existing square from $P$. Our data structure can handle a sequence of $m \geq n$ updates in a total of $O(m\alpha(n)\log^7 n)$ time; each update is performed in $O(\log^7 n)$ amortized time.
The Main Idea. We develop a data structure that can maintain a dynamic set of disjoint squares $P$, and produce an intersection event at every time $t$ when a square $\square_q$ starts to intersect with a square $\square_p$ of a point $p \in P$ that dominates $q$. We say that a point $p$ dominates $q$ if and only if $q_x \leq p_x$ and $q_y \leq p_y$. We then combine four of these data structures, one for each quadrant, to make sure that all squares in $P$ remain disjoint. The main observation that allows us to maintain $P$ efficiently is that we can maintain the points $D(q)$ dominating $q$ in an order so that a prefix of $D(q)$ will have their squares intersect the top side of $\square_q$ first, and the remaining squares will intersect the right side of $\square_q$ first. We formalize this in Section 2. We then present our data structure (essentially a pair of range trees interlinked with linking certificates) in Section 3. While our data structure is conceptually simple, the exact implementation is somewhat intricate, and the details are numerous. Our initial analysis shows that our data structure maintains $O(\log^6 n)$ certificates per square, which yields an $O(\log^7 n)$ amortized update time. This allows us to simulate the process of growing the squares in $P$, and thus solve the agglomerative glyph clustering problem, in $O(n\alpha(n)\log^7 n)$ time using $O(n \log^5 n)$ space. In Section 4 we analyze the relation between canonical subsets in dominance queries. We show that for two range trees $T_R$ and $T_B$ in $\mathbb{R}^d$, the number of pairs of nodes $r \in T_R$ and $b \in T_B$ for which $r$ occurs in the canonical subset of a dominance query defined by $b$ and vice versa is only $O(n(\log n \log\log n)^2)$, where $n$ is the total size of $T_R$ and $T_B$. This implies that the number of linking certificates that our data structure maintains, as well as the total space used, is actually only $O(n(\log n \log\log n)^2)$. Since the linking certificates actually provide an efficient representation of all dominance relations between two point sets (or within a point set), we believe that this result is of independent interest as well.

Geometric Properties
Figure 3: The squares and the projection of their centers and relevant corners onto the line $\gamma$.
Let $\ell_q$ denote the bottom left vertex of a square $\square_q$, and let $r_q$ denote the top right vertex of $\square_q$. Furthermore, let $D(q)$ denote the subset of points of $P$ dominating $q$, and let $L(q) = \{\ell_p \mid p \in D(q)\}$ denote the set of bottom left vertices of the squares of those points.
Observation 1. Let $p \in D(q)$ be a point dominating point $q$. The squares $\square_q(t)$ and $\square_p(t)$ intersect at time $t$ if and only if $r_q(t)$ dominates $\ell_p(t)$ at time $t$.
Consider a line $\gamma$ with slope minus one, project all points in $Z(t) = \{r_q(t)\} \cup L(q)(t)$, for some time $t$, onto $\gamma$, and order them from left to right. Observe that, since all points in $Z$ move along lines with slope one, this order does not depend on the time $t$. Moreover, for any point $p$, we have $r_p(0) = \ell_p(0) = p$, so we can easily compute this order by projecting the centers of the squares onto $\gamma$ and sorting them. Let $D^-(q)$ denote the (ordered) subset of $D(q)$ that occur before $q$ in the order along $\gamma$, and let $D^+(q)$ denote the ordered subset of $D(q)$ that occur after $q$ in the order along $\gamma$. We define $L^-(q)$ and $L^+(q)$ analogously.
Observation 2. Let $p \in D(q)$ be a point dominating point $q$, and let $t^*$ be the first time at which $r = r_q(t^*)$ dominates $\ell = \ell_p(t^*)$. We then have that $\ell_x < r_x$ and $\ell_y = r_y$ if and only if $p \in D^-(q)$, and $\ell_x = r_x$ and $\ell_y < r_y$ if and only if $p \in D^+(q)$. See Fig. 3 for an illustration.
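The ordering along $\gamma$ and the case distinction of Observation 2 can be checked numerically. The sketch below assumes points are tuples `(x, y, w)` and uses `x - y` as the (unnormalized) position along a line of slope minus one; the helper names are hypothetical, and a small tolerance guards the floating-point comparison.

```python
def gamma_key(p):
    """Position of p = (x, y, w) along a line of slope -1, left to right."""
    return p[0] - p[1]


def lower_left(p, t):
    x, y, w = p
    return (x - t * w / 2.0, y - t * w / 2.0)


def upper_right(q, t):
    x, y, w = q
    return (x + t * w / 2.0, y + t * w / 2.0)


def split_dominators(q, points):
    """Return (D_minus, D_plus): dominators of q before/after q along gamma."""
    dom = [p for p in points if p != q and p[0] >= q[0] and p[1] >= q[1]]
    d_minus = sorted((p for p in dom if gamma_key(p) < gamma_key(q)), key=gamma_key)
    d_plus = sorted((p for p in dom if gamma_key(p) > gamma_key(q)), key=gamma_key)
    return d_minus, d_plus


def entry_edge(q, p, eps=1e-9):
    """Edge of q's square through which p's bottom left vertex enters.

    Assumes p dominates q and that their gamma positions differ.
    """
    dx, dy = p[0] - q[0], p[1] - q[1]
    t_star = 2.0 * max(dx, dy) / (p[2] + q[2])  # first time r_q dominates l_p
    (lx, _), (rx, _) = lower_left(p, t_star), upper_right(q, t_star)
    # l_x < r_x (and l_y = r_y) means entry through the top edge.
    return "top" if rx - lx > eps else "right"
```

With `q = (0, 0, 1)`, the point `(1, 3, 1)` lies before `q` along $\gamma$ and enters through the top edge, whereas `(3, 1, 1)` lies after `q` and enters through the right edge.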
Observation 2 implies that the points $p$ in $D^-(q)$ will start to intersect $\square_q$ at some time $t^*$ because the bottom left vertex $\ell_p$ of $\square_p$ will enter $\square_q$ through the top edge, whereas the bottom left vertex of the (squares of the) points in $D^+(q)$ will enter $\square_q$ through the right edge. We thus obtain the following result.
Lemma 3. Let $t^*$ be the first time at which a square $\square_p$ of a point $p \in D(q)$ intersects $\square_q$. We then have that

A Kinetic Data Structure for Growing Squares
In this section we present a data structure that can detect the first intersection among a dynamic set of disjoint growing squares. In particular, we describe a data structure that can detect intersections between all pairs of squares $\square_p$, $\square_q$ in $P$ such that $p \in D^+(q)$. We build an analogous data structure for when $p \in D^-(q)$. This covers all intersections between pairs of squares $\square_p$, $\square_q$, where $p \in D(q)$. We then use four copies of these data structures, one for each quadrant, to detect the first intersection among all pairs of squares. We describe the data structure itself in Section 3.1, and we briefly describe how to query it in Section 3.2. We deal with updates, e.g. inserting a new square into $P$ or deleting an existing square from $P$, in Section 3.3. In Section 3.4 we analyze the total number of events that we have to process, and the time required to do so, when we grow the squares.

The Data Structure
Our data structure consists of two three-layered trees $T^L$ and $T^R$, and a set of certificates linking nodes in $T^L$ to nodes in $T^R$. These trees essentially form two 3D range trees on the centers of the squares in $P$, taking the third coordinate $p_\gamma$ of each point to be its rank in the order along the line $\gamma$ (ordered from left to right). The third layer of $T^L$ will double as a kinetic tournament tracking the bottom left vertices of squares. Similarly, $T^R$ will track the top right vertices of the squares.
The Layered Trees. The tree $T^L$ is a 3D range tree storing the center points in $P$. Each layer is implemented by a weight-balanced binary search tree (bb[$\alpha$] tree) [9], and each node $\mu$ corresponds to a canonical subset $P_\mu$ of points stored in the leaves of the subtree rooted at $\mu$. The points are ordered on x-coordinate first, then on y-coordinate, and finally on $\gamma$-coordinate. Let $L_\mu$ denote the set of bottom left vertices of squares corresponding to the set $P_\mu$, for some node $\mu$.
Consider the associated structure $X^L_v$ of some secondary node $v$. We consider $X^L_v$ as a kinetic tournament on the x-coordinates of the points $L_v$ [1]. More specifically, every internal node $w \in X^L_v$ corresponds to a set of points $P_w$ consecutive along the line $\gamma$. Since the $\gamma$-coordinates of a point $p$ and its bottom left vertex $\ell_p$ are equal, this means $w$ also corresponds to a set of consecutive bottom left vertices $L_w$. Node $w$ stores the vertex $\ell_p$ in $L_w$ with minimum x-coordinate, and will maintain certificates that guarantee this [1].
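As an illustration of what such a tournament stores, here is a minimal static sketch in Python. It assumes each bottom left vertex is reduced to a pair `(x, w)` whose x-coordinate at time `t` is `x - t*w/2`, and it rebuilds the tournament recursively instead of maintaining it under updates; `build_tournament` and `failure_time` are hypothetical names, not the interface of the structure in [1].

```python
import math


def x_left(p, t):
    """x-coordinate at time t of the bottom left vertex of p = (x, w)."""
    x, w = p
    return x - t * w / 2.0


def failure_time(winner, loser, now):
    """Earliest time >= now at which loser's vertex overtakes winner's.

    Both x-coordinates are linear in t, so they cross at most once; the
    loser can only catch up if its square grows strictly faster.
    """
    (wx, ww), (lx, lw) = winner, loser
    if lw <= ww:
        return math.inf
    return max(now, 2.0 * (lx - wx) / (lw - ww))


def build_tournament(points, now):
    """Return (winner, certificates) for points listed in gamma-order.

    Each certificate is (winner, loser, failure_time) for one internal
    node, certifying that the winner has the minimum x-coordinate there.
    """
    if len(points) == 1:
        return points[0], []
    mid = len(points) // 2
    left_w, left_c = build_tournament(points[:mid], now)
    right_w, right_c = build_tournament(points[mid:], now)
    if x_left(left_w, now) <= x_left(right_w, now):
        winner, loser = left_w, right_w
    else:
        winner, loser = right_w, left_w
    cert = (winner, loser, failure_time(winner, loser, now))
    return winner, left_c + right_c + [cert]
```

The earliest failure time over all returned certificates is the next tournament event; a kinetic implementation would repair only the nodes on one root-to-leaf path at that time rather than rebuild.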
[Figure caption, beginning lost: ... and $z \in Q(m_w)$ then we add a linking certificate between the rightmost top right vertex $r_q$, $q \in P_z$, and the leftmost bottom left vertex $\ell_p$, $p \in P_w$.]
The tree $T^R$ has the same structure as $T^L$: it is a three-layered range tree on the center points in $P$. The difference is that a ternary structure $X^R_v$, for some secondary node $v$, forms a kinetic tournament maintaining the maximum x-coordinate of the points in $R_v$, where $R_v$ are the top right vertices of the squares (with center points) in $P_v$. Hence, every ternary node $z \in X^R_v$ stores the vertex $r_q$ with maximum x-coordinate among $R_v$. Let $X^L$ and $X^R$ denote the set of all kinetic tournament nodes in $T^L$ and $T^R$, respectively.
Linking the Trees. Next, we describe how to add linking certificates between the kinetic tournament nodes in the trees $T^L$ and $T^R$ that guarantee the squares are disjoint. More specifically, we describe the certificates, between nodes $w \in X^L$ and $z \in X^R$, that guarantee that the squares $\square_p$ and $\square_q$ are disjoint, for all pairs $q \in P$ and $p \in D^+(q)$. Consider a point $q$. There are $O(\log^2 n)$ nodes in the secondary trees of $T^L$, whose canonical subsets together represent exactly $D(q)$. For each of these nodes $v$ we can then find $O(\log n)$ nodes in $X^L_v$ representing the points in $L^+(q)$. So, in total $q$ is interested in a set $Q^L(q)$ of $O(\log^3 n)$ kinetic tournament nodes. It now follows from Lemma 3 that if we were to add certificates certifying that $r_q$ is left of the point stored at the nodes in $Q^L(q)$ we can detect when $\square_q$ intersects with a square of a point in $D^+(q)$. However, as there may be many points $q$ interested in a particular kinetic tournament node $w$, we cannot afford to maintain all of these certificates. The main idea is to represent all of these points $q$ by a number of canonical subsets of nodes in $T^R$, and add certificates to only these nodes.
Consider a point $p$. Symmetric to the above construction, there are $O(\log^3 n)$ nodes in kinetic tournaments associated with $T^R$ that together exactly represent the (top right corners of) the points $q$ dominated by $p$ and for which $p \in D^+(q)$. Let $Q^R(p)$ denote this set of kinetic tournament nodes.
Next, we extend the definitions of $Q^L$ and $Q^R$ to kinetic tournament nodes. To this end, we first associate each kinetic tournament node with a (query) point in $\mathbb{R}^3$. Consider a kinetic tournament node $w$ in a tournament $X^L_v$, and let $u$ be the node in the primary tree $T^L$ for which $v \in T_u$. Let $m_w = (\min_{a \in P_u} a_x, \min_{b \in P_v} b_y, \min_{c \in P_w} c_\gamma)$ be the point associated with $w$ (note that we take the minimum over different sets $P_u$, $P_v$, and $P_w$ for the different coordinates), and define $Q^R(w) = Q^R(m_w)$. Symmetrically, for a node $z$ in a tournament $X^R_v$, with $v \in T_u$ and $u \in T^R$, we define $m_z = (\max_{a \in P_u} a_x, \max_{b \in P_v} b_y, \max_{c \in P_z} c_\gamma)$ and $Q^L(z) = Q^L(m_z)$.
We now add a linking certificate between every pair of nodes $w \in X^L$ and $z \in X^R$ for which (i) $w$ is a node in the canonical subset of $z$, that is $w \in Q^L(z)$, and (ii) $z$ is a node in the canonical subset of $w$, that is $z \in Q^R(w)$. Such a certificate will guarantee that the point $r_q$ currently stored at $z$ lies left of the point $\ell_p$ stored at $w$.
Proof. We start with the first part of the lemma statement. Every node $w \in X^L$ can be associated with at most $O(\log^3 n)$ linking certificates: one with each node in $Q^R(w)$. Analogously, every node $z \in X^R$ can be associated with at most $O(\log^3 n)$ linking certificates: one for each node in $Q^L(z)$.
Every point $p$ occurs in the canonical subset of at most $O(\log^3 n)$ kinetic tournament nodes in both $X^L$ and $X^R$: $p$ is stored in $O(\log^2 n)$ leaves of the kinetic tournaments, and in each such tournament it can participate in $O(\log n)$ certificates (at most two tournament certificates in $O(\log n)$ nodes). As we argued above, each such node itself occurs in at most $O(\log^3 n)$ certificates. The lemma follows.
What remains to argue is that we can still detect the first upcoming intersection.
Proof. Let $b$ be the first node on the path from the root of $T_B$ to $p$ such that the canonical subset $P_b$ of $b$ is contained in the interval $[q, \infty)$, but the canonical subset of the parent of $b$ is not. We define $b$ to be the root of $T_B$ if no such node exists. We define $r$ to be the first node on the path from the root of $T_R$ to $q$ for which $P_r$ is contained in $(-\infty, x]$ but the canonical subset of the parent is not. We again define $r$ as the root of $T_R$ if no such node exists. See Fig. 5. Clearly, we now directly have that $r$ is one of the nodes whose canonical subsets form $R \cap (-\infty, x]$, and that $q \in P_r$ (as $r$ lies on the search path to $q$). It is also easy to see that $p \in P_b$, as $b$ lies on the search path to $p$. All that remains is to show that $b$ is one of the canonical subsets that together form $B \cap [x', \infty)$. This follows from the fact that $q \leq x' < x \leq p$ (and thus $P_b$ is indeed a subset of $[x', \infty)$) and the fact that the subset of the parent $v$ of $b$ contains an element smaller than $q$, and can thus not be a subset of $[x', \infty)$.
Lemma 6. Let $\square_p$ and $\square_q$, with $p \in D^+(q)$, be the first pair of squares to intersect, at some time $t^*$. Then there is a pair of nodes $w$, $z$ that have a linking certificate that fails at time $t^*$.
Proof. Consider the leaves representing $p$ and $q$ in $T^L$ and $T^R$, respectively. By Lemma 5 we get that there is a pair of nodes $u \in T^L$ and $u' \in T^R$ that, among other properties, have $p \in P_u$ and $q \in P_{u'}$. Hence, we can apply Lemma 5 again on the associated trees of $u$ and $u'$, giving us nodes $v \in T_u$ and $v' \in T_{u'}$ which again have $p \in P_v$ and $q \in P_{v'}$. Finally, we apply Lemma 5 once more on $X^L_v$ and $X^R_{v'}$, giving us nodes $w \in X^L_v$ and $z \in X^R_{v'}$ with $p \in P_w$ and $q \in P_z$. In addition, these three applications of Lemma 5 give us two points $(x, y, \gamma)$ and $(x', y', \gamma')$ such that: $P_u$ occurs as a canonical subset representing $P \cap ([x', \infty) \times \mathbb{R}^2)$, $P_v$ occurs as a canonical subset representing $P_u \cap (\mathbb{R} \times [y', \infty) \times \mathbb{R})$, and $P_w$ occurs as a canonical subset representing $P_v \cap (\mathbb{R}^2 \times [\gamma', \infty))$, and such that $P_{u'}$ occurs as a canonical subset representing $P \cap ((-\infty, x] \times \mathbb{R}^2)$, $P_{v'}$ occurs as a canonical subset representing $P_{u'} \cap (\mathbb{R} \times (-\infty, y] \times \mathbb{R})$, and $P_z$ occurs as a canonical subset representing $P_{v'} \cap (\mathbb{R}^2 \times (-\infty, \gamma])$. Combining these first three facts, and observing that $m_z = (x', y', \gamma')$, gives us that $P_w$ occurs as a canonical subset representing $P \cap ([x', \infty) \times [y', \infty) \times [\gamma', \infty))$, that is, $w \in Q^L(z)$. Analogously, combining the latter three facts and $m_w = (x, y, \gamma)$ gives us $z \in Q^R(w)$. Therefore, $w$ and $z$ have a linking certificate. This linking certificate involves the leftmost bottom left vertex $\ell_a$ for some point $a \in P_w$ and the rightmost top right vertex $r_b$ for some point $b \in P_z$. Since $p \in P_w$ and $q \in P_z$, we have that $r_q \leq r_b$ and $\ell_a \leq \ell_p$ in x-coordinate, and thus we detect their intersection at time $t^*$.
From Lemma 6 it follows that we can now detect the first intersection between a pair of squares $\square_p$, $\square_q$, with $p \in D^+(q)$. We define an analogous data structure for when $p \in D^-(q)$. Following Lemma 3, the kinetic tournaments will maintain the vertices with minimum and maximum y-coordinate for this case. We then again link up the kinetic tournament nodes in the two trees appropriately.
Space Usage. Our trees $T^L$ and $T^R$ are range trees in $\mathbb{R}^3$, and thus use $O(n \log^2 n)$ space. However, it is easy to see that this is dominated by the space required to store the certificates. For all $O(n \log^2 n)$ kinetic tournament nodes we store at most $O(\log^3 n)$ certificates (Lemma 4), and thus the total space used by our data structure is $O(n \log^5 n)$. In Section 4 we will show that the number of certificates that we maintain is actually only $O(n(\log n \log\log n)^2)$. This means that our data structure also uses only $O(n(\log n \log\log n)^2)$ space.

Answering Queries
The basic query that our data structure supports is testing if a query square $\square_q$ currently intersects with a square $\square_p$ in $P$, with $p \in D^+(q)$. To this end, we simply select the $O(\log^3 n)$ kinetic tournament nodes whose canonical subsets together represent $D^+(q)$. For each such node $w$ we check if the x-coordinate of the bottom left vertex $\ell_p$ stored at that node (which has minimum x-coordinate among $L_w$) is smaller than the x-coordinate of $r_q$. If so, the squares intersect. The correctness of our query algorithm directly follows from Observation 2. The total time required for a query is $O(\log^3 n)$. Similarly, we can test if a given query point $q$ is contained in a square $\square_p$, with $p \in D^+(q)$. Note that our full data structure will contain trees analogous to $T^L$ that can be used to check if there is a square $\square_p$ with $p \in P$, where $p \in D^-(q)$ or $p$ lies in one of the other quadrants defined by $q$, that intersects $\square_q$.
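The predicate evaluated by this query can be written down as a brute-force reference that scans all points instead of using the $O(\log^3 n)$ canonical tournament nodes. The sketch below assumes points are tuples `(x, y, w)`; the function names are hypothetical, and the strict comparison follows the wording above.

```python
def gamma(p):
    """Position of p = (x, y, w) along a line of slope -1."""
    return p[0] - p[1]


def lower_left_x(p, t):
    return p[0] - t * p[2] / 2.0


def upper_right_x(q, t):
    return q[0] + t * q[2] / 2.0


def d_plus(q, points):
    """Points dominating q that come after q in the order along gamma."""
    return [p for p in points
            if p[0] >= q[0] and p[1] >= q[1] and gamma(p) > gamma(q)]


def query_d_plus(q, points, t):
    """True iff the square of q overlaps a square of some p in D+(q) at time t.

    Only the minimum x-coordinate among the bottom left vertices of D+(q)
    is compared against the x-coordinate of the top right vertex of q.
    """
    candidates = d_plus(q, points)
    if not candidates:
        return False
    return min(lower_left_x(p, t) for p in candidates) < upper_right_x(q, t)


def squares_intersect(p, q, t):
    """Direct test: both coordinate gaps are below the summed half-widths."""
    half = t * (p[2] + q[2]) / 2.0
    return abs(p[0] - q[0]) < half and abs(p[1] - q[1]) < half
```

For `q = (0, 0, 1)` and the single $D^+$ point `(5, 1, 1)`, the query is false at time 4 and true at time 6, in agreement with the direct overlap test.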

Inserting or Deleting a Square
At an insertion or deletion of a square $\square_p$ we proceed in three steps. First, we update the individual trees $T^L$ and $T^R$, making sure that they once again represent 3D range trees of all center points $P$, and that the ternary data structures are, by themselves, correct kinetic tournaments. For each kinetic tournament node in $X^L$ affected by the update, we then query $T^R$ to find a new set of linking certificates. We update the affected nodes in $X^R$ analogously. Finally, we update the global event queue that stores all certificates.
Lemma 7. Inserting a square into $T^L$ or deleting a square from $T^L$ takes $O(\log^3 n)$ amortized time.
Proof. We use the following standard procedure for updating the three-level bb[$\alpha$] trees $T^L$ in $O(\log^3 n)$ amortized time. An update (insertion or deletion) in a ternary data structure can easily be handled in $O(\log n)$ time. When we insert an element $x$ into, or delete it from, a bb[$\alpha$] tree that has associated data structures, we add or remove the leaf that contains $x$, rebalance the tree by rotations, and finally add or remove $x$ from the associated data structures. When we do a left rotation around an edge $(\mu, \nu)$ we have to build a new associated data structure for node $\mu$ from scratch. See Fig. 6. Right rotations are handled analogously. It is well known that if building the associated data structure at node $\mu$ takes $O(|P_\mu| \log^c |P_\mu|)$ time, for some $c \geq 0$, then the cost of all rebalancing operations in a sequence of $m$ insertions and deletions is $O(m \log^{c+1} n)$ in total, where $n$ is the maximum size of the tree at any time [8]. We can build a new kinetic tournament $X^L_v$ for node $v$ (using the associated data structures at its children) in linear time. Note that this cost excludes updating the global event queue. Building a new secondary tree $T_v$, including its associated kinetic tournaments, takes $O(|T_v| \log |T_v|)$ time. It then follows that the cost of our rebalancing operations is at most $O(m \log^2 n)$. This is dominated by the total number of nodes created and deleted, $O(m \log^3 n)$, during these operations. Hence, we can insert or delete a point (square) in $T^L$ in $O(\log^3 n)$ amortized time.
Analogous to Lemma 7 we can update $T^R$ in $O(\log^3 n)$ amortized time. Next, we update the linking certificates. We say that a kinetic tournament node $w$ in $T^L$ is affected by an update if (i) the update added or removed a leaf node in the subtree rooted at $w$, (ii) node $w$ was involved in a tree rotation, or (iii) $w$ occurs in a newly built associated tree $X^L_v$ (for some node $v$). Let $X^L_i$ denote the set of nodes affected by update $i$. Analogously, we define the set of nodes $X^R_i$ of $T^R$ affected by the update. For each node $w \in X^L_i$, we query $T^R$ to find the set of $O(\log^3 n)$ nodes whose canonical subsets represent $Q^R(w)$. For each node $z$ in this set, we test if we have to add a linking certificate between $w$ and $z$. As we show next, this takes constant time for each node $z$, and thus $O(\sum_i |X^L_i| \log^3 n)$ time in total, for all nodes $w$. We update the linking certificates for all nodes in $X^R_i$ analogously.
We have to add a link between a node $z \in Q^R(w)$ and $w$ if and only if we also have $w \in Q^L(z)$. We test this as follows. Let $v$ be the node whose associated $X^L_v$ contains $w$, and let $u$ be the node in $T^L$ whose associated tree contains $v$. We have that w
We can test each of these conditions in constant time:
Observation 8. Let $q$ be a query point in $\mathbb{R}^1$, let $w$ be a node in a binary search tree $T$, and let $x_p = \min P_p$ for the parent $p$ of $w$ in $T$, or $x_p = -\infty$ if no such node exists. We have that $w \in C(T, [q, \infty))$ if and only if $q \leq \min P_w$ and $q > x_p$.
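Observation 8 can be checked against a direct computation of the canonical subsets. The following sketch uses a simple leaf-oriented, perfectly balanced tree over sorted keys rather than a bb[$\alpha$] tree (an assumption made to keep the example short); `Node`, `canonical`, and `in_canonical` are hypothetical names.

```python
import math


class Node:
    """Leaf-oriented search tree node over a sorted list of distinct keys."""

    def __init__(self, keys, parent=None):
        self.keys = keys          # canonical subset P_w
        self.lo = keys[0]         # min P_w
        self.parent = parent
        self.children = []
        if len(keys) > 1:
            mid = len(keys) // 2
            self.children = [Node(keys[:mid], self), Node(keys[mid:], self)]


def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)


def canonical(node, q):
    """Maximal nodes whose subsets lie in [q, inf): the set C(T, [q, inf))."""
    if node.lo >= q:
        return [node]
    return [w for child in node.children for w in canonical(child, q)]


def in_canonical(w, q):
    """Constant-time membership test of Observation 8."""
    x_p = w.parent.lo if w.parent is not None else -math.inf
    return q <= w.lo and q > x_p
```

For the keys 1, 3, ..., 15 and `q = 6`, both methods select the leaf holding 7 and the subtree holding 9, 11, 13, 15.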
Finally, we delete all certificates involving no longer existing nodes from our global event queue, and replace them by all newly created certificates. This takes $O(\log n)$ time per certificate. We charge the cost of deleting a certificate to when it gets created. Since every node $w$ affected creates at most $O(\log^3 n)$ new certificates, all that remains is to bound the total number of affected nodes. We can show this using basically the same argument as we used to bound the update time. This leads to the following result.
Lemma 9. Inserting a disjoint square into $P$, or deleting a square from $P$, takes $O(\log^7 n)$ amortized time.
Proof. An update visits at most $O(\log^3 n)$ nodes itself (i.e. leaf nodes and nodes on the search path). All other affected nodes occur as newly built trees due to rebalancing operations. As in Lemma 7, the total number of nodes created due to rotations in a sequence of $m$ updates is $O(m \log^2 n)$. It follows that the total number of affected nodes in such a sequence is $O(m \log^3 n)$. Therefore, we create $O(m \log^6 n)$ linking certificates in total, and we can compute them in $O(m \log^6 n)$ time. Updating the global event queue therefore takes $O(m \log^7 n)$ time.
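The global event queue used in the update procedure could be sketched as follows. The text only requires that a certificate can be added or deleted in $O(\log n)$ time; this sketch swaps in lazy deletion on top of a binary heap, so a deleted certificate is only physically discarded when it reaches the top. The class `EventQueue` and its methods are hypothetical names.

```python
import heapq
import itertools


class EventQueue:
    """Priority queue of certificates keyed by failure time (lazy deletion)."""

    def __init__(self):
        self._heap = []
        self._alive = set()
        self._ids = itertools.count()

    def add(self, time, payload):
        """Schedule a certificate; returns a handle for later removal."""
        cid = next(self._ids)
        self._alive.add(cid)
        # The unique id breaks ties, so payloads are never compared.
        heapq.heappush(self._heap, (time, cid, payload))
        return cid

    def remove(self, cid):
        """Invalidate a certificate, e.g. because one of its nodes was rebuilt."""
        self._alive.discard(cid)

    def pop(self):
        """Return (time, payload) of the earliest valid certificate, or None."""
        while self._heap:
            time, cid, payload = heapq.heappop(self._heap)
            if cid in self._alive:
                self._alive.remove(cid)
                return time, payload
        return None
```

Each certificate is pushed and popped at most once, which matches the charging argument above: the cost of deleting a certificate is paid for when it is created.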

Running the Simulation
All that remains is to analyze the number of events processed. We show that in a sequence of $m$ operations, our data structure processes at most $O(m\alpha(n)\log^3 n)$ events. This leads to the following result.
Theorem 10. We can maintain a set $P$ of $n$ disjoint growing squares in a fully dynamic data structure such that we can detect the first time that a square $\square_q$ intersects with a square $\square_p$, with $p \in D^+(q)$. Our data structure uses $O(n(\log n \log\log n)^2)$ space, supports updates in $O(\log^7 n)$ amortized time, and queries in $O(\log^3 n)$ time. For a sequence of $m$ operations, the structure processes a total of $O(m\alpha(n)\log^3 n)$ events in a total of $O(m\alpha(n)\log^7 n)$ time.
Proof. We argued the bounds on the space, the query, and the update times before. All that remains is to bound the number of events processed, and the time to do so.
We start with the observation that each failure of a linking certificate produces an intersection, and thus a subsequent update. It thus follows that the number of such events is at most $m$.
To bound the number of events created by the tournament trees we extend the argument of Agarwal et al. [1]. For any kinetic tournament node $w$ in $T^L$, the minimum x-coordinate corresponds to a lower envelope of line segments in the $(t, x)$-space. This envelope has complexity $O(|P^*_w|\alpha(n))$, where $P^*_w$ is the multiset of points that ever occur in $P_w$, i.e. that are stored in a leaf of the subtree rooted at $w$ at some time $t$. Hence, the number of tournament events involving node $w$ is also at most $O(|P^*_w|\alpha(n))$. It then follows that the total number of events is proportional to the size of these sets $P^*_w$, over all nodes $w$ in our tree. As in Lemma 7, every update directly contributes one point to $O(\log^3 n)$ nodes. The remaining contribution is due to rebalancing operations, and this cost is again bounded by $O(m \log^2 n)$. Thus, the total number of events processed is $O(m\alpha(n)\log^3 n)$.
At every event, we have to update the $O(\log^3 n)$ linking certificates of $w$. This can be done in $O(\log^4 n)$ time (including the time to update the global event queue). Thus, the total time for processing all kinetic tournament events in $T^L$ is $O(m\alpha(n)\log^7 n)$. The analysis for the kinetic tournament nodes $z$ in $T^R$ is analogous.
To simulate the process of growing the squares in $P$, we now maintain eight copies of the data structure from Theorem 10: two data structures for each quadrant (one for $D^+$, the other for $D^-$). We thus obtain the following result.
Theorem 11. We can maintain a set $P$ of $n$ disjoint growing squares in a fully dynamic data structure such that we can detect the first time that two squares in $P$ intersect. Our data structure uses $O(n(\log n \log\log n)^2)$ space, supports updates in $O(\log^7 n)$ amortized time, and queries in $O(\log^3 n)$ time. For a sequence of $m$ operations, the structure processes $O(m\alpha(n)\log^3 n)$ events in a total of $O(m\alpha(n)\log^7 n)$ time.
And thus we obtain the following solution to the agglomerative glyph clustering problem.
Theorem 12. Given a set of $n$ initial square glyphs $P$, we can compute an agglomerative clustering of the squares in $P$ in $O(n\alpha(n)\log^7 n)$ time using $O(n(\log n \log\log n)^2)$ space.
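The step from Theorem 11 to Theorem 12 rests on counting the operations in one run of the simulation. The short derivation below spells this out, under the assumption (suggested by the formal problem statement, but not stated as such) that every merge is carried out as two deletions followed by one insertion.

```latex
\begin{align*}
  \#\text{merges} &\le n - 1, \\
  m &\le \underbrace{n}_{\text{initial insertions}}
       + \underbrace{3(n-1)}_{\text{two deletions and one insertion per merge}}
     = O(n), \\
  \text{total time} &= O\bigl(m\,\alpha(n)\log^7 n\bigr)
     = O\bigl(n\,\alpha(n)\log^7 n\bigr).
\end{align*}
```

Since the structure never holds more than $n$ squares at a time, the space bound of Theorem 11 carries over unchanged.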

Efficient Representation of Dominance Relations
The linking certificates of our data structure actually comprise an efficient representation of all dominance relations between two point sets. We therefore think that this representation, and in particular the tighter analysis in this section, is of independent interest. Let $R$ and $B$ be two point sets in $\mathbb{R}^d$ with $|R| = n$ and $|B| = m$, and let $T_R$ and $T_B$ be range trees built on $R$ and $B$, respectively. We assume that each layer of $T_R$ and $T_B$ consists of a bb[$\alpha$]-tree, although similar analyses can be performed for other types of balanced binary search trees. By definition, every node $u$ on the lowest layer of $T_R$ or $T_B$ has an associated $d$-dimensional range $Q_u$ (the hyper-box, not the subset of points). For a node $u \in T_R$, we consider the subset of points in $B$ that dominate all points in $Q_u$, which can be comprised of $O(\log^d m)$ canonical subsets of $B$, represented by nodes in $T_B$. Similarly, for a node $v \in T_B$, we consider the subset of points in $R$ that are dominated by all points in $Q_v$, which can be comprised of $O(\log^d n)$ canonical subsets of $R$, represented by nodes in $T_R$. We now link a node $u \in T_R$ and a node $v \in T_B$ if and only if $v$ represents such a canonical subset for $u$ and vice versa. By repeatedly applying Lemma 5 for each dimension, it can easily be shown that these links represent all dominance relations between $R$ and $B$.
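For $d = 1$ the linking rule is small enough to try out directly. The sketch below is an illustration under simplifying assumptions: it uses perfectly balanced leaf-oriented trees instead of bb[$\alpha$]-trees, takes the range of a node to be the interval spanned by its keys, and computes all links by brute force; `Node`, `canonical`, and `links` are hypothetical names.

```python
class Node:
    """Leaf-oriented search tree node over a sorted list of distinct keys."""

    def __init__(self, keys, parent=None):
        self.keys = keys
        self.lo, self.hi = keys[0], keys[-1]
        self.parent = parent
        self.children = []
        if len(keys) > 1:
            mid = len(keys) // 2
            self.children = [Node(keys[:mid], self), Node(keys[mid:], self)]

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()


def canonical(node, inside):
    """Maximal nodes satisfying `inside` (a predicate closed under descent)."""
    if inside(node):
        return [node]
    return [w for child in node.children for w in canonical(child, inside)]


def links(t_r, t_b):
    """All pairs (r, b) that occur in each other's canonical subsets."""
    result = []
    for r in t_r.nodes():
        # Nodes of T_B representing the points of B that dominate all of P_r.
        for b in canonical(t_b, lambda v: v.lo >= r.hi):
            # Mutual condition: r must represent points dominated by all of P_b.
            if any(u is r for u in canonical(t_r, lambda u: u.hi <= b.lo)):
                result.append((r, b))
    return result
```

For `R = [1, 4]` and `B = [2, 5]` this yields two links, (root of $T_R$, leaf 5) and (leaf 1, root of $T_B$), which together cover the three dominance pairs (1, 2), (1, 5), and (4, 5); a pair may be covered by more than one link.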
As a $d$-dimensional range tree consists of $O(n \log^{d-1} n)$ nodes, a trivial bound on the number of links is $O(m \log^{2d-1} n)$ (assuming $n \ge m$). Below we show that the number of links can be bounded by $O(n (\log n \log\log n)^{d-1})$. We first consider the case $d = 1$.

Analyzing the Number of Links in 1D
Let $R$ and $B$ be point sets in $\mathbb{R}$ with $|R| = n$, $|B| = m$, and $n \ge m$. Now, every associated range of a node $u$ in $T_R$ or $T_B$ is an interval $I_u$. We can extend the interval to infinity in one direction: to the left for $u \in T_R$, and to the right for $u \in T_B$. For analysis purposes we construct another range tree $T$ on $R \cup B$, where $T$ is not a bb[α]-tree but a perfectly balanced tree with height $\lceil \log(n + m) \rceil$. For convenience we assume that the associated intervals of $T$ are slightly expanded so that all points in $R \cup B$ are always interior to the associated intervals. We associate a node $u$ in $T_R$ or $T_B$ with a node $v$ in $T$ if the endpoint of $I_u$ is contained in the associated interval $I_v$ of $v$.
Observation 13. Every node of $T_R$ or $T_B$ is associated with at most one node per level of $T$.
For two intervals $I_u = (-\infty, a]$ and $I_v = [b, \infty)$, corresponding to a node $u \in T_R$ and a node $v \in T_B$, let $[a, b]$ be the spanning interval of $u$ and $v$. We now want to charge spanning intervals of links to nodes of $T$. We charge a spanning interval $[a, b]$ to a node $w$ of $T$ if and only if $[a, b] \subseteq I_w$ and $[a, b]$ is cut by the splitting coordinate of $w$. Clearly, every spanning interval can be charged to exactly one node of $T$. Now, for a node $u$ of $T$, let $h_R(u)$ be the height of the highest node of $T_R$ associated with $u$, and let $h_B(u)$ be the height of the highest node of $T_B$ associated with $u$.

Lemma 14. The number of spanning intervals charged to a node $u$ of $T$ is $O(h_R(u) \cdot h_B(u))$.
Proof. Let $x$ be the splitting coordinate of $u$ and let $r \in T_R$ and $b \in T_B$ form a spanning interval that is charged to $u$. We claim that, using the notation introduced in Lemma 5, $r \in$ […], then the right endpoint of $I_r$ must lie between $x$ and […]. But then the spanning interval of $r$ and $b$ would not be charged to $u$. As a result, we can only charge spanning intervals between $h_R(u)$ nodes of $T_R$ and $h_B(u)$ nodes of $T_B$, of which there are at most $O(h_R(u) \cdot h_B(u))$.
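Using Lemma 14, we count the total number of charged spanning intervals and hence links between $T_R$ and $T_B$. We refer to this number as $\mathit{numLinks}(T_R, T_B)$. This is simply $\sum_{u \in T} O(h_R(u) \cdot h_B(u))$. We can split the sum and assume w.l.o.g. that $h_R(u) \ge h_B(u)$, so that

$$\mathit{numLinks}(T_R, T_B) \le \sum_{u \in T} O\bigl(h_R(u)^2\bigr) \le \sum_{h_R} O\bigl(n_T(h_R) \cdot h_R^2\bigr),$$

where $n_T(h_R)$ is the number of nodes of $T$ that have a node of height $h_R$ associated with it.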
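To bound $n_T(h)$ we use Observation 13 and the fact that $T_R$ is a bb[α]-tree. Let $c = \frac{1}{1-\alpha}$. Using this bound on $n_T(h)$ in the sum we previously obtained gives:

$$\mathit{numLinks}(T_R, T_B) \le \sum_{h} O\bigl(n_T(h) \cdot h^2\bigr) = O(n + m) \cdot \sum_{h=0}^{\infty} \frac{h^3}{c^h} = O(n + m),$$

where the last step holds because the series converges for $c > 1$.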

Extending to Higher Dimensions
We now extend the bound to $d$ dimensions. The idea is simple. We first determine the links for the top layer of the range trees. This results in links between associated range trees of $d - 1$ dimensions (see Fig. 7). We then determine the links within the linked associated trees, the number of which can be bounded by induction on $d$.
Theorem 17. The number of links between two $d$-dimensional range trees $T_R$ and $T_B$ containing $n$ and $m$ ($n \ge m$) points, respectively, is bounded by $O(n (\log n \log\log n)^{d-1})$.
Proof. We show by induction on $d$ that the number of links is bounded by the minimum of $O(n (\log n \log\log n)^{d-1})$ and $O(m \log^{2d-1} n)$. The second bound is simply the trivial bound given at the start of Section 4. The base case $d = 1$ is provided by Theorem 16. Now consider the case $d > 1$. We first determine the links for the top layer of $T_R$ and $T_B$. Now consider the links between an associated tree $T_u$ in $T_R$ containing $k$ points and other associated trees $T_0, \ldots, T_r$ that contain at most $k$ points. Since $T_u$ can be linked with only one associated tree per level, and because both range trees use bb[α]-trees, the numbers of points $m_0, \ldots, m_r$ in $T_0, \ldots, T_r$ satisfy $m_i \le k / c^i$ ($0 \le i \le r$), where $c = \frac{1}{1-\alpha}$. By induction, the number of links between $T_u$ and $T_i$ is bounded by the minimum of $O(k (\log n \log\log n)^{d-2})$ and $O(m_i \log^{2d-3} n)$. Now let $i^* = \log_c(\log^{d-1} n) = O(\log\log n)$. Then, for $i \ge i^*$, we get that $O(m_i \log^{2d-3} n) = O(k \log^{d-2} n)$. Since the sizes of the associated trees decrease geometrically, the total number of links between $T_u$ and $T_i$ for $i \ge i^*$ is bounded by $O(k \log^{d-2} n)$. The links with the remaining $O(\log\log n)$ trees can be bounded by $O(k \log^{d-2} n (\log\log n)^{d-1})$. Finally note that the top layer of each range tree has $O(\log n)$ levels, and that each level contains $n$ points in total. Thus, we obtain $O(n \log^{d-1} n (\log\log n)^{d-1})$ links in total. The remaining links, for which the associated tree in $T_B$ is larger than in $T_R$, can be bounded in the same way.
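The two sums in the proof can be written out explicitly. The following is a worked restatement for a fixed associated tree $T_u$ with $k$ points, using only the quantities defined in the proof and assuming, as the proof implicitly does, that $d$ is a constant, so that $c^{i^*} = \log^{d-1} n$ and $i^* = (d-1) \log_c \log n = O(\log\log n)$.

```latex
\begin{align*}
\sum_{i \ge i^*} O\bigl(m_i \log^{2d-3} n\bigr)
  &\le O\bigl(\log^{2d-3} n\bigr) \sum_{i \ge i^*} \frac{k}{c^{i}}
   = O\Bigl(\frac{k \log^{2d-3} n}{c^{i^*}}\Bigr)
   = O\Bigl(\frac{k \log^{2d-3} n}{\log^{d-1} n}\Bigr)
   = O\bigl(k \log^{d-2} n\bigr), \\
\sum_{i < i^*} O\bigl(k (\log n \log\log n)^{d-2}\bigr)
  &= O\bigl(i^* \cdot k (\log n \log\log n)^{d-2}\bigr)
   = O\bigl(k \log^{d-2} n \, (\log\log n)^{d-1}\bigr).
\end{align*}
```

The second sum dominates, and summing it over the $O(\log n)$ levels of the top layer, each holding $n$ points in total, gives the bound of Theorem 17.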
It follows from Theorem 17 that our data structure from Section 3 actually maintains only $O(n (\log n \log\log n)^2)$ certificates. This directly implies that the space usage is only $O(n (\log n \log\log n)^2)$ as well.

Conclusion and Future Work
We presented an efficient fully dynamic data structure for maintaining a set of disjoint growing squares. This leads to an efficient algorithm for agglomerative glyph clustering. The main future challenge is to improve the analysis of the running time. Our analysis from Section 4 shows that, at any time, we need only few linking certificates. However, we would like to bound the total number of linking certificates used throughout the entire sequence of operations. An interesting question is whether we can extend our argument to this case. This may also lead to a more efficient algorithm for maintaining the linking certificates during updates.

Figure 1. Zooming out in GlamMap will merge overlapping squares. This figure shows a sequence of three steps zooming out from the surroundings of Leipzig.

Figure 2. The timeline of squares that grow and merge as they touch.

Figure 4. The points $m_z$ and $m_w$ are defined by a pair of nodes $z \in X^R_v$, with $v \in T_u$, and $w \in X^L_v$, with $v \in T_u$. If $w \in Q^L(m_z)$ and $z \in Q(m_w)$ then we add a linking certificate between the rightmost upper-right vertex $r_q$, $q \in P_z$, and the leftmost bottom-left vertex $p$, $p \in P_w$.

Lemma 4. Every kinetic tournament node is involved in $O(\log^3 n)$ linking certificates, and thus every point $p$ is associated with at most $O(\log^6 n)$ certificates.

Lemma 5.

Figure 5. The nodes $b$ and $r$ in the trees $T_B$ and $T_R$.

Figure 6. After a left rotation around an edge $(\mu, \nu)$, the associated data structure $T_\mu$ of node $\mu$ (pink) has to be rebuilt from scratch as its canonical subset has changed. For node $\nu$ we can simply use the old associated data of node $\mu$. No other nodes are affected.

Figure 7. Two layered trees with two layers, and the links between them (sketched in black). We are interested in bounding the number of such links.

$\sum_{h=0}^{\infty} \frac{h^3}{c^h} = O(1)$ because $c > 1$. Thus, we conclude:

Theorem 16. The number of links between two 1-dimensional range trees $T_R$ and $T_B$ containing $n$ and $m$ points, respectively, is bounded by $O(n + m)$.
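The constant hidden in the $O(1)$ bound on the series can be made explicit. Assuming only that $c > 1$ is a constant, the standard closed form for $\sum_h h^3 x^h$ with $x = 1/c$ gives the following; for example, $c = 2$ yields the value 26.

```latex
\[
\sum_{h=0}^{\infty} h^3 x^h = \frac{x\,(1 + 4x + x^2)}{(1 - x)^4}
\quad (|x| < 1),
\qquad\text{so}\qquad
\sum_{h=0}^{\infty} \frac{h^3}{c^h} = \frac{c\,(c^2 + 4c + 1)}{(c - 1)^4}.
\]
```

Since $c = \frac{1}{1-\alpha}$ depends only on the balance parameter $\alpha$, this value is a constant independent of $n$ and $m$, although it grows quickly as $c$ approaches 1.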
y, and $p(t^*)$ is the point with minimum $y$-coordinate among the points in $L^-(q)(t^*)$ at time $t^*$, if and only if $p \in D^-(q)$, and (ii) $r_q(t^*)_x = p(t^*)_x$, and $p(t^*)$ is the point with minimum $x$-coordinate among the points in $L^+(q)(t^*)$ at time $t^*$, otherwise (i.e. if and only if $p \in D^+(q)$).
Let $c = \frac{1}{1-\alpha}$; then we get that $\mathrm{height}(T_R) \le \log_c(n)$ from properties of bb[α]-trees. Therefore, the number of nodes in $T_R$ that have height $h$ is at most $O(n / c^h)$. Consider cutting the tree $T$ at level $\log(n / c^h)$. This results in a top tree of size $O(n / c^h)$, and $O(n / c^h)$ bottom trees. Clearly, the top tree contributes at most its size to $n_T(h)$. All bottom trees have height at most $\log(n + m) - \log(n / c^h) = O(\log(c^h) + \log(1 + m/n)) = O(h + m/n)$. Every