
Introduction

Massively Multiplayer Online Games (MMOGs) [17] have become very popular in recent decades and provide a powerful platform for disseminating, retrieving, and analyzing information. An MMOG is now recognized as a large repository of social-interaction data covering a variety of interaction and visualization types, as well as a knowledge base in which informative MMOG knowledge is hidden. However, users often face information overload because of the rapid growth in both the amount of information and the number of users. In particular, MMOG users frequently have difficulty finding desirable and accurate information because of two resulting problems: low fidelity and increased lag. For example, if a user navigates to a desired scene with a navigation tool such as a mouse, the 3D scene management system returns not only scenic content related to the query topic but also a large amount of irrelevant content, which makes it difficult for users to obtain exactly the content they need. These issues pose significant challenges for MMOG researchers working on effective and efficient 3D content-based information management and retrieval. On the other hand, MMOG data has distinctive features compared with data in conventional database management systems: it is huge in volume, distributed, heterogeneous, unstructured, real-time, online, and dynamic. To deal with this heterogeneity and complexity, the MMOG community has emerged as a new and efficient means of managing 3D game data and modeling MMOG objects. Unlike conventional database management, in which data models and schemas are well defined, an MMOG community, which is a set of game-based objects (3D objects and users), has its own logical and geometric structures. MMOG communities can be modeled as 3D game groups, user clusters, and co-clusters of 3D content and users. MMOG community construction is realized through various computer graphics approaches: textual, scene, navigation, rendering, or semantic-based analysis. Recently, social network analysis in MMOGs has become an active topic owing to the prevalence of Web 2.0 technologies, giving rise to the inter-disciplinary research area of social networking. Social networking refers to the process of capturing the social and societal characteristics of networked structures or communities over the MMOG. Social networking research combines a variety of research paradigms, such as data mining, MMOG communities, social network analysis, and behavioral and cognitive modeling.

However, quantifying collective human behavior or collecting MMOG/social dynamics data is a difficult challenge [5, 6, 8, 9]. It is remarkable that, to some extent, humanity knows more about the dynamics of atomic particles than about the dynamics of human groups. The reason is that establishing a fully experimental and falsifiable social science of group dynamics is greatly complicated by the following factors. First, unlike other problems in the social and natural sciences, the dynamics of social behavior constitute a complex system, characterized by implicit/explicit and short/long-range interactions, which are in general not tractable by traditional mathematical methods and concepts. Second, data is seldom available and often of poor quality [8, 10–13]; social interaction data is clearly much harder to obtain from social systems than data from other scientific systems. Moreover, many complex systems cannot be understood without their surroundings, contexts, or boundaries, together with the interactions between those boundaries and the system itself, and this is obviously necessary for measuring large-scale dynamics of human groups. Regarding data acquisition, it is therefore essential not only to record the decisions of individual humans but also the simultaneous state of their surroundings. Further, in any data-driven science the observed system should not be significantly perturbed by the act of measurement, yet in social science experiments subjects are usually fully aware of being observed, a fact that may strongly influence their behavior. Finally, data acquisition in the social sciences becomes especially tiresome at the group level; see, e.g., [4, 14]. Traditional methods of social science, such as interviews and questionnaires, not only need a lot of time and resources to deliver statistically meaningful assertions, but may also introduce well-known biases [8]. To many it might seem that the social sciences cannot overcome these problems and would therefore always remain on a lower quantitative and qualitative level than the natural sciences.

Among the different ways of modeling social behavior and interaction with other people, a virtual environment or walkthrough system may be a better and more practical choice. From a scientific point of view, online games provide a tool for understanding collective human phenomena and social dynamics on an entirely different scale [5, 6]. In these games, all information about all actions taken by all players can easily be recorded and stored in log files at practically no cost. This quantity of data was unthinkable in the traditional social sciences, where sample sizes often do not exceed several dozen questionnaires, school classes, or students in behavioral experiments. In MMOGs, on the other hand, the number of subjects can reach several hundred thousand, with millions of recorded actions. The actions of individual players are known in conjunction with their surroundings, i.e., the circumstances under which particular actions or decisions were taken. This offers a unique opportunity to study a complex social system: the conditions under which individuals take decisions can in principle be controlled, and the specific outcomes of decisions can be measured. In this respect, social science is on the verge of becoming a fully experimental science [10] that should increasingly be capable of making repeatable and eventually falsifiable statements about collective human behavior, in both social and economic contexts.

Another advantage over traditional ways of data acquisition in the social sciences is that players of MMOGs do not consciously notice the measurement process. These "social experiments" practically do not perturb or influence the sample. Moreover, MMOGs not only open ways to explore sociological questions but, if economic aspects are part of the game (as they are in many MMOGs), also allow the economic behavior of groups to be studied. Here again, economic actions and decisions can be monitored for a huge number of individual players within their social and economic contexts. This means that MMOGs offer a natural environment for conducting behavioral economics experiments, which have been of great interest in numerous small-scale surveys; see, e.g., [4–7, 9]. It becomes possible to study the socioeconomic unit of a large online game society. Based on the above discussion, we adopt and build a walkthrough system to simulate interactions between users and 3D scenes and to explore the hidden knowledge that predicts the future behavior of users.

Recent advances in storage technology make it possible to store and keep a series of large Web archives. It is now an exciting challenge to observe the evolution of the Web, since it has experienced dramatic growth and dynamic changes in its structure. Many phenomena on the Web correspond to social activities in the real world. For example, if some topic becomes popular in the real world, many games about the topic are created; good-quality Web pages are then pointed to by public bookmarks or link lists for that topic, and these games become densely connected.

In recent decades there has been increasing interest in developing techniques to handle different 3D applications, such as MMOG models and 3D walkthrough (WT) systems. Many query processing techniques have been proposed [15–17] to overcome the problems faced at the large scale of such 3D applications. Most research on 3D spatial databases has focused on Euclidean space, where the distances between objects are determined by their relative positions in space. However, operations on 3D spatial data with an underlying shape topology do not rely solely on geographical locations. In WT systems, both the topological and the geographical properties of the underlying network are important; the topological properties are usually represented as a finite collection of points, lines, and surfaces. Because of the serious bottleneck in transferring massive data from disk to main memory, current research focuses mainly on out-of-core rendering systems, which, in addition to implementing sophisticated techniques for culling, simplification, GPU-based rendering, etc., must manage caches for interactive out-of-core processing of massive data. On the other hand, efforts devoted to laying data out on disk for efficient access have seldom been proposed; in particular, layouts that reflect semantic properties have been neglected and have attracted little attention.

The remainder of this paper is organized as follows. Related work is discussed in section "Related Works." Section "Motivating Examples" presents our observations and motivations. Section "Problem Formulation and Graph-Based Model" describes the proposed hypergraph model and the problem formulation, and section "Data Layout Algorithm Based on Hypergraph Structure" explains the recommended clustering mechanism with illustrative examples. Section "Experimental Evaluation" then presents the experimental results. Conclusions are finally drawn in section "Conclusions and Future Works," along with directions for future research.

Related Works

In this section, we briefly review related work on virtual environments, sequential pattern mining, and pattern clustering.

Prefetching and Caching Methods Based on Spatial Access Models

Most earlier research has assumed that WT data is memory resident. Only recently has managing large WT systems become an active research area [15–17]. Most works adopt spatial indexes to organize, manipulate, and retrieve the data in large WT systems.

Related work on prefetching and caching methods involves several aspects. First, data-related concerns address the organization of objects, such as object partitioning, connectivity, and block size; level-of-detail (LOD) management [18–20], hierarchical view-frustum and occlusion culling, and working-set management (geometry caching) [20–25] are examples. Second, traversal-related concerns focus on reducing access times for objects. Traditional cache-efficient layouts [15, 20, 21, 26] (also called cache-aware layouts) rely on knowledge of cache parameters and exploit different localities to reduce the number of cache misses and the size of the working set. A further variation is cache-oblivious algorithms [15, 20, 26], which require no knowledge of the cache parameters or block sizes of the memory hierarchy involved in the computation. In addition, the large polygon counts of such highly complex scenes require a lot of hard disk space, so the additional data can exceed the available capacity [20, 21, 26, 27]. Moreover, the semantics of data access is more important in defining the placement policy [16, 17, 28–30]. To meet these requirements, an appropriate data structure and an efficient technique should be developed under memory-consumption constraints.

Hypergraph-Based Pattern Clustering Methods

The fundamental clustering problem is to partition a given data set into groups (clusters) such that data points in a cluster are more similar to each other (intra-cluster similarity) than to points in different clusters (inter-cluster similarity) [31]. The discovered clusters are used to explain the characteristics of the data distribution [31]. However, such schemes fail to produce meaningful clusters if the number of objects is large or if the dimensionality of the WT data (i.e., the number of different features) is diverse and relatively large.

Motivating Examples

Well-known drawbacks of traditional geometric scene modelers make the intuitive design of 3D scenes difficult, and sometimes even impossible [16–18, 20, 21, 26, 31]. In this section we discuss intelligent storage layout modeling, that is, modeling that applies intelligent techniques during the design process and thus allows intuitive storage layout design. The following subsections explain our observations and motivations.

Motivations on Theoretical Foundations

Data mining [16, 17, 19, 32], an area of artificial intelligence, deals with finding hidden or unexpected relationships among data items and grouping related items together. The two basic relationships of particular concern to us are:

  • Association: when one object occurs, it is highly probable that the other will also occur.

  • Sequence: the data items are associated and, in addition, the order of occurrence is also known.

Association rule mining [19, 32] aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories. In particular, a significant hidden relationship is the concept of association. More formally, let I = {i_1, i_2, …, i_n} denote a set of literals, called items. A transaction T contains an itemset X if X ⊆ T. Let D represent a set of transactions such that ∀T ∈ D, T ⊆ I. A set X ⊆ I is called an itemset, and an itemset with k items is called a k-itemset. An association rule is an implication of the form \( X \Rightarrow Y \), where X ⊆ I, Y ⊆ I, and \( X \cap Y = \emptyset \). The rule \( X \Rightarrow Y \) is said to hold in the transaction set D with support s if s% of the transactions in D contain \( X \cup Y \), and it has confidence c if c% of the transactions in D that contain X also contain Y. The thresholds for support and confidence are called minsup and minconf, respectively. The support of an itemset X, denoted σ(X), is the number of transactions in which X occurs as a subset, i.e., σ(X) = |t(X)|, where t(X) is the set of transactions containing X. An itemset is called a frequent pattern [19, 32] if its support is greater than or equal to a minimum support threshold (minsup), i.e., if σ(X) ≥ minsup.
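
To make these definitions concrete, the following is a minimal, illustrative Java sketch (our own, not taken from the paper) that computes the support and confidence of a rule over a small, made-up transaction set; the five hypothetical transactions merely stand in for a table such as Table 11.1.

import java.util.*;

/** Illustrative sketch (not the authors' code) of support and confidence
 *  for an association rule X => Y over a transaction database D. */
public class AssociationRuleDemo {

    // sigma(X): number of transactions that contain every item of X
    static long support(List<Set<String>> D, Set<String> X) {
        return D.stream().filter(t -> t.containsAll(X)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> D = List.of(            // hypothetical sessions
            Set.of("a", "f", "b"),
            Set.of("c", "d"),
            Set.of("a", "f", "g"),
            Set.of("e", "c", "d"),
            Set.of("b", "d"));

        Set<String> X  = Set.of("a");
        Set<String> XY = Set.of("a", "f");

        double supp = (double) support(D, XY) / D.size();        // support of a => f
        double conf = (double) support(D, XY) / support(D, X);   // confidence of a => f

        System.out.printf("support(a=>f)=%.2f confidence(a=>f)=%.2f%n", supp, conf);
    }
}

With these made-up sessions the rule a ⇒ f reaches 40% support and 100% confidence, the thresholds used in the motivating example below.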

Motivations on Practical Demands

Suppose that we have a set of data items {a, b, c, d, e, f, g}. A sample access history over these items, consisting of five sessions, is shown in Table 11.1. The request sequences extracted from this history with minimum support 40% are (a, f) and (c, d). The rules obtained from these sequences with 100% minimum confidence are \( a \Rightarrow f \) and \( c \Rightarrow d \), as shown in Table 11.2. Two organizations of the accessed data are depicted in Fig. 11.1. An access schedule without any intelligent preprocessing is shown in Fig. 11.1a, and a schedule in which related items are grouped together and sorted with respect to the order of reference is shown in Fig. 11.1b. Assume that the disk spins counterclockwise and consider the following client request sequence, a, f, b, c, d, a, f, g, e, c, d, shown in Fig. 11.1. Note that the dashed lines indicate that the first element in the request sequence (counted from left to right) is to be fetched as the first item supplied by the disk, and the directed arc denotes the counterclockwise rotation of the disk layout. For this request, if we use the access schedule (a, b, c, d, e, f, g), which does not take the rules into account, the per-item I/O access times for the client will be a:5, f:5, b:3, c:2, d:6, a:5, f:5, g:1, e:5, c:6, d:6. The total access time is 49 and the average latency is 49/11 = 4.454. However, if we partition the items to be accessed into two groups with respect to the sequential patterns obtained after mining, we obtain {a, b, f} and {c, d, e, g}; data items that appear in the same sequential pattern are placed in the same group. When we sort the data items in each group with respect to the rules \( a \Rightarrow f \) and \( c \Rightarrow d \), we obtain the sequences (a, f, b) and (c, d, g, e). Organizing the data items according to these sorted groups yields the access schedule shown in Fig. 11.1b. In this case, the per-item access times for the same request pattern will be a:1, f:1, b:1, c:1, d:1, a:3, f:1, g:4, e:1, c:4, d:1. The total access time is 19 and the average latency is 19/11 = 1.727, much lower than 4.454. A simulation sketch of this latency model is given after Fig. 11.1.

Table 11.1 Sample database of user requests
Table 11.2 Sample association rules
Fig. 11.1 Effects of accessed-object organization on disk. (a) without association rules; (b) with association rules
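
The following Java sketch is our own simplified reading of the rotational-latency model behind this example (not the authors' simulator). It assumes one slot per item, a head that advances to the item just served, a per-request cost equal to the number of slots rotated counterclockwise, and a head that starts one slot before a. Under these assumptions the clustered layout of Fig. 11.1b reproduces the total of 19 reported above; the per-request costs of the unclustered layout additionally depend on the exact placement and start position drawn in Fig. 11.1a.

import java.util.*;

/** Simplified rotational-latency model (an assumption for illustration):
 *  serving an item costs the number of slots the disk rotates from the
 *  current head slot to that item's slot. */
public class DiskLatencyDemo {

    static int totalLatency(List<String> layout, List<String> requests, int startSlot) {
        int n = layout.size(), head = startSlot, total = 0;
        for (String item : requests) {
            int target = layout.indexOf(item);
            int cost = Math.floorMod(target - head, n);   // counterclockwise gap
            total += cost;
            head = target;                                // head now sits on the item
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> requests = List.of("a","f","b","c","d","a","f","g","e","c","d");
        // clustered layout derived from the rules a => f and c => d
        List<String> clustered = List.of("a","f","b","c","d","g","e");
        // head starts one slot before 'a' (the slot of 'e' in this layout)
        System.out.println(totalLatency(clustered, requests, 6));  // prints 19
    }
}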

Another example that demonstrates the benefits of rule-based prefetching is shown in Fig. 11.2. We consider three different client requests. Based on the obtained association rules, predictions can be made. The current request is c, and there is a rule stating that if data item c is requested, then data item d will also be requested (i.e., association rule \( c \Rightarrow d \)). In Fig. 11.2a, data item d is absent from the cache and the client must spend extra waiting time on item d. In Fig. 11.2b, although item d is also absent from the cache, the client spends only one disk latency period on item d. In Fig. 11.2c, the cache can supply item d and no disk latency is needed.

Fig. 11.2 Effects of prefetching

In contrast, our method exploits semantic information about whether a data item is cached and sorts patterns according to the data access pattern observed during scene traversal. This kind of approach not only reduces the number of expensive disk I/O accesses but also achieves high system performance. In the following sections, we explain how we cluster data items with respect to frequent patterns.

Motivations on Intertwined Relationship Demands

In essence, the relationships among patterns can be classified into two kinds. Under intra-similarity, every object carries some importance [32]. For example, the support of the frequent pattern abcd is 5, whereas the supports of the objects a, b, c, and d are 5, 6, 7, and 8, respectively. Under inter-similarity, shown in Fig. 11.3, every frequent pattern carries some importance. For example, the support of the frequent pattern abcd is 5, whereas the supports of the patterns abe, abcde, cd, and df are 5, 4, 6, and 8, respectively. These observations show that the patterns are intertwined through such relationships and should be managed properly and efficiently; they also motivate us to adopt the hypergraph (HG) model for representing the relationships.

Fig. 11.3 Demonstration of intra-/inter-relationships among the frequent patterns

Problem Formulation and Graph-Based Model

In this section, we present a novel application of hypergraph partitioning to automatically determine the computation and I/O schedule. We begin with a definition of the problem and explain the hypergraph partitioning problem. We then present an alternative formulation that better solves our problem of interest.

Clustering is a good candidate for inferring object correlations in storage systems. As mentioned in the previous sections, object correlations can be exploited to improve storage system performance. First, correlations can be used to direct prefetching. For example, if a strong correlation exists between objects a and b, the two objects can be fetched together from disk whenever either of them is accessed. The disk read-ahead optimization is an example of exploiting simple data correlations by prefetching subsequent disk blocks ahead of time. Several studies [10, 33] have shown that using such correlations can significantly improve storage system performance. Our results demonstrate that prefetching based on object correlations improves performance over a non-correlation layout in all cases.

A storage system can also lay out data on disks according to object correlations. For example, an object can be collocated with its correlated objects so that they can be fetched together in a single disk access. This optimization reduces the number of disk seeks and rotations, which dominate the average disk access latency. With correlation-directed disk layouts, the system pays only a one-time seek and rotational delay to fetch multiple objects that are likely to be accessed soon. Previous studies [31, 34] have shown promising results in allocating correlated file blocks on the same track to avoid track-switching costs.

Problem Definition

Suppose that a user-based traversal database consists of a set of visible patterns, with each pattern accessing a set of data objects. The data objects usually reside in secondary storage, and each data object may be accessed or requested by more than one pattern. The objective is to determine a storage access schedule that minimizes the total disk I/O cost.

Hypergraph Partitioning Problem

A hypergraph HG = (V, N) [34, 35] is defined by a set of vertices V and a set of nets (hyperedges) N among those vertices. Every net \( n_j \in N \) is a subset of vertices, i.e., \( n_j \subseteq V \). The size of a net n_j is the number of vertices it contains, i.e., s_j = |n_j|. A weight (w_j) and a cost (c_j) can be assigned to each vertex (\( v_j \in V \)) and each net (\( n_j \in N \)) of the HG, respectively. Π = {V_1, V_2, …, V_K} is a K-way partition of HG if it satisfies the following conditions: (1) each part is a nonempty subset of V, (2) the parts are pairwise disjoint, and (3) the union of the K parts is equal to V.
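
As a minimal illustration of this definition (our own sketch, not part of the original model), the following Java fragment stores nets as vertex subsets and checks the three conditions of a K-way partition.

import java.util.*;

/** Minimal HG = (V, N) sketch: nets are vertex subsets; a K-way partition
 *  must consist of nonempty, pairwise-disjoint parts whose union equals V. */
public class Hypergraph {
    final Set<String> vertices = new HashSet<>();
    final List<Set<String>> nets = new ArrayList<>();      // hyperedges n_j ⊆ V
    final Map<Integer, Double> netCost = new HashMap<>();  // cost c_j per net

    void addNet(Set<String> net, double cost) {
        vertices.addAll(net);
        nets.add(net);
        netCost.put(nets.size() - 1, cost);
    }

    /** Checks conditions (1)-(3) of a K-way partition {V_1, ..., V_K}. */
    boolean isValidPartition(List<Set<String>> parts) {
        Set<String> union = new HashSet<>();
        int covered = 0;
        for (Set<String> p : parts) {
            if (p.isEmpty()) return false;   // (1) each part nonempty
            covered += p.size();
            union.addAll(p);
        }
        // (2) pairwise disjoint iff the sizes add up, (3) union equals V
        return covered == union.size() && union.equals(vertices);
    }
}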

In our model, every object is assigned to one vertex, and every frequent pattern is represented by one hyperedge (net). As shown in Fig. 11.3, objects a, b, c, d, e, and f are circled together in different line styles according to how many objects are involved. Since there are five different patterns, we plot five different nets for demonstration.

Finally, we define our problem in two phases. Phase I: given a frequent pattern set P = {p_1, p_2, …, p_n}, we design an efficient formulation scheme to bridge the two domains of knowledge (i.e., the pattern set P and the HG model). Phase II: in order to reduce disk access time, we distribute P into a set of clusters such that inter-cluster similarity is minimized and intra-cluster similarity is maximized.

Data Layout Algorithm Based on Hypergraph Structure

Several studies [32–35] have shown that using these correlations can significantly improve storage system performance. In this section, we describe a direct application of hypergraph partitioning to the disk I/O minimization problem. The construction of the hypergraph is described, followed by the partitioning procedure used to derive a valid object layout and I/O schedule.

Object-Centric Clustering Scheme

First, as mentioned previously, each object corresponds to a vertex, and each frequent pattern corresponds to a hyperedge. The weight ψ_e of a hyperedge e is defined as 1/|e|, i.e., inversely proportional to the number of objects incident to the hyperedge. Inspired by the main concepts of [35–37], we propose our semantic-based hypergraph clustering scheme. Given two objects u and v, the similarity score d(u,v) between u and v is defined as

$$ d(u,v) = \sum\limits_{e \in E \mid u,v \in e} \frac{\psi_e}{m(u) + m(v)} $$
(11.1)

where e is a hyperedge connecting objects u and v, ψ_e is the corresponding edge weight, and m(u) and m(v) are the interestingness measures of u and v, respectively. As Han et al. [31] noted, the support measure is not a proper measure to use in a hypergraph model; therefore, in our experiments, shown in the next section, the confidence measure is used. The similarity score of two objects is directly proportional to the total sum of the edge weights connecting them and inversely proportional to the sum of their measures. Let N_u be the set of objects neighboring a given object u. We define the closest object to u, denoted c(u), as the neighboring object with the highest similarity score to u, i.e., c(u) = v such that d(u,v) = max {d(u,z) | z ∈ N_u}.
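
A direct reading of Eq. (11.1) can be sketched in Java as follows (illustrative only). Hyperedges with multiplicity greater than one simply appear more than once in the edge list, and the map m holds the chosen interestingness measure for each object.

import java.util.*;

/** Similarity score of Eq. (11.1): sum psi_e / (m(u)+m(v)) over every
 *  hyperedge e containing both u and v, with psi_e = 1/|e|. */
public class SimilarityScore {

    static double score(List<Set<String>> edges, Map<String, Double> m,
                        String u, String v) {
        double d = 0.0;
        for (Set<String> e : edges) {
            if (e.contains(u) && e.contains(v)) {
                double psi = 1.0 / e.size();          // edge weight psi_e
                d += psi / (m.get(u) + m.get(v));     // contribution of this edge
            }
        }
        return d;
    }
}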

In Fig. 11.4, the dotted lines denote which vertices are connected by the hyperedges; the circle specifically represents the hyperedge {A, C, F}. Since the multiplicity of hyperedge {A, C} is two, there are two hyperedges between vertices A and C. The pseudocode of the object-based hypergraph clustering algorithm follows.

Fig. 11.4 The initial condition

Object-Oriented Hypergraph-Based Clustering (OHGC) Algorithm

//D is the database. P is the set of frequent patterns. Obj is the set of objects. T is the set of clusters, and is initially empty.

  • Input: D, P, Obj, and T.

  • Output: T

  • Begin

  • // Phase 1: Initialization step for Priority_Queue (PQ)

  • 1. While (let each object \( u \in Obj \) and Obj is not empty) do

  • 2. Begin

  • 3.  Find closest object v, and its associated similarity score d;

  • 4.   Insert the tuple (u, v, d) into PQ with d as key;

  • 5. End; // while in Phase 1.

  • // Phase 2: Hypergraph Clustering based on PQ

  • 6. While (user-defined cluster number is not reached or top tuple’s score d >0) do

  • 7.   Pick top tuple (u, v, d) from PQ;

  • 8.   Cluster u and v into new object u′ and update the T;

  • 9.   Find the closest object v′ to u′ and its associated similarity score d′;

  • 10.   Insert the tuple (u′, v′, d′) into PQ with d′ as key;

  • 11.   Update the similarity scores of the neighbors of u′;

  • 12.End; // while in Phase 2.

We now explain our clustering algorithm. The main ideas come from both object-based and HG-based mechanisms. Since multiple relationships exist in both object-to-object and pattern-to-pattern forms, ordinary graph models are not sufficient to represent them; this is our main motivation for the HG model.

To identify the globally closest pair of objects (the pair with the highest score), a priority-queue (PQ) data structure is used. There are two phases in our algorithm. In Phase 1, the PQ is built: for each object u in Obj (the object set), the closest object v and its associated similarity score d are found and inserted into the PQ with d as the key. Note that for each object u, only one tuple, containing its closest object v, is inserted and maintained. Owing to its lower computational complexity, this vertex-oriented PQ is more efficient than edge-based methods. In Phase 2, we first pick the top tuple (u, v, d) from the PQ (step 7). If the conditions are satisfied, the pair of objects (u, v) is clustered into a new object u′ (step 8). In steps 9 and 10, the new closest object v′ is found, the set T is updated, and a new tuple (u′, v′, d′) is inserted into the PQ with d′ as the new key. Since clustering changes the vertex connectivity of the HG, some of the previously calculated similarity scores may become invalid; thus, the similarity scores of the neighbors of the new object u′ must be recalculated and the PQ adjusted accordingly (step 11). A condensed code sketch of this loop is given below, and a demonstration example follows in Example 1.
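
The following condensed Java sketch is our own reading of the pseudocode above, not the authors' implementation. Hyperedges are kept as mutable sets of current cluster labels, the edge weight is recomputed as 1/|e| over those labels, and the measure of a newly merged cluster is simply set to 1 (both simplifications relative to the worked example below). Stale PQ tuples whose endpoints have already been merged away are discarded when popped.

import java.util.*;

/** Condensed sketch of the OHGC priority-queue loop (an interpretation of
 *  the pseudocode above, under the simplifications stated in the text). */
public class OhgcSketch {
    record Tuple(String u, String v, double d) {}

    static Set<String> ohgc(Set<String> objects, List<Set<String>> edges,
                            Map<String, Double> m, int targetClusters) {
        Set<String> clusters = new HashSet<>(objects);
        PriorityQueue<Tuple> pq =
                new PriorityQueue<>(Comparator.comparingDouble(Tuple::d).reversed());
        // Phase 1: one (u, closest v, score d) tuple per object
        for (String u : clusters) push(pq, u, clusters, edges, m);

        // Phase 2: repeatedly merge the globally closest pair
        while (clusters.size() > targetClusters && !pq.isEmpty()) {
            Tuple t = pq.poll();
            // discard stale tuples whose endpoints were already merged away
            if (!clusters.contains(t.u()) || !clusters.contains(t.v()) || t.d() <= 0) continue;
            String merged = t.u() + t.v();                  // new cluster label u'
            clusters.remove(t.u()); clusters.remove(t.v()); clusters.add(merged);
            m.put(merged, 1.0);                             // measure of u' (assumed)
            for (Set<String> e : edges)                     // relabel affected hyperedges
                if (e.remove(t.u()) | e.remove(t.v())) e.add(merged);
            push(pq, merged, clusters, edges, m);           // tuple for u' (steps 9-10)
            for (String n : neighbours(merged, edges))      // step 11: refresh neighbours
                push(pq, n, clusters, edges, m);
        }
        return clusters;
    }

    static void push(PriorityQueue<Tuple> pq, String u, Set<String> clusters,
                     List<Set<String>> edges, Map<String, Double> m) {
        Tuple best = null;
        for (String v : clusters) {
            if (v.equals(u)) continue;
            double d = 0.0;
            for (Set<String> e : edges)                     // Eq. (11.1) with psi_e = 1/|e|
                if (e.contains(u) && e.contains(v)) d += (1.0 / e.size()) / (m.get(u) + m.get(v));
            if (d > 0 && (best == null || d > best.d())) best = new Tuple(u, v, d);
        }
        if (best != null) pq.add(best);
    }

    static Set<String> neighbours(String u, List<Set<String>> edges) {
        Set<String> n = new HashSet<>();
        for (Set<String> e : edges) if (e.contains(u)) n.addAll(e);
        n.remove(u);
        return n;
    }
}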

Example 1 (OHGC)

Assume that the system has six objects and eight frequent patterns. Let Obj = {A, B, C, D, E, F} and let the frequent pattern set be P = {P_1 = AB, P_2 = AC, P_3 = AD, P_4 = AE, P_5 = AF, P_6 = AC, P_7 = BC, P_8 = ACF}. Note that the multiplicity of hyperedge {A, C} is two; this is one of the main differences between other methods and ours. We set a level-wise threshold for the multiplicity of frequent patterns: for example, if the support of P_i is less than some fixed constant, say α, then the multiplicity of P_i is set to 1; otherwise, if the support of P_i is greater than α but does not exceed a larger fixed constant, the multiplicity of P_i is set to 2. This idea alleviates the complexity of the HG model for the subsequent partitioning. The initial conditions are shown in Fig. 11.4.

  • Step 1. We start by calculating the similarity scores between A and its neighbors. Let the size of each object be 1. First consider vertices B, D, and E.

    • Since only one hyperedge contains vertices A and B, its weight is W_e = 1/|e| = 1/(A + B) = 1/(1 + 1) = 1/2 = 0.5;

    • d(A, B) = (W_e)/(A + B) = [1/2]/(1 + 1) = 1/4 = 0.25;

    • Similarly, both d(A, D) and d(A, E) have the same value (i.e., 0.25).

  • Step 2. We continue calculating the similarity scores between A and its neighbors. Now consider vertex F.

    • Two hyperedges contain vertices A and F (i.e., 1*{A, F} and 1*{A, C, F}).

    • Case 1: considering {A, F},

    • its weight is W_e = 1/|e| = 1/2; therefore, d(A, F) = (W_e)/(A + F) = 1*[(1/2)/(1 + 1)] = 1/4 = 0.25;

    • Case 2: considering {A, C, F},

    • its weight is W_e = 1/|e| = 1/(1 + 1 + 1) = 1/3; therefore, d(A, F) = (W_e)/(A + F) = 1*[(1/3)/(1 + 1)] = 1/6 = 0.167;

    • To sum up, d(A, F) = 1/4 + 1/6 = 5/12 = 0.416;

  • Step 3. We continue calculating the similarity scores between A and its neighbors. Now consider vertex C.

    • Three hyperedges contain vertices A and C (i.e., 2*{A, C} and 1*{A, C, F}).

    • Case 1: considering 2*{A, C},

    • its weight is W_e = 1/|e| = 1/2; therefore,

    • d(A, C) = 2*(W_e)/(A + C) = 2*[(1/2)/(1 + 1)] = 2*(1/4) = 1/2 = 0.5;

    • Case 2: considering 1*{A, C, F},

    • its weight is W_e = 1/|e| = 1/(1 + 1 + 1) = 1/3; therefore, d(A, C) = (W_e)/(A + C) = 1*[(1/3)/(1 + 1)] = 1/6 = 0.167;

    • To sum up, d(A, C) = 1/2 + 1/6 = 2/3 = 0.667;

  • Step 4. From the above steps, we have the following similarity scores between vertex A and its neighbors.

    • d(A, B) = d(A, D) = d(A, E) = 0.25;

    • d (A, F) = 5/12 = 0.416;

    • d (A, C) = 2/3 = 0.667;

    • Since d (A, C) has the highest similarity score, vertex C is declared as the closest object to A. The result is shown in Fig. 11.5.

      Fig. 11.5 After the computation of steps 1 to 4, vertex C is chosen and merged with vertex A

  • Step 5. The following steps show how to update the similarity scores of the neighbors of vertex A. As shown in Fig. 11.5, we update the related similarity values after vertices A and C have been merged (i.e., vertices A and C are now treated as a single vertex AC). Now consider vertex D.

    • Only one hyperedge contains vertices AC and D (i.e., 1*{AC, D}).

    • Considering {AC, D},

    • its weight is W_e = 1/|e| = 1/(1 + 1 + 1) = 1/3, since AC covers two original objects; therefore, d(AC, D) = (W_e)/(AC + D) = 1*[(1/3)/(1 + 1)] = 1/6 = 0.167;

    • Similarly, d (AC, E) has the same value (i.e., 0.167).

  • Step 6. Now consider vertex B. There are two hyperedges containing vertices AC and B (i.e., 2*{AC, B}).

    • Considering {AC, B},

    • its weight is W_e = 1/|e| = 1/(1 + 1 + 1) = 1/3; therefore, d(AC, B) = 2*(W_e)/(AC + B) = 2*[(1/3)/(1 + 1)] = 1/3 = 0.333;

  • Step 7. Now consider vertex F. There are two hyperedges containing vertices AC and F (i.e., 1*{AC, F} and 1*{A, C, F}).

    • Case 1: considering {AC, F},

    • its weight is W_e = 1/|e| = 1/3; therefore, d(AC, F) = (W_e)/(AC + F) = 1*[(1/3)/(1 + 1)] = 1/6;

    • Case 2: considering {A, C, F},

    • its weight is W_e = 1/|e| = 1/(1 + 1 + 1) = 1/3; therefore, d(AC, F) = (W_e)/(AC + F) = 1*[(1/3)/(1 + 1)] = 1/6;

    • To sum up, d(AC, F) = 1/6 + 1/6 = 1/3 = 0.333;

The initial conditions are shown in Fig. 11.4, and the result after this first merge is shown in Fig. 11.5. A small code check of the scores computed in steps 1 to 4 is given below.
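
As a quick check of steps 1 to 4 (our own illustrative snippet, not the authors' code), the scores of Eq. (11.1) for vertex A can be recomputed from the eight hyperedges of Example 1, with unit object sizes and m(·) = 1; the output matches the values 0.25, 0.417, and 0.667 derived above.

import java.util.*;

/** Recomputes the step 1-4 similarity scores of Example 1 under Eq. (11.1). */
public class Example1Check {
    public static void main(String[] args) {
        List<Set<String>> edges = List.of(
                Set.of("A","B"), Set.of("A","C"), Set.of("A","D"), Set.of("A","E"),
                Set.of("A","F"), Set.of("A","C"), Set.of("B","C"), Set.of("A","C","F"));
        for (String v : List.of("B", "C", "D", "E", "F")) {
            double d = 0.0;
            for (Set<String> e : edges)
                if (e.contains("A") && e.contains(v))
                    d += (1.0 / e.size()) / 2.0;   // psi_e / (m(A) + m(v)), m = 1
            System.out.printf("d(A,%s) = %.3f%n", v, d);
        }
        // prints d(A,B) = d(A,D) = d(A,E) = 0.250, d(A,F) = 0.417, d(A,C) = 0.667,
        // so C is the closest object to A, as in Fig. 11.5.
    }
}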

Quantity-Based Jaccard Function

The Jaccard index [38], also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. However, when a quantity is associated with each element, this similarity cannot reflect the following situation. Consider three different frequent patterns P_1, P_2, and P_3: P_1 = {5A, 6B, C}, P_2 = {3A, 2B, 5C, 8D}, P_3 = {5C, 8D}. Taking P_1 as an example, 5A means there are five elements of object type A, 6B means there are six elements of object type B, and C means there is one element of object type C.

As shown in Fig. 11.6, the new Jaccard similarity mechanism (i.e., the quantity-based Jaccard similarity formula) captures more semantic meaning than the original one.

Fig. 11.6 Illustration of the two different concerns about the Jaccard mechanisms

Definition 1: Intra-distance measure (Co-occurrence)

Let P_1 and P_2 be two frequent patterns. We represent D(P_1, P_2) as the normalized difference between the cardinality of their union and the cardinality of their intersection:

$$ D(P_1, P_2) = 1 - \frac{\left| P_1 \cap P_2 \right|}{\left| P_1 \cup P_2 \right|} $$
(11.2)
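
The sketch below (ours, for illustration) computes Eq. (11.2) over element types and a quantity-aware variant for the example patterns above. The text does not spell out the quantity-based formula, so the variant here assumes multiset semantics: intersection takes the minimum count per type and union the maximum. Under this assumption, plain Jaccard rates P_1 closer to P_2 (distance 0.25), whereas the quantity-aware distance rates P_2 much closer to P_3 (about 0.28 versus 0.75), which is the kind of distinction the quantity-based mechanism is meant to capture.

import java.util.*;

/** Jaccard distance of Definition 1 and an assumed quantity-aware variant. */
public class JaccardDemo {

    // Plain Jaccard distance over element types, Eq. (11.2)
    static double jaccardDistance(Map<String, Integer> p1, Map<String, Integer> p2) {
        Set<String> inter = new HashSet<>(p1.keySet()); inter.retainAll(p2.keySet());
        Set<String> union = new HashSet<>(p1.keySet()); union.addAll(p2.keySet());
        return 1.0 - (double) inter.size() / union.size();
    }

    // Quantity-based variant (assumption): intersection uses min counts, union uses max counts
    static double quantityJaccardDistance(Map<String, Integer> p1, Map<String, Integer> p2) {
        Set<String> types = new HashSet<>(p1.keySet()); types.addAll(p2.keySet());
        double inter = 0, union = 0;
        for (String t : types) {
            int a = p1.getOrDefault(t, 0), b = p2.getOrDefault(t, 0);
            inter += Math.min(a, b);
            union += Math.max(a, b);
        }
        return 1.0 - inter / union;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = Map.of("A", 5, "B", 6, "C", 1);            // P1 = {5A, 6B, C}
        Map<String, Integer> p2 = Map.of("A", 3, "B", 2, "C", 5, "D", 8);    // P2
        Map<String, Integer> p3 = Map.of("C", 5, "D", 8);                    // P3
        System.out.println(jaccardDistance(p1, p2));          // 0.25
        System.out.println(quantityJaccardDistance(p1, p2));  // 0.75
        System.out.println(quantityJaccardDistance(p2, p3));  // ~0.28
    }
}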

Quantity-Based Jaccard Clustering Approach

Based on the above discussion, our Jaccard-based clustering algorithm is as follows.

Pattern Clustering Algorithm for Jaccard Function

// P is the set of frequent patterns. T is the set of clusters, and is set to empty initially.

  • Input: P and T.

  • Output: T

  • Begin

  • 1. FreqTable = {ft ij | the frequency of pattern i and pattern j co-existing in the database D};

  • 2. DistTable = {dt ij | the distance between of pattern i and pattern j in the database D};

  • 3.  C_1 = {C_i | initially, each pattern forms a single cluster};

  • 4.  // Set up the Intra-Similarity table for evaluation

  • 5.  M 1 = Intra-Similar (C 1, ∅);

  • 6.  k = 1;

  • 7.  while |C k | > n do Begin

  • 8.  C k+1 = PatternCluster (C k , M k , FreqTable, DistTable);

  • 9.   M k+1 = Intra-Similar (C k+1, M k );

  • 10.   k = k +1;

  • 11. End;

  • 12. return C k ;

  • 13. End;

Definition 2: View-radius

Inspired by the idea of view importance [33], we first propose a simple but effective distance measure, chosen to keep the computation cost low, for estimating the number of observable objects. Second, according to the distance threshold given by users, we define the view radius in order to choose representative objects. For comparison purposes, we divide the view radius into different intervals. The detailed data is shown in Table 11.3.

Table 11.3 Information on radius and average number of objects in one view
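
Definition 2 can be illustrated with a small, hypothetical Java sketch (the types and names below are ours, not from the paper): objects whose Euclidean distance to the viewpoint falls within the user-given radius are treated as observable, which is one way the per-radius object counts summarized in Table 11.3 could be gathered.

import java.util.*;

/** View-radius filter (illustrative): an object is observable when its
 *  Euclidean distance to the viewpoint is within the given radius. */
public class ViewRadiusFilter {
    record Point3D(double x, double y, double z) {}

    static List<String> observable(Map<String, Point3D> objects, Point3D view, double radius) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Point3D> e : objects.entrySet()) {
            Point3D p = e.getValue();
            double dx = p.x() - view.x(), dy = p.y() - view.y(), dz = p.z() - view.z();
            if (Math.sqrt(dx * dx + dy * dy + dz * dz) <= radius) result.add(e.getKey());
        }
        return result;
    }
}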

Experimental Evaluation

This section presents a series of experiments on the effectiveness of the predictive-prefetching mechanism in storage systems. We begin by explaining the experimental environment and setup. Next, under different constraints, we show that the HG-based clustering approach outperforms the other schemes. In addition, in order to compare against knowledge-based tree-like data structures [21], the Access Path Predictor (APP) mechanism was implemented for performance comparison.

The APP scheme shares some properties with our approach. For example, APP claims that the most common access pattern within a dataset is the frequent appearance of the same predecessor-current-successor patterns [21]; such a pattern represents a section of a path along which an application navigates through its data. As far as intra-relationships are concerned, this type of pattern is similar to our frequent sequential patterns. However, the APP scheme has a significant limitation. Correlated relationships cover both intra-relationships and inter-relationships [21, 30], yet the APP scheme considers only patterns inferred from intra-relationships hidden within the same path; inter-relationships across different paths are neglected. In contrast, our HG-based clustering approach considers both. Based on the above discussion, we adopt the APP scheme for performance comparison.

Implementation Setup

The experiments were conducted on a machine with a 1.6 GHz Pentium 4 processor, 1 GB main memory, running Microsoft Windows 2003 server. All algorithms were implemented in Java.

Results and Performance

We conducted several experiments to determine the performance of our proposed OHGC approach. In all four experiments, we test the performance of the proposed technique on our traversal trace data using the clustering measures mentioned above: the hypergraph-based similarity and the quantity-based Jaccard similarity.

Furthermore, we perform the same set of experiments for three other prediction techniques: the Without-Clustering approach, the APP approach, and the quantity-based Jaccard approach. The Without-Clustering approach is monitored and managed by the operating system, where prediction is deferred until an LRU-like mechanism takes effect. The APP approach uses a tree-like data structure and calculates a probability by counting the frequency of values and combinations of values in the historical data. The quantity-based Jaccard approach focuses on selecting the most similar frequent patterns for a given pattern by combining both the co-occurrence and quantity principles. The final output of the hypergraph-based clustering is a partition produced via the pattern-based hypergraph model, where OHGC determines the best partition for placing objects in the storage system for future accesses. This helps to explore the hidden relationships and to estimate future fine-grained placements, leveraging the most effective predictive mechanism for each situation. Note that HG-Clustering denotes our OHGC clustering scheme.

In particular, we focus on the following metrics: demanded total objects, response time (in ms), and number of retrieved files. The demanded total objects metric indicates the percentage of requests that can be met without accessing out-of-core storage. The response time metric is the elapsed time a clustering algorithm requires to load data from disk. The number of retrieved files metric reflects the effect of the correlated relationships.

We carried out experiments to compare the four algorithms on the traversal database, mainly in terms of total objects/total files, response time, and number of retrieved files. Moreover, we vary the support threshold (between 70% and 10%) and, similarly, the view radius threshold (between 2,000 and 12,000).

In the experiments on total objects/total files, Fig. 11.7 shows the demanded total objects for the algorithms over points in a spherical volume. We make the following observations: first, the number of representative semantic patterns found by OHGC is much larger than for the other three algorithms, which implies large reductions in access time during object retrieval; second, the OHGC algorithm obtains the dominating clusters, which include the most representative sequential patterns. In addition, to verify the effectiveness of OHGC, we also ran experiments on the number of retrieved files, shown in Figs. 11.8–11.10. Moreover, in the response-time experiments shown in Figs. 11.8 and 11.10, the response time of OHGC is much lower than that of the other three algorithms, and APP is very close to the time of the quantity-based Jaccard approach. This is because the clustering mechanism can accurately support prefetching objects for future use: not only is the access time cut down, but I/O efficiency is also improved.

Fig. 11.7 Comparison of different algorithms on the number of objects retrieved

Fig. 11.8 Comparison of different algorithms on system response time under different support thresholds

Fig. 11.9 Comparison of different algorithms on the number of files retrieved

Fig. 11.10 Comparison of different algorithms on system response time under different view radii

Conclusions and Future Works

This paper studies how to cluster frequent patterns from a traversal database effectively and efficiently. To the best of our knowledge, the problem of clustering semantic objects in 3D walkthrough systems has not been well studied in existing work.

Inspired by the idea of speeding up access to semantic patterns, we first design a novel hypergraph-based model to obtain high-quality clustering by measuring the associations that express the similarity between frequent patterns; the quantity-based Jaccard distance is also presented. Second, according to the distance threshold, we define meaningful clusterings in order to choose representative frequent patterns. Finally, since the problem of retrieving 3D objects is equivalent to minimizing the number of accesses to 3D objects, we develop an algorithm, OHGC, including an efficient partitioning strategy. On the other hand, the quantity-based Jaccard approach is more flexible owing to its easy adaptation and implementation. This work opens a path for introducing data mining concepts to discover hidden but valuable knowledge that improves system performance.