PNRMiner: a generic framework for mining interesting structured relational patterns
Abstract
Methods for local pattern mining are fragmented along two dimensions: the pattern syntax, and the data types on which they are applicable. Pattern syntaxes include subgroups, n-sets, itemsets, and many more; common data types include binary, categorical, and real-valued. Recent research on relational pattern mining has shown how the aforementioned pattern syntaxes can be unified in a single framework. However, a unified model to deal with various data types is lacking, certainly for more complexly structured types such as real numbers, time of day (which is circular), geographical location, terms from a taxonomy, etc. We introduce PNRMiner, a generic tool for mining interesting local patterns in (relational) data with structured attributes. We show how to handle the attribute structures in a generic manner, by modelling them as partial orders. We also derive an information-theoretic subjective interestingness measure for such patterns and present an algorithm to efficiently enumerate the patterns. We find that (1) PNRMiner finds patterns that are substantially more informative, (2) the new interestingness measure cannot be approximated using existing methods, and (3) we can leverage the partial orders to speed up enumeration.
Keywords
Data mining · Pattern mining · Information theory · Subjective interestingness · Relational data · Structured attributes

1 Introduction
Exploratory data mining (EDM) tools enable businesses and scientists to explore their data and find previously unknown patterns, which in turn helps them learn about reality, innovate, and gain a competitive edge. An important obstacle for the adoption of EDM techniques in general, and local pattern mining approaches in particular, is their limited flexibility in terms of the data types to which they can be applied, e.g., only tabular data, and the types of patterns they can generate, e.g., itemsets. In reality, however, data are often complexly structured (e.g., relational databases), and additionally there is often structure among the different values data attributes may attain, i.e., attribute values can be ordinal, interval, taxonomy terms, and more.
Local pattern mining has traditionally been rooted in categorical or even binary data, including algorithms for frequent itemset mining and variants [2], n-set mining [7], subgroup discovery [14], and multi-relational pattern mining [15, 24]. Some of these local pattern mining approaches have been extended in various ways to include ordinal, real-valued, or other data structures. For example, extensions of itemset mining to real-valued data have led to approaches akin to biclustering, and subgroup discovery methods exist that allow discovery of rules based on attribute-value inequalities.
However, that work is fragmented and often ad hoc, in the sense that other kinds of structure (taxonomy terms, time-of-day intervals on a circular 24-hour clock, geographical regions on the globe, etc.) may not be approachable in the same way and may necessitate fundamentally different approaches. The purpose of this paper is to provide an elegant and encompassing framework to deal with attributes of any of the structured types listed above and more, and this in a relational setting, i.e., applicable to data as it resides in relational databases. To illustrate the breadth and nature of the contributions, we provide two motivating examples.
Example 1
Consider a dataset of Foursquare^{1} check-in times of a number of users. Such a dataset has the potential of elucidating lifestyle patterns shared by a number of Foursquare users. To formalise and then find such patterns, it is tempting to specify a time resolution and discretise the data. However, it is unclear which discretisation level to use, and whether to take it uniform throughout the day. In fact, the optimal discretisation could vary for different lifestyle patterns.
An alternative approach could be to take the mean and possibly higher-order statistics of the check-in times for each user and find patterns in this summary description. This approach would suffer from two problems: first, computing averages of circular quantities is ambiguous (e.g. is the mean of 6 am and 6 pm midnight or noon?), and second, it ignores much of the information in the data.
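The circular-mean ambiguity is easy to demonstrate numerically. The sketch below (illustrative Python, not part of PNRMiner) computes the circular mean of times-of-day via the resultant unit vector; for 6 am and 6 pm the resultant has length zero, so no meaningful mean exists.

```python
import math

def circular_mean_hours(hours):
    """Mean time-of-day on the 24-hour circle, via the resultant unit
    vector. Returns (mean_hour, resultant_length); when the resultant
    length is (near) zero, the mean is undefined."""
    angles = [2 * math.pi * h / 24 for h in hours]
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    mean_hour = (24 * math.atan2(y, x) / (2 * math.pi)) % 24
    return mean_hour, math.hypot(x, y)

# 6 am and 6 pm are antipodal: the resultant length is ~0, so the mean
# ("midnight or noon?") is genuinely undefined
print(circular_mean_hours([6, 18])[1])

# 1 am and 3 am average to 2 am, as expected
print(round(circular_mean_hours([1, 3])[0], 6))  # 2.0
```

By contrast, the arithmetic mean of 6 and 18 is 12 (noon), while 23 and 1 would wrongly average to noon as well; the circular mean handles the wrap-around correctly.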
The method developed in this paper, when applied to this data, deems as most interesting a pattern revealing that 1.6% of all users check in frequently in the 6 am–7 am interval and again in the 10.10–10.50 am interval. Here, the interval sizes are tuned automatically to maximise interestingness, and the intervals can be of varying size even within a pattern.
While this example illustrates how the contribution in this paper advances the state-of-the-art even for a single relation (between users and check-in times), the second example shows the full power on data in a relational database.
Example 2
Consider a relational database of users, who have rated books with an integer from 1 to 5, and where the books are tagged with a number of genres organised in a taxonomy. Applied to this dataset, the method proposed in this paper identifies interesting patterns in the form of sets of books that have been rated by the same set of users in a similar way (say, in the interval from 3 up to 5), which may all belong to a particular set of genres (e.g., fantasy and action).
This second example illustrates the ability of the proposed method to identify patterns that span several types of entities (users, ratings, books, genres), including structured entity types such as ordinal values or values organised in a taxonomy.

We formalise the problem and a matching pattern syntax, in a manner as generic as possible (Sect. 2). To achieve this, we adopt an abstract formalisation in terms of a partial order over the structured values. For example, with the time-of-day and book ratings, the partial order is over the intervals, where one is ‘smaller’ than another if it is included in it. For taxonomy terms, one taxonomy term is smaller than another if it is a specialisation of it.
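To make the partial-order view concrete, here is a small illustrative sketch (hypothetical Python with made-up function names, not PNRMiner's API) of the order \(\succeq\) for two of the structured types mentioned: interval containment and taxonomy specialisation.

```python
def interval_geq(a, b):
    """a >= b in the partial order iff interval b is contained in
    interval a; intervals are (lo, hi) pairs."""
    return a[0] <= b[0] and b[1] <= a[1]

def taxonomy_geq(parent_of, a, b):
    """a >= b in the partial order iff a is an ancestor of (or equal to)
    b in the taxonomy, given as a child -> parent mapping."""
    while b is not None:
        if a == b:
            return True
        b = parent_of.get(b)
    return False

# toy taxonomy: fantasy and action specialise fiction, which specialises books
parents = {"fantasy": "fiction", "action": "fiction", "fiction": "books"}
print(interval_geq((1, 3), (1, 2)))                 # True: [1,3] contains [1,2]
print(taxonomy_geq(parents, "fiction", "fantasy"))  # True
```

Both instances define a reflexive, antisymmetric, transitive relation, which is all the framework requires of the structured attribute.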

We formalise the interestingness of such patterns under the information-theoretic framework for subjective interestingness known as FORSIED [9, 11]. This is a non-trivial contribution over the approach applicable for the NRMiner pattern syntax, because the presence of entities is no longer independent (Sect. 3).

We provide an algorithm for efficiently enumerating all such patterns (Sect. 4). This is a non-trivial extension of the algorithmic approach used in NRMiner, necessitated by the additional structure in the search space. However, we also prove that under the algorithmic framework used here (due to [1]), no algorithm can exist that uses only a polynomial number of steps per output. This result is new and also applies to earlier works.
2 Problem formalisation
2.1 Notation
We formalise a relational database as follows. Let E denote the set of entities, that is all possible values of all attributes, and \(t: E \rightarrow \{1,\ldots ,k\}\) a function that gives the type of an entity (assuming k types). We write \(\mathcal {R}\) to denote the set of all relationship instances in the database, while \(R\subseteq \{1,\ldots ,k\}\times \{1,\ldots ,k\}\) denotes the set of tuples of entity types whose entities may have relationships, according to the schema of the database. The elements of R will be referred to as the relationship types. A relational database is then a tuple \(\mathcal {D}= (E,t,\mathcal {R},R,\succeq )\), where \(\succeq \) will be introduced below.
To model such structure, we consider one additional element in the data model: a partial order \(\succeq \) that represents implication of relationships across entities of the same type. That is, \(e \succeq f\) means that if any entity g is related to f, i.e., \((g,f)\in \mathcal {R}\), then g is also related to e: \(\forall e,f,g \in E: e \succeq f \wedge (g,f)\in \mathcal {R}\Rightarrow (g,e) \in \mathcal {R}.\) Only implications between entities of the same type are allowed: \(e \succeq f \Rightarrow t(e) = t(f)\).
For example, in Fig. 2, we have \([1,2] \succeq 1\), \([1,3] \succeq [1,2]\), etc. This means that if an entity is connected to 1, it is also connected to \([1,2]\), \([1,3]\), and \([1,4]\). For notational convenience, we assume that \(\mathcal {R}\) contains both the relationship instances present in the database and all relationship instances implied by \(\succeq \). In practice, we need not store these implied edges explicitly; details on this are presented in Sect. 5.
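The implication rule can be made concrete with a small sketch (illustrative Python; as noted, the actual implementation keeps implied edges implicit): given the explicit relationship instances and the pairs \(e \succeq f\), we close the instance set under \(e \succeq f \wedge (g,f)\in \mathcal {R}\Rightarrow (g,e)\in \mathcal {R}\).

```python
def implied_instances(R, geq_pairs):
    """Close a set of relationship instances under the implications of the
    partial order. R: set of (g, f) pairs; geq_pairs: set of (e, f) pairs
    meaning e >= f. If e >= f and (g, f) is present, (g, e) must be too.
    Naive fixpoint iteration; a sketch, not the paper's implementation.
    """
    closed = set(R)
    changed = True
    while changed:
        changed = False
        for (e, f) in geq_pairs:
            for (g, h) in list(closed):
                if h == f and (g, e) not in closed:
                    closed.add((g, e))
                    changed = True
    return closed

# a user related to the value 1 is implicitly related to [1,2] and,
# transitively, to [1,3]
R = {("u1", "1")}
geq = {("[1,2]", "1"), ("[1,3]", "[1,2]")}
print(sorted(implied_instances(R, geq)))
```

The rule as quoted acts on the second coordinate of an instance; in the full model each entity type carries its own partial order, applied analogously.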
2.2 Pattern syntax
Our general aim is to find interesting sets of entities. We propose that the interestingness of a set of entities can be measured by contrasting the number of relationship instances present between the entities with the expected number of relationship instances present between those entities, where the expectation is subjective, i.e., dependent on the user. We formalise this subjective interestingness in Sect. 3. For now it suffices to know that it will depend on the number of relationship instances between the entities in the set.
We use a tiered approach to achieve our general aim. First, we enumerate all dense patterns that are potentially interesting, and secondly we rank them by interestingness. Hence, the first step is to find sets of entities that have many relationships. We will refer to a set of entities and the relationship instances among them as a pattern. We define a pattern as potentially interesting if it is complete, connected, maximal, and proper.
Definition 1
A pattern F is complete iff all relationship instances between entities in F that are allowed by the database schema are also present.
Definition 2
A set of entities F with \(|F| \ge 2\) is connected iff there is a path between any two entities in F using only entities in F. Any F with \(|F| \le 1\) is connected.
Definition 3
A pattern F is proper iff all super-entities of any entity in F are also in F.
Definition 4
Finally, a pattern F is maximal iff no entity can be added without breaking completeness or connectedness. Note that if there is an entity \(e \in E \setminus F\) such that \(F \cup \{e\}\) is complete and connected, there must also be an entity \(f \succeq e, f \in E \setminus F\) such that \(F \cup \{f\}\) is complete, connected, and proper. We refer to sets that are complete, connected, and proper as complete connected proper subsets (CCPSs), and to sets that are also maximal as maximal CCPSs. In Sect. 4, we will show that we can enumerate all maximal CCPSs using the so-called fixpoint-enumeration algorithm.
In short, we add a properness constraint; the pattern syntax is otherwise equivalent to that of [23, 24]. Our implementation and theory also support n-ary relationships, but we do not discuss this further in order to prevent unnecessary complications in the exposition. One could also consider approximate patterns by discarding the completeness constraint. This would lead to an increased computational complexity, but the increase has been shown to be manageable [22]. For simplicity, we do not consider approximate patterns in this paper.
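For concreteness, the three defining properties can be checked with straightforward routines such as the following (an illustrative Python sketch with hypothetical data encodings, not the enumeration algorithm itself).

```python
from itertools import combinations

def is_complete(F, R, t, schema):
    """All schema-allowed relationship instances between entities of F
    are present. R holds instances as pairs (either orientation)."""
    return all((e, f) in R or (f, e) in R
               for e, f in combinations(F, 2)
               if (t[e], t[f]) in schema or (t[f], t[e]) in schema)

def is_connected(F, R):
    """There is a path between any two entities of F using only F."""
    if len(F) <= 1:
        return True
    F = set(F)
    seen, stack = set(), [next(iter(F))]
    while stack:
        e = stack.pop()
        if e in seen:
            continue
        seen.add(e)
        stack.extend(f for f in F
                     if f not in seen and ((e, f) in R or (f, e) in R))
    return seen == F

def is_proper(F, geq):
    """All super-entities (geq holds (bigger, smaller) pairs) of any
    entity in F are also in F."""
    F = set(F)
    return all(e in F for (e, f) in geq if f in F)

# toy data: a user related to the value 1 and the interval [1,2]
R = {("u1", "1"), ("u1", "[1,2]")}
t = {"u1": 1, "1": 2, "[1,2]": 2}
schema = {(1, 2)}
geq = {("[1,2]", "1")}
print(is_proper({"u1", "1", "[1,2]"}, geq), is_proper({"u1", "1"}, geq))
```

A set failing `is_proper` (such as `{"u1", "1"}` above, which omits the super-entity `[1,2]`) is excluded from the search space by construction.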
3 Interestingness
3.1 General approach

Following [9, 11], the interestingness of a pattern is quantified by trading off two quantities:

the self-information of the pattern, defined as minus the logarithm of the probability that the pattern is present under the background distribution, and

the description length of the pattern, which should formalise the amount of effort the user needs to expend to assimilate the pattern.
In [24], this framework is used successfully to formalise the interestingness of Complete Connected Subsets (CCSs), without the properness requirement that lies at the core of the contributions in this paper. The properness requirement creates an opportunity as well as a non-trivial challenge. It makes it possible to describe single patterns that capture information that could previously be presented only as a set of patterns. Such patterns reduce the description length. On the other hand, it is more difficult to compute the self-information of a pattern. We briefly discuss the core principles in the next paragraphs, before discussing the computation of the self-information in greater detail in Sect. 3.5.
3.2 Description length
3.3 Information content
The central idea of FORSIED is to quantify the amount of information that a pattern conveys to a user, which in general terms is known as the information content of a pattern. The most interesting pattern is then the one that conveys the most information, i.e., that maximally reduces the uncertainty the user has about the data [9]. The self-information of a pattern quantifies the unexpectedness of that pattern, given a background distribution. We present the technical details of the self-information and the background distribution below.
In the following section, we argue that the background distribution can be fitted in exactly the same way as in [24]. However, how to compute the probability that a given pattern is present, and thus its self-information, is not trivial. The difficulty stems from the fact that the presence of relationship instances is now dependent, owing to the partial order relation over the entities. Nonetheless, Sect. 3.5 describes how the probabilities can still be computed effectively using the inclusion–exclusion principle.
3.4 The background distribution
In [24], interestingness is formalised under the assumption that users have prior beliefs on the number of entities of a specific type to which a given entity is related. It is argued that this is often a good assumption, and the experiments in the current paper also support that.^{2} This assumption leads to a tractable distribution, under which the relationship instances are independent with probabilities that can be found by solving an efficiently solvable convex optimisation problem.
This background distribution factorises over the different relationship types, such that the selfinformation can be decomposed into a sum of different contributions, each one of which corresponds to the relationship instances for one particular relationship type. That is also the case in the present paper, such that in the rest of this exposition it suffices to imagine just a single relationship type.
What is new is that we implicitly make a further assumption on the user’s knowledge state, namely that the user knows the partial order \(\succeq \), and hence the fact that if \(e\succeq f \wedge (g,f)\in \mathcal {R}\), then \((g,e) \in \mathcal {R}\). This creates hard-to-handle dependencies between the presence of relationship instances. In practice, data will often only contain relationship instances between minimal entities, i.e., entities that are minimal in the partial order \(\succeq \). In this case, the background distribution can be fitted on the set of minimal entities without worrying about the dependencies, exactly as done in [24].
In particular, we assume prior beliefs on the number of relationship instances each (minimal) entity is involved in, for every relationship type. The maximum entropy distribution subject to these prior belief constraints is then used as the background distribution. This background distribution is a product of Bernoulli distributions with one factor for each possible relationship instance [24]. In other words: for each possible relationship instance (e, f), the distribution gives us a probability \(p_{(e,f)}\) that (e, f) is present in the data.
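Such a product-of-Bernoullis maximum entropy model has a simple parametric form: one multiplier per prior-belief constraint, with \(p_{(e,f)}\) given by a logistic function of the sum of the corresponding multipliers. The following sketch (illustrative Python with plain gradient ascent on the dual; the paper does not specify this particular solver) fits such a model for a single binary relationship type, with prior beliefs equal to the observed row and column sums.

```python
import math

def sig(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_background(D, iters=3000, lr=0.2):
    """Fit a maximum-entropy background distribution for one binary
    relationship type D (list of 0/1 lists), with prior beliefs equal to
    the observed row and column sums. The max-ent model is a product of
    Bernoullis with p_ij = sig(lam[i] + mu[j]) (cf. [24]); the multipliers
    are fitted so that expected row/column sums match the beliefs.
    Illustrative sketch, not PNRMiner's solver.
    """
    n, m = len(D), len(D[0])
    row = [sum(r) for r in D]
    col = [sum(D[i][j] for i in range(n)) for j in range(m)]
    lam, mu = [0.0] * n, [0.0] * m
    for _ in range(iters):
        P = [[sig(lam[i] + mu[j]) for j in range(m)] for i in range(n)]
        for i in range(n):
            lam[i] += lr * (row[i] - sum(P[i]))
        for j in range(m):
            mu[j] += lr * (col[j] - sum(P[i][j] for i in range(n)))
    return [[sig(lam[i] + mu[j]) for j in range(m)] for i in range(n)]
```

With all believed row and column sums strictly between zero and the corresponding dimension, the multipliers stay finite and the dual objective is concave, so plain gradient ascent converges.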
This background distribution defines the probabilities \(p_{(e,f)}\) of relationship instances between minimal entities e and f. Given this, it is possible to compute the probability \(p_{(e,f)}\) of any relationship instance (e, f), whether minimal or not, as the probability of presence of any of the relationship instances \((e',f')\) with \(e\succeq e'\) and \(f\succeq f'\) and \(e'\) and \(f'\) minimal. Indeed, the presence of any such \((e',f')\) would imply the presence of (e, f). How this probability and the overall probability of a CCPS pattern can be computed given the background distribution is the subject of Sect. 3.5.
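The union computation just described is straightforward under independence of the minimal instances: the non-minimal instance is absent iff every implying minimal instance is absent. A minimal sketch (illustrative Python):

```python
def p_nonminimal(p_min, implied):
    """Probability that a relationship instance (e, f) is present, given
    independent probabilities p_min[(e', f')] of the minimal instances and
    the list `implied` of minimal (e', f') with e >= e' and f >= f'.
    (e, f) is present iff at least one implying minimal instance is:
    P = 1 - prod(1 - p) over the implying minimal instances.
    """
    q = 1.0
    for ef in implied:
        q *= 1.0 - p_min[ef]
    return 1.0 - q

# toy probabilities for two minimal instances implying the same (e, f)
p_min = {("a1", "b1"): 0.2, ("a2", "b1"): 0.5}
print(p_nonminimal(p_min, [("a1", "b1"), ("a2", "b1")]))  # 1 - 0.8 * 0.5 = 0.6
```

This handles a single instance in isolation; the joint probability of several such instances, whose implying sets may overlap, requires the inclusion–exclusion computation of Sect. 3.5.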
In general, for data that include relationship instances between non-minimal entities, let us define a partial order \(\succeq _\mathcal {R}\) over the relationship instances as follows: \((e_1,f_1)\succeq _\mathcal {R}(e_2,f_2)\) iff \(e_1\succeq e_2\) and \(f_1\succeq f_2\). Then, we suggest fitting the background distribution as before on the minimal relationship instances only. This includes the approach from the previous paragraph as a special case. This model is imperfect, as the user should be aware of negative dependencies between relationship instances being minimal: if \((e_2,f_2)\) is a minimal relationship instance, then neither \((e_1,f_1)\) with \((e_1,f_1)\succeq _\mathcal {R}(e_2,f_2)\) nor \((e_3,f_3)\) with \((e_3,f_3)\preceq _\mathcal {R}(e_2,f_2)\) can be a minimal relationship instance. Yet, we argue that in this case, assuming independence is nonetheless still a good approximation.^{3}
3.5 Self-information
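As an illustration of the inclusion–exclusion computation mentioned in Sect. 3.3, consider two non-minimal instances whose sets of implying minimal instances overlap, making their presences dependent. The sketch below (illustrative Python with toy probabilities) computes their joint presence probability by inclusion–exclusion and verifies it against brute-force enumeration.

```python
from itertools import product

# minimal instances m1..m3 with independent presence probabilities;
# non-minimal instance A is implied by {m1, m2}, B by {m2, m3}; they
# share m2, so the presences of A and B are dependent (toy numbers)
p = {"m1": 0.3, "m2": 0.5, "m3": 0.7}
A, B = {"m1", "m2"}, {"m2", "m3"}

def p_none(S):
    """Probability that none of the minimal instances in S is present."""
    out = 1.0
    for m in S:
        out *= 1.0 - p[m]
    return out

# inclusion-exclusion: P(A and B) = 1 - P(no A) - P(no B) + P(no A, no B)
p_ie = 1.0 - p_none(A) - p_none(B) + p_none(A | B)

# brute force over all minimal-instance configurations as a check
p_bf = 0.0
for bits in product([0, 1], repeat=3):
    cfg = dict(zip(["m1", "m2", "m3"], bits))
    pr = 1.0
    for m, b in cfg.items():
        pr *= p[m] if b else 1.0 - p[m]
    if any(cfg[m] for m in A) and any(cfg[m] for m in B):
        p_bf += pr

print(round(p_ie, 6), round(p_bf, 6))  # 0.605 0.605
```

The self-information of the joint event is then simply \(-\log\) of this probability; the inclusion–exclusion form avoids enumerating all minimal-instance configurations.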
4 Enumeration algorithm
Last but not least, we study how to efficiently enumerate all maximal CCPSs. Like previous work on mining interesting patterns in relational data [22, 23, 24], our algorithm is based on the fixpoint-enumeration algorithm by Boley et al. [1]. Although that algorithm already exists, it should be noted that it is a meta-algorithm that does not directly operate on the data. The fixpoint-enumeration algorithm takes as input a set system and a closure operator that together define the problem setting and the output. The definitions are given below.
We first introduce the fixpoint-enumeration algorithm, after which we introduce notation and formalise our practical problem of enumerating maximal CCPSs as a problem of enumerating all fixpoints in a set system. We prove that the introduced set system is strongly accessible, which is required for fixpoint enumeration to be applicable, and we present a suitable closure operator. Finally, we analyse the computational complexity and prove that, unfortunately, the delay time between two maximal CCPSs cannot be guaranteed polynomial under this scheme.
4.1 The enumeration algorithm
 (1)
Start with the empty set: \(F := \emptyset \).
 (2)
Compute the closure of the current set: \(F := \sigma (F)\). This closure is one of the fixpoints to return.
 (3)
If the current set can be extended, that is, \(\exists G \supsetneq F : G \in \mathcal {F}\), then pick any element \(f \in G \setminus F\) such that \(F \cup \{f\} \in \mathcal {F}\) and recurse from (2) into one branch where every set contains f and one branch where no set contains f. If the current set cannot be extended, this branch ends.
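The three steps above can be sketched generically as follows (illustrative Python; `in_F` is a membership oracle for the set system and `sigma` the closure operator, both problem-specific inputs, and PNRMiner's actual version adds pruning and constraint handling).

```python
def fixpoints(E, in_F, sigma):
    """Sketch of the fixpoint-enumeration meta-algorithm of Boley et al.
    for a strongly accessible set system. `in_F(S)` tests membership in
    the set system; `sigma` is the closure operator. Each fixpoint
    reachable from the empty set is yielded exactly once.
    """
    def rec(F, B):
        # augmentation elements: not in F, not banned, and valid to add
        ext = [e for e in E if e not in F and e not in B and in_F(F | {e})]
        if not ext:
            return
        e = ext[0]
        C = sigma(F | {e})
        if not (C & B):                 # closure must avoid banned elements
            yield C                     # branch 1: every set contains e
            yield from rec(C, B)
        yield from rec(F, B | {e})      # branch 2: no set contains e
    F0 = sigma(frozenset())
    yield F0
    yield from rec(F0, frozenset())

# with the powerset as set system and the identity closure, this simply
# enumerates all subsets of {1, 2, 3}, each exactly once
res = list(fixpoints([1, 2, 3], lambda S: True, lambda S: S))
print(len(res))  # 8
```

The banned set B implements the "no set contains f" branch; checking that the closure avoids B is what prevents duplicate enumeration.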
4.2 Enumerating CCPSs
4.3 Strong accessibility
Theorem 1
\((E,\mathcal {F})\) is strongly accessible.
Proof
We prove each of the two properties separately, but first we introduce some notation for convenience. Let \((F,\succeq )\) denote the set F partially ordered by \(\succeq \). We write that an entity \(e \in F\) is minimal in \((F,\succeq )\) iff \(\not \exists f \in F, e \ne f, e \succeq f\). Likewise an entity \(e \in F\) is maximal in \((F,\succeq )\) iff \(\not \exists f \in F, e \ne f, f \succeq e\).
 (1) \(\forall F \in \mathcal {F}\setminus \{\emptyset \}: \exists e \in F: F \setminus \{e\} \in \mathcal {F}\), because:

Removing an entity never violates completeness.

Any entity \(e \in F\) that is minimal in \((F,\succeq )\) can be removed without breaking properness, and such a minimal entity always exists (F is finite).

If \(\exists e,f \in F, e \succeq f, f \text { minimal in } (F,\succeq )\), then \(F \setminus \{f\} \in \mathcal {F}\): \(F \setminus \{f\}\) is complete and proper (see the two previous statements), and since F is connected and \(e \succeq f\), for any \((g,f) \in \mathcal {R}\) also \((g,e) \in \mathcal {R}\), thus \(F \setminus \{f\}\) is also connected.

If \(\not \exists e,f \in F, e \succeq f, f \text { minimal in } (F,\succeq )\), then \(\forall e \in F: e \text { minimal in } (F,\succeq )\). Hence, removal of any entity breaks neither completeness nor properness. Model the entities of F as nodes in a graph and the relationship instances between entities in F as its edges. Since F is connected, that graph is connected; any connected graph has a spanning tree, and any leaf of that spanning tree can be removed without breaking connectedness.

 (2) \(\forall F, F' \in \mathcal {F}, F \subset F' : \exists e \in F' \setminus F : F \cup \{e\} \in \mathcal {F}\). That is, for any pair of CCPSs \(F, F' \in \mathcal {F}\) with \(F \subset F'\), there is an entity \(e \in F' \setminus F\) that can be added to F to yield another CCPS \(F \cup \{e\} \in \mathcal {F}\). We prove this by considering the entity types of the entities in \(F' \setminus F\), conditioning on whether F is empty or already contains some entities, because:

Let \(t(F) = \{ t_j \mid t_j = t(e), e \in F \}\). For every type \(t_j \in t(F' \setminus F)\), there exists \(e \in F' \setminus F\) with \(t(e) = t_j\) that is maximal in \((F' \setminus F, \succeq )\), since \(F' \setminus F\) is finite.

If \(F = \emptyset \), then for any \(e \in F' \setminus F\) that is maximal in \((F' \setminus F, \succeq )\): \(F \cup \{e\} \in \mathcal {F}\).

If \(F \ne \emptyset \), then because every entity type has one or more maximal elements and \(F'\) is connected, there is a type adjacent to or present in F that includes an entity \(e \text { maximal in } (F' \setminus F, \succeq )\), and then \(F \cup \{e\}\) is complete, connected, and proper.\(\square \)

4.4 The closure operator
Strong accessibility implies that we can efficiently enumerate all fixpoints in \(\mathcal {F}\) in a single traversal over the set system without considering any set twice [1]. A trivial choice for the fixpoints would be all sets in \(\mathcal {F}\), in which case \(\sigma (F) = F\text {, }\forall F\in \mathcal {F}\). However, in the worst case the number of CCPSs in \(\mathcal {F}\) is exponential in \(|E|\), while there may be only one maximal CCPS. Hence, we would like to choose the set of fixpoints such that it includes all maximal CCPSs and as few other CCPSs as possible. It is not possible to choose the closure operator such that we enumerate only maximal CCPSs, because a CCPS may have multiple maximal extensions.
In [24], it is assumed that the dataset does not contain any entity e that is related to all entities of a neighbouring type, because if such an entity exists, all other entities could be in its set of compatible entities (\(Comp(\{e\}) = E\)), hence \(\sigma (\emptyset ) \supseteq \{e\}\), while e need not be part of every CCS. Thus, this assumption is required for the closure operator to be monotonic.
In the current setting, entities that are fully connected to a neighbouring type are not uncommon, so this assumption is not reasonable. For example, there could be a catch-all entity in a hierarchical attribute. Hence, we additionally define \(\sigma (\emptyset ) = \emptyset \). Alternatively, one could redefine Comp as \({{\mathrm{Comp}}}(F) = \{e \in E \mid \exists G \supseteq F \cup \{e\}, G \in \mathcal {F}\}\), but we leave that to future work. For brevity, we omit the proof that this \(\sigma \) is a closure operator.
4.5 Final remarks
The fixpoint-enumeration algorithm enumerates all fixpoints, i.e., any set that results from applying the closure operator. We are only interested in maximal CCPSs, so we output only those. Maximal CCPSs are easily identified at runtime, as they are fixpoints to which no entity can be added (Sect. 2, Definition 4).
Finally, we allow a user to put any number of constraints on the set of patterns in the form “any pattern should include at least X entities of type Y”. We implement this by continuously computing upper bounds during the mining process, such that we can prune any branch where the constraints cannot be satisfied any more. A similar approach is followed in [24].
4.6 Computational complexity
As stated previously, the number of maximal CCPSs can be exponential in \(|E|\). Since PNRMiner exhaustively enumerates all maximal CCPSs, the worst-case complexity of PNRMiner is also exponential in \(|E|\). Unfortunately, we are not aware of an upper bound on the number of maximal CCPSs, nor do we know the exact worst-case complexity of our algorithm.
It has been shown that the delay time between finding two closed CCSs using the fixpoint-enumeration algorithm is \(O(|E|^3)\) [24]. The algorithm used here is almost the same, except that computing the set of augmentation entities also involves checking the properness constraint. The complexity of that check is \(O(|E|)\), hence the delay time for closed CCPSs is also \(O(|E|^3)\).
It was previously not known whether the delay time between the enumeration of two maximal CC(P)Ss is always polynomial. Although we cannot make a general statement about the delay time, we prove here that the fixpoint-enumeration algorithm can indeed require a number of steps exponential in the number of outputs. We prove this by means of an example data set in which the number of closed CCSs is exponential in the number of maximal CCSs, and indeed already exponential in the size of the input.
Theorem 2
No algorithm that is an instantiation of fixpoint enumeration can guarantee a polynomial number of steps in the number of outputs (maximal CCSs).
Proof
Consider a database with entity types A and B and a single binary relationship type between the two; \(R = \{(A, B)\}\). Let both entity types have n entities, numbered \(a_1, a_2, \ldots , a_n\) and \(b_1, b_2, \ldots , b_n\). Let the set of relationship instances contain all pairs \((a_i, b_j), i, j \in [1, n], i \ne j\). That is, all possible relationship instances exist, except between entities \(a_i\) and \(b_i\) with the same index.
The number of maximal CCSs follows fairly straightforwardly: all CCSs can be extended until they have n entities, and for each index i we can include either \(a_i\) or \(b_i\). Including both would violate completeness, while the CCS is not maximal as long as neither is included for some index i. This would lead to \(2^n\) maximal CCSs, except that including all entities of A, or all entities of B, is not valid: this violates connectedness. Hence, there are \(2^n-2\) maximal CCSs.
The number of closed CCSs is only slightly more involved: notice that \(2^n-2 = \sum _{i=1}^{n-1} \binom{n}{i}\), which highlights that the number of maximal CCSs is indeed the number of ways to pick \(1,\ldots ,n-1\) entities from A, each of which forms a unique maximal CCS when augmented with the remaining entities from B. The second observation that we can use to derive the number of closed CCSs is that for this data every CCS is closed, because any entity that we add (\(a_i\) or \(b_i\)) reduces the set of compatible entities by one. Hence, the closure of every CCS is that CCS itself.
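For small n, the \(2^n-2\) count can be verified by brute force. The sketch below (illustrative Python) enumerates all complete connected subsets of the example database directly from the definitions and counts the maximal ones.

```python
from itertools import combinations

def maximal_ccs(n):
    """Brute-force the maximal CCSs of the example database in the proof:
    entities a_0..a_{n-1} and b_0..b_{n-1}, with relationship instances
    (a_i, b_j) for all i != j. Exponential; small n only."""
    A = [("a", i) for i in range(n)]
    B = [("b", i) for i in range(n)]
    E = A + B
    rel = {frozenset({a, b}) for a in A for b in B if a[1] != b[1]}

    def complete(F):
        # all allowed a-b pairs within F must be present
        return all(frozenset({a, b}) in rel
                   for a in F for b in F
                   if a[0] == "a" and b[0] == "b")

    def connected(F):
        if len(F) <= 1:
            return True
        F = set(F)
        seen, stack = set(), [next(iter(F))]
        while stack:
            e = stack.pop()
            if e in seen:
                continue
            seen.add(e)
            stack += [f for f in F if frozenset({e, f}) in rel]
        return seen == F

    ccs = [frozenset(S) for r in range(len(E) + 1)
           for S in combinations(E, r)
           if complete(S) and connected(S)]
    # maximal: no entity can be added while staying complete and connected
    return [F for F in ccs
            if not any(complete(F | {e}) and connected(F | {e})
                       for e in E if e not in F)]

print(len(maximal_ccs(3)))  # 6 = 2**3 - 2
```

The same brute force also confirms that every non-empty CCS here is its own closure, since adding any entity shrinks the compatible set.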
Finally, notice that, as concluded previously (Sect. 4.4), regardless of the definition of the closure operator, it cannot add any entities to a set F unless they are part of every maximal CCPS that is a superset of F. This implies that the closure operator defined here is indeed optimal for any database involving only one relationship type. Hence, this proof holds for any instantiation of fixpoint enumeration. \(\square \)
Notice that this proof is stated for CCSs; since the properness constraint is vacuous when the data contain no partial order, the proof is also valid for CCPSs, as well as for all other RMiner variants.
5 Implementation
We implemented the full program in C++; making the implementation efficient turned out to be surprisingly difficult, the main challenge being the efficiency of the enumeration algorithm. To facilitate understanding and reproduction of the tool, we provide full pseudocode here (Algorithms 1–4). The full source code is available at https://bitbucket.org/BristolDataScience/pnrminer. Our implementation is based on NRMiner [23] and the pseudocode is partly based on the description in [21].
The main function is PNRMiner (Algorithm 1), which takes as arguments four sets of entities and a list of entity types. Entity set F contains the entities whose supersets are to be enumerated in the current branch; this set is constructed via branching and the closure operator. Entity set B contains the entities all of whose supersets have already been enumerated; this set is used for pruning. Entity sets C and A and the entity-type list types are passed on for efficiency; these are the compatible entities, the augmentation entities, and the types adjacent to F. The initial call is PNRMiner(\(\emptyset ,\emptyset ,E,E,\emptyset \)).
For the computation of the compatible entities, we present pseudocode for the general n-ary case (Algorithm 3). Compute_Comp takes as arguments two sets of entities: G contains the entities to verify for compatibility, and F is the set of entities to check compatibility against. The routine considers each entity \(e \in G\) separately (line 3). Then, compatibility with F is checked for each relationship type that e can participate in (line 5). If the check fails for any relationship type, e is not compatible with F; the routine breaks (lines 7–9) and continues from line 3. Line 2 contains an optimisation that is explained below, after introducing Is_Comp.
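The control flow just described can be rendered as follows (a hypothetical Python sketch mirroring the description; `participates` and `is_comp` are assumed helper predicates standing in for the paper's Is_Comp machinery, and the real implementation is C++ and differs in detail).

```python
def compute_comp(G, F, rel_types, participates, is_comp):
    """Sketch of Compute_Comp: return the subset of G compatible with F.
    `participates(e, r)` tells whether entity e can take part in
    relationship type r; `is_comp(e, F, r)` checks e's compatibility with
    F for type r. Both predicates are hypothetical stand-ins.
    """
    out = set()
    for e in G:                       # consider each entity separately
        for r in rel_types:
            if not participates(e, r):
                continue
            if not is_comp(e, F, r):  # one failing type rules e out
                break
        else:
            out.add(e)                # compatible for every applicable type
    return out

# toy usage: entity "x" fails the compatibility check, "a" passes
res = compute_comp({"a", "x"}, set(), [0],
                   lambda e, r: True,
                   lambda e, F, r: e != "x")
print(res)
```

The early `break` mirrors lines 7–9 of Algorithm 3: a single failing relationship type suffices to discard an entity without testing the remaining types.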
6 Case studies
 1.
Can we find patterns that are more interesting?
 2.
Is the new interestingness score relevant?
 3.
Is the new enumeration algorithm faster?
6.1 Foursquare check-ins
First we return to the Foursquare check-in data discussed in the introduction. These data were gathered by Cheng et al. [8] from several online social media, but mostly (>50%) from Foursquare, and consist of user IDs, check-in times, and venues. The data comprise 225K users and 22M recorded check-ins. Data such as this could be useful to identify patterns in people’s mobility, busy times of certain services, etc. Ordinarily, we would represent these data using three entity types and ternary relationship instances: user x checks in at time t at venue y. To keep this example as simple as possible, we omit the information about venues.
In this case, we are interested in finding patterns in the check-in times across users, such as “many users check in somewhere both between 8.30 and 9.30 in the morning and between 11.30 and 12.30 around noon”. Such patterns cannot be identified by running PNRMiner on the data directly, because relationship instances carry no weights or any information about their probability. Hence, users that check in frequently and are tracked over a long period of time will have checked in somewhere at many times of the day.
We are interested in discovering patterns that may include time intervals and not just specific times. We asked PNRMiner to try intervals of up to one, one and a half, and two hours. The reason we consider several options is that the more intervals there are, the more difficult the computational problem becomes. For each interval size, we identified the largest subsample of the data that we could process in less than 8 hours^{10}, using a reasonable constraint on the minimum number of users in any CCPS, each time cutting the data size in half. We found these sample fractions to be \(2^{-8}\) (879 users), \(2^{-9}\) (440 users), and \(2^{-10}\) (220 users), with minimum constraints of 0, 10, and 10 users in all patterns.
None of the settings yields substantially more interesting patterns than another. The ‘up to 2-hour intervals’ setting adds the least information over the other two; more than half of the top-100 patterns for that setting contain only intervals shorter than 1.5 h and are thus also present in those results, and the interestingness scores are \({<}0.815\), while the top-65 for ‘\({\le }1\) hour’ and the top-26 for ‘\({\le }1.5\) hours’ have higher scores, up to 0.861 and 0.855, respectively. Notice that such scores are not straightforward to interpret, because whether a score is low or high depends on the data at hand. For example, the pattern ranked 4th for ‘\({\le }2\) hours’ is interesting. It contains three intervals and reads: 4.5% of the users checked in frequently in \([\text {1.10 am–2.30 am}]\), \([\text {4.30 pm–6.30 pm}]\), as well as \([\text {8.30 pm–9.30 pm}]\).
The overall most informative pattern that we identify is: \(1.6\,\%\) of the users checked in frequently between [6 am–7 am], as well as [10.10 am–10.50 am]. This means that, compared to the number of users that check in frequently during either of those intervals, there is a surprisingly large set of users that checks in frequently during both. This pattern was found in the subsample of 879 users using intervals up to one hour in duration. Interestingly, in that case computing the results without constraints took 2 hours 20 minutes, but all except one pattern in the top-700 (ranked 269) contain at least ten users, a result that can be computed in roughly half the time (1 h 13 m).
To confirm that handling intervals is relevant, we identified the top-ranked pattern that does not include any intervals; it is ranked 892nd, 2962nd, and 10138th for the three settings, respectively. This clearly shows that patterns with intervals are more interesting in terms of information content. We also tested the relevance of the new interestingness score by comparing the ranking of PNRMiner against that of NRMiner on data augmented such that both produce the same patterns. We find that Kendall’s tau is 0.337 and 0.352, respectively (NRMiner did not finish within the specified maximum of 8 hours on the third dataset), which highlights that accounting for the partial order when computing interestingness is highly relevant.
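For reference, Kendall's tau between two rankings of the same pattern set can be computed with a minimal \(O(n^2)\) implementation such as the following sketch (the pattern identifiers used for illustration are hypothetical):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items.

    rank_a and rank_b map each item to its rank (1 = most interesting).
    Returns a value in [-1, 1]; 1 means identical orderings, -1 reversed.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # same sign of rank difference in both rankings => concordant pair
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

This plain tau ignores ties; for rankings with tied scores a tie-corrected variant (tau-b) would be more appropriate.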
6.2 Amazon book ratings
As a second case study, we downloaded a snapshot of Amazon product reviews from SNAP^{11}. This dataset contains around 500K products, 8M reviews with ratings from 1 to 5, and 2.5M product category memberships. From this we selected all reviews about books and uniformly subsampled 1 % of the customers.
We ran PNRMiner on this dataset with constraints of at least 6 books and 20 customers. As an example, we present the highest-ranked pattern. It contains 23 customers and 8 books, all of which are different versions of the book “Left Behind: A Novel of the Earth’s Last Days”, a rating interval of [1–5], and the subjects Fiction and Christianity. To our surprise, we found that most patterns in the result are like this: different versions of the same book (hardcover, audiobook, etc.).
Inspection of the raw data led us to the hypothesis that this happens because reviews are copied across different versions of the same book. Unfortunately, the review texts were not crawled, so it is not straightforward to identify which reviews for different items are equivalent. We attempted to tackle this problem by keeping only one version of each book, identifying copies as reviews that share the same date, rating, and user. However, after removing duplicates with this procedure, little structure remains in the data.
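The deduplication heuristic can be sketched as follows. The record layout and field names (`user`, `date`, `rating`, `product`) are assumptions for illustration, not the actual schema of the SNAP dump:

```python
def deduplicate_reviews(reviews):
    """Keep only the first review seen per (user, date, rating) triple.

    A heuristic for collapsing reviews that were copied across different
    versions of the same book; `reviews` is a list of dicts with at
    least the keys "user", "date", and "rating".
    """
    seen = set()
    kept = []
    for r in reviews:
        key = (r["user"], r["date"], r["rating"])
        if key not in seen:
            seen.add(key)
            kept.append(r)  # first occurrence wins; later copies dropped
    return kept
```

Note the limitation discussed in the text: two genuinely distinct reviews by the same user with the same date and rating would also be collapsed, which is why review text would make a better key if it were available.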
We also ran NRMiner on the same dataset, augmenting it with all the implied relationship instances. We see that the same pattern is now ranked at the 21st position. This is because NRMiner does not take into account the dependencies between the intervals and, as a result, intervals are by definition more highly connected and relationship instances containing intervals are more probable. This confirms that our new derivation of the interestingness score is indeed relevant.
6.3 Fisher’s Iris data
The Iris data^{12} have been used pervasively in machine learning and pattern recognition textbooks. The data consist of 150 measurements of Iris plants, each with four numerical attributes and a class label (one of three species). In Sect. 2, we have shown that PNRMiner can be used to mine tiles and frequent patterns. However, it can also be used to mine subgroups and subspace clusters, which we highlight in this case study.
Subgroup discovery is a form of pattern mining where a user chooses a target attribute and the aim is to find rules that predict high values of this attribute or, if the target is binary, rules that predict it to be true. For the Iris data, this means that we would like to find rules, based on the four numerical attributes, that predict a specific class label. We model the data as five entity types. We discretise each numerical attribute into ten different values using equal-width binning and include intervals of up to six adjacent values. This substantially reduces the computation time, while hardly affecting the patterns.
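The discretisation step can be sketched as follows, assuming plain equal-width binning; the helper names are ours, not part of PNRMiner:

```python
def equal_width_bins(values, k=10):
    """Discretise a numeric attribute into k equal-width bins.

    Returns the bin index (0..k-1) for each value; the maximum value
    is clamped into the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def candidate_intervals(k=10, max_len=6):
    """All intervals of up to max_len adjacent bin values,
    as inclusive (first_bin, last_bin) pairs."""
    return [(i, j)
            for i in range(k)
            for j in range(i, min(i + max_len, k))]
```

With k = 10 bins and intervals of up to 6 adjacent values, this yields 10 + 9 + 8 + 7 + 6 + 5 = 45 candidate intervals per attribute, which illustrates why capping the interval length keeps the search space manageable.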
Subspace clustering is an unsupervised form of pattern mining. The goal is to discover clusters in the data but, unlike in traditional clustering, not to provide a full partitioning of the data, and there is no requirement to use all variables. Our framework has roughly the same aim and could as such be considered a relational (exhaustive) approach to subspace clustering. As with the check-in data, our framework enables identification of patterns that are unattainable using existing methods.
To find subspace clusters in the data, we ran PNRMiner without constraints on the Iris data, leaving out the class labels. As output we find 25,365 patterns. The top pattern is: \(pl = [\text {1.295–1.885}]\), \(pw = [\text {0.22–0.7}]\), \(sl = [\text {4.84–5.2}]\), \(sw = [\text {3.08–4.04}]\), with an interestingness score of 1.4744. It is similar to the top pattern predicting class 1 (see Figure 5), except that the intervals for sepal length and sepal width are slightly narrower. The first pattern that does not span all four attributes, i.e., a genuine subspace cluster, occurs at rank 347 and is quite specific already: \(pl = [\text {1.295–1.885}]\), \(pw = [\text {0.22–0.7}]\), \(sl = [\text {4.48–5.2}]\), with an interestingness score of 1.1040, omitting sepal width.
As a final remark, we are not suggesting that PNRMiner can replace all existing subgroup discovery and subspace clustering methods, because PNRMiner has a high computational cost, owing to its exhaustive search strategy. On the other hand, the advantage of exhaustive search is that the identified patterns are truly the most informative patterns in the data.
7 Scalability
To test the scalability of the algorithm and study what we gain by using the attribute structure in the form of the properness constraint, we again look at the check-in data. We created 11 versions of the data, each time discarding half of the remaining users and their relationship instances (the check-in moments). We then ran both NRMiner, on data augmented with the additional entities and relationship instances, and PNRMiner; both output the same set of maximal CCPSs.
We are also interested in how the depth of the partial order of an attribute affects the scalability and the potential speed-up of PNRMiner. Hence, we tested runtimes for 6, 9, and 12 levels, i.e., time intervals of up to one, one and a half, and two hours. We exhaustively tested constraints on the number of users of 10, 20, 40, etc., up to the sample size. We stopped any experiment that had not finished after 24 hours.
8 Related work
8.1 Exploratory versus predictive patterns
The broad purpose of the framework presented in this paper is to facilitate exploration of data in an entirely unsupervised manner. This distinguishes the framework from other local pattern frameworks for multi-relational data mining, such as Safarii [15], and more generally from approaches based on inductive logic programming. These alternative frameworks operate by a user selecting one attribute or a set of attributes as a target, after which an algorithm builds rules to predict that target using the full relational data. Wu et al. [27] introduced a method for finding interesting chains of bi-clusters in relational data, which has a similar goal to our framework. Their approach differs in that they consider only binary relationships, they employ a greedy heuristic algorithm to find interesting patterns, and their method does not account for the structure of attributes in any way.
8.2 Pattern syntax
The pattern syntax proposed in this paper is unique in being both relational and able to deal with structured attribute types such as ordinal and real-valued attributes, taxonomy terms, and more. The proposed pattern syntax, being local, builds on the frequent pattern mining literature. Indeed, the CCS pattern syntax [24], which it generalises, has itself been shown to be a generalisation of a local pattern type in binary databases known as tiles [12], which are essentially equivalent to frequent itemsets.
8.3 Structured attribute types
Real-valued and ordinal attributes have been dealt with before in local pattern mining, notably in subgroup discovery and exceptional model mining. For example, in subgroup discovery, approaches have been developed to infer subgroup descriptions in terms of intervals for real-valued attributes and subsets of values for categorical attributes. A notable paper in this regard is [19], where an efficient algorithm is introduced for finding optimal subgroups using any convex quality measure. Exceptional model mining, on the other hand, aims to extend subgroup discovery beyond a single target attribute [17]. None of these approaches, however, is as generic as our proposed approach: they are either ad hoc or remain limited to very specific types of structured attributes. The approach of modelling the structure of the attributes as a partial order is also entirely novel.
8.4 Interestingness formalisations
The formalisation of the interestingness of local patterns is a highly active research area, with most research focused on itemsets in binary databases. This makes sense, as the problem is most acute for exploratory data mining approaches, where there is no particular set of target attributes to be predicted. Many approaches to formalising interestingness are based on modelling the unexpectedness of a pattern: the extent to which the pattern presents novel, surprising, or unexpected information to the user. A recent survey is [16].
There are three major lines of research aimed at mining (sets of) ‘interesting’ local patterns. Constrained randomisation techniques are based on the assumption that a pattern is more interesting if it is not present in randomised data [13, 18, 20]. Methods based on the Minimum Description Length principle assume that a pattern is more interesting if it provides better compression [26]. Approaches based on the Maximum Entropy (MaxEnt) principle assume a pattern is more interesting the more surprising it is given a MaxEnt-based background model [9, 10]. Both randomisation and MaxEnt approaches have been shown to allow for accounting for prior knowledge, thus enabling subjective interestingness and iterative data mining.
The MaxEnt approach and the subjective interestingness framework FORSIED have been shown to be highly flexible in terms of pattern types [11]. Additionally, they have been used successfully to quantify the interestingness of patterns for RMiner [24], on which we directly build. For these reasons we used this paradigm to formalise the interestingness of the patterns in the current paper. Clearly, a direct application of interestingness as defined in [24] would not have yielded desirable results, as the dependencies between relationship instances would be ignored (see also Sect. 6).
In the work on Domain Driven Data Mining [3, 5], it is stressed that there is a difference between technical and business interestingness. For patterns to be actionable, technical interestingness often does not suffice and patterns are only truly interesting if they reveal relations that are directly related to the business model, i.e., they take into account domain knowledge of the business [4]. Furthermore, a distinction is made between objective and subjective interestingness. Notably, in that line of work there are also results on mining patterns across data tables, called combined mining [6].
It is important to note that the FORSIED framework [9, 11] attempts to integrate objective and subjective interestingness by means of an objective score function that explicitly accounts for prior beliefs specified by the user. We have so far assumed that the user wants to learn everything about the data, and we have largely ignored what to do if the user is interested only in (relationships to) part of the data. We envision that it should be possible to integrate both technical and business interestingness in our framework: one could manipulate the constraints on the minimum number of entities of certain types, as well as the prior beliefs, to ensure that only patterns are found that are indeed interesting to the end user, whatever the context. However, further research in this direction is necessary.
8.5 Enumeration algorithms
The algorithm that we derived for the enumeration of maximal CCPSs is based on the generic fixpoint-enumeration algorithm for enumerating all closed sets in a strongly accessible set system, introduced by Boley et al. [1]. This algorithmic scheme has been used before in the data mining literature for enumerating maximal CCSs [24], including extensions to n-ary relations [23] and approximate CCSs [22]. Here, we adhere to the same algorithmic scheme. In order to be able to use the scheme, we model the structure of attributes as a partial order, augment the pattern syntax, and add a properness constraint to the definition of the set of augmentation elements. As may be apparent, these changes are not trivial, and neither is the proof that the algorithmic scheme still works.
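For intuition, a much-simplified version of such a fixpoint-enumeration scheme can be written as a divide-and-conquer recursion over an arbitrary closure operator. This sketch omits the properness constraint and all relational machinery, so it is an illustration of the generic scheme of Boley et al., not our CCPS algorithm:

```python
def enumerate_closed_sets(ground, closure, seed=frozenset(), banned=frozenset()):
    """List closed sets of `closure` over `ground` without duplicates.

    Divide-and-conquer sketch: extend the current closed set by one
    element at a time; elements already covered by an earlier branch
    are "banned", and any branch whose closure re-introduces a banned
    element is pruned as a duplicate.
    """
    current = closure(seed)
    if current & banned:
        return  # this branch would duplicate an earlier one
    yield current
    local_ban = set(banned)
    for e in sorted(ground - current - banned):
        yield from enumerate_closed_sets(
            ground, closure, current | {e}, frozenset(local_ban))
        local_ban.add(e)  # e is fully covered; ban it in later branches
```

For a toy binary database with transactions {a,b}, {a,c}, and {a,b,c}, taking `closure(X)` to be the set of items shared by all transactions containing X yields exactly the four closed itemsets {a}, {a,b}, {a,c}, and {a,b,c}.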
9 Conclusions
An important obstacle for the adoption of exploratory data mining techniques in general, and local pattern mining approaches in particular, is their limited flexibility in terms of the data types to which they can be applied (e.g., only tabular data) and the types of patterns they can generate (e.g., subgroups, itemsets, n-sets). In reality, however, data are often complexly structured (as in, e.g., a relational database), and additionally there is often structure among the different values data attributes may attain, i.e., attribute values can be ordinal, interval, taxonomy terms, and more.
Attempts to resolve this inflexibility for specific data and pattern types are numerous. Yet, we are unaware of any generic approach that comes close to subsuming the range of pattern syntaxes considered by the local pattern mining research community while allowing for data types with a broad range of structures. The contributions made here may be an important step in this direction.
Our contributions raise a number of new research challenges. Ideally, the pattern syntax would be tolerant to missing relations to ensure noise resilience, similar to [22]. The interestingness measure could be made more versatile by considering a wider range of prior belief types. Another interesting question is whether the enumeration algorithm can still be improved. Our algorithm is similar to the Bron–Kerbosch algorithm for enumerating maximal cliques in a graph, for which it is known that the worst-case complexity of \(O(3^{n/3})\) is optimal, since it matches the maximum number of maximal cliques in a graph [25]. Yet another interesting direction for future work is developing heuristic algorithms for finding interesting CCPSs directly, in order to avoid the costly exhaustive search.
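For reference, the basic Bron–Kerbosch scheme mentioned above can be written in a few lines. This is the textbook version without pivoting, shown only to make the analogy concrete; it is not our CCPS enumeration algorithm:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Report all maximal cliques of the graph given by adjacency `adj`.

    R: current clique; P: candidate vertices that extend R;
    X: vertices already covered by earlier branches (excluded to
    avoid reporting non-maximal or duplicate cliques).
    """
    if not P and not X:
        out.append(frozenset(R))  # no extension possible: R is maximal
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

def maximal_cliques(adj):
    out = []
    bron_kerbosch(frozenset(), frozenset(adj), frozenset(), adj, out)
    return out
```

On a graph consisting of a triangle a–b–c with a pendant edge c–d, this enumerates exactly the two maximal cliques {a, b, c} and {c, d}, matching the \(O(3^{n/3})\) worst-case bound discussed above.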
Footnotes
 1.
 2.
Of course, exploring other types of prior beliefs is an important line of further work.
 3.
The intuition is as follows. In practice, the probabilities for relationship instances under the background distribution are small. Additionally, for two events with small probabilities p and q, the probability of their union lies between \(p+q\) (in the case of perfect negative dependence) and \(1-(1-p)(1-q)=p+q-pq\) (in the case of independence), which differ by only pq, such that assuming independence results in at most a second-order error in the probabilities.
 4.
As pointed out in Sect. 3.4, this expression is exact for databases where relationship instances involve only minimal pairs, and a good approximation in practice in other cases. Note also that only minimal relationship instances have positive probability, and hence nonminimal instances can be ignored.
 5.
Note again that only minimal relationship instances \((e',f')\) need to be considered, since nonminimal relationship instances have zero probability.
 6.
This is actually not the case here, because we are interested only in maximal sets, rather than all fixpoints (closed sets). This is explained further in Sect. 4.5.
 7.
The current implementation computes the closure only once, which probably negatively impacts the performance.
 8.
N.B.: C, A, and the types are also fixed for given sets F and B.
 9.
This optimisation is currently not in the implementation, and that probably negatively impacts the performance.
 10.
Unfortunately our current implementation does not use any parallelisation, so it runs only in a single thread.
 11.
 12.
Notes
Acknowledgments
This work was supported by the European Union (ERC Grant FORSIED 615517) and the EPSRC (Grant EP/M000060/1).
References
 1. Boley, M., Horváth, T., Poigné, A., Wrobel, S.: Listing closed sets of strongly accessible set systems with applications to data mining. TCS 411(3), 691–700 (2010)
 2. Borgelt, C.: Frequent item set mining. WIREs DMKD 2(6), 437–456 (2012)
 3. Cao, L.: Domain driven data mining (D3M). In: ICDM Workshops, pp. 74–76 (2008)
 4. Cao, L.: Domain-driven data mining: challenges and prospects. IEEE TKDE 22(6), 755–769 (2010)
 5. Cao, L., Yu, P.S., Zhang, C., Zhao, Y.: Domain Driven Data Mining. Springer, New York (2010)
 6. Cao, L., Zhang, H., Zhao, Y., Luo, D., Zhang, C.: Combined mining: discovering informative knowledge in complex data. IEEE TSMCB 41(3), 699–712 (2011)
 7. Cerf, L., Besson, J., Robardet, C., Boulicaut, J.F.: Data Peeler: constraint-based closed pattern mining in n-ary relations. In: Proceedings of SDM, pp. 37–48 (2008)
 8. Cheng, Z., Caverlee, J., Lee, K., Sui, D.Z.: Exploring millions of footprints in location sharing services. In: Proceedings of ICWSM, pp. 81–88 (2011)
 9. De Bie, T.: An information-theoretic framework for data mining. In: Proceedings of KDD, pp. 564–572 (2011)
 10. De Bie, T.: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. DMKD 23(3), 407–446 (2011)
 11. De Bie, T.: Subjective interestingness in exploratory data mining. In: Proceedings of IDA, pp. 19–31 (2013)
 12. Geerts, F., Goethals, B., Mielikäinen, T.: Tiling databases. In: Proceedings of DS, pp. 278–289 (2004)
 13. Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. TKDD 1(3), 14 (2007)
 14. Herrera, F., Carmona, C.J., González, P., del Jesus, M.J.: An overview on subgroup discovery: foundations and applications. KAIS 29(3), 495–525 (2011)
 15. Knobbe, A.J.: Multi-relational Data Mining. IOS Press, Amsterdam (2006)
 16. Kontonasios, K.N., Spyropoulou, E., De Bie, T.: Knowledge discovery interestingness measures based on unexpectedness. WIREs DMKD 2(5), 386–399 (2012)
 17. Leman, D., Feelders, A., Knobbe, A.: Exceptional model mining. In: Proceedings of ECML-PKDD, pp. 1–16 (2008)
 18. Lijffijt, J., Papapetrou, P., Puolamäki, K.: A statistical significance testing approach to mining the most informative set of patterns. DMKD 28(1), 238–263 (2014)
 19. Mampaey, M., Nijssen, S., Feelders, A., Knobbe, A.: Efficient algorithms for finding richer subgroup descriptions in numeric and nominal data. In: Proceedings of ICDM, pp. 499–508 (2012)
 20. Ojala, M., Vuokko, N., Kallio, A., Haiminen, N., Mannila, H.: Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of SDM, pp. 494–505 (2008)
 21. Spyropoulou, E.: Local pattern mining in multi-relational data. Ph.D. thesis, University of Bristol (2013)
 22. Spyropoulou, E., De Bie, T.: Approximate multi-relational patterns. In: Proceedings of DSAA, pp. 477–483 (2014)
 23. Spyropoulou, E., De Bie, T., Boley, M.: Mining interesting patterns in multi-relational data with n-ary relationships. In: Proceedings of DS, pp. 217–232 (2013)
 24. Spyropoulou, E., De Bie, T., Boley, M.: Interesting pattern mining in multi-relational data. DMKD 28(3), 808–849 (2014)
 25. Tomita, E., Tanaka, A., Takahashi, H.: The worst-case time complexity for generating all maximal cliques and computational experiments. TCS 363(1), 28–42 (2006)
 26. Vreeken, J., van Leeuwen, M., Siebes, A.: Krimp: mining itemsets that compress. DMKD 23(1), 169–214 (2011)
 27. Wu, H., Vreeken, J., Tatti, N., Ramakrishnan, N.: Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. DMKD 28(5–6), 1398–1428 (2014)