1 Introduction and motivation

In many applications of frequent item set mining one faces the problem that the transaction data to analyze is imperfect: items that are actually contained in a transaction are not recorded as such. The reasons can be manifold, ranging from noise through measurement errors to an underlying feature of the observed process. For instance, in gene expression analysis, where one may try to find co-expressed genes with frequent item set mining (Creighton and Hanash 2003), binary transaction data are often obtained by thresholding originally continuous data, which are easily affected by noise in the experimental setup or limitations of the measurement devices. Analyzing alarm sequences in telecommunication data for frequent episodes can be impaired by alarms being delayed or dropped due to the fault causing the alarm also affecting the transmission system (Wang et al. 2005). In neurobiology, where one searches for ensembles of neurons in parallel spike trains with the help of frequent item set mining and related approaches (Gerstein et al. 1978; Grün and Rotter 2010; Berger et al. 2009), ensemble neurons are expected to participate in synchronous activity only with a certain probability.

In this paper we present two new algorithms to cope with such problems. The first algorithm relies on a standard item set enumeration scheme and is fairly closely related to item set mining based on cover similarities (Segond and Borgelt 2011). It efficiently computes the subset size occurrence distribution of item sets, evaluates this distribution to find fault-tolerant item sets, and uses intermediate data to remove pseudo (or spurious) item sets, which contain one or more items that are too weakly supported (that is, that occur in too few of the supporting transactions) to warrant their inclusion in the item set.

The second algorithm employs a heuristic search scheme (rather than a full enumeration) and is inspired by the observation that in sparse data relevant item sets produce “lines” in a binary matrix representation of the transactional data if their items are properly reordered, or even “tiles” if the transactions are properly reordered as well. The interesting item sets are then found by linear traversals of the items and statistical tests. The main advantage of this method compared with item set enumeration schemes is that it considerably lowers the number of statistical tests that are needed to confirm the significance of found item sets. Thus it reduces the number of false positive results, while still being able to find the most interesting item sets (or at least a part of them).

The rest of this paper is structured as follows: in Sect. 2 we review the task of approximate/fault-tolerant item set mining and categorize different approaches to this task. (However, we do not consider item-weighted or uncertain transaction data.) In Sect. 3 we describe how our first algorithm traverses the search space, how it efficiently computes the subset size occurrence distribution for each item set it visits, and how this distribution is evaluated. In Sect. 4 we discuss how the intermediate/auxiliary data that are available in our algorithm can be used to easily cull pseudo (or spurious) item sets. In Sect. 5 we evaluate our first algorithm on artificially generated data with injected approximate item sets in order to confirm its effectiveness. In addition, we compare its performance with two other algorithms that fall into the same category and for specific cases can be made to find the exact same item sets. In Sect. 6, we apply our first algorithm to a concept detection task on the 2008/2009 Wikipedia Selection for schools to demonstrate its practical usefulness. In Sect. 7 we review the basics for the second algorithm, namely methods to measure the (dis)similarity of item covers by means of similarity/distance measures for sets and binary vectors. In Sect. 8 we consider, as a first step, how these (dis)similarities can be used in a clustering algorithm to find interesting item sets. In Sect. 9 we improve on this approach by applying a non-linear mapping to reorder the items, so that they can be tested in a linear traversal. In Sect. 10 we describe an application of this method to find neuron ensembles in (simulated) parallel spike trains. Finally, in Sect. 11 we draw conclusions and point out possible future work.

2 Approximate or fault-tolerant item set mining

In standard frequent item set mining only transactions that contain all of the items in a given set are counted as supporting this set. In contrast to this, in approximate (or fault-tolerant or fuzzy) item set mining transactions that contain only a subset of the items can still support an item set, though possibly to a lesser degree than transactions that contain all items. Based on the illustration of these situations shown in the diagrams on the left and in the middle of Fig. 1, approximate item set mining has also been described as finding almost pure (geometric or combinatorial) tiles of ones in a binary matrix that indicates which items are contained in which transactions (Gionis et al. 2004).

Fig. 1 Different types of item sets illustrated as binary matrices

In order to cope with missing items in the transaction data to analyze, several approximate (or fault-tolerant or fuzzy) frequent item set mining approaches have been proposed. They can be categorized roughly into three classes: (1) error-based approaches, (2) density-based approaches, and (3) cost-based approaches.

2.1 Error-based approaches

Examples of error-based approaches are Pei et al. (2001) and Besson et al. (2006). In the former the standard support measure is replaced by a fault-tolerant support, which allows for a maximum number of missing items in the supporting transactions, thus ensuring that the measure is still anti-monotone. The search algorithm itself is derived from the famous Apriori algorithm (Agrawal and Srikant 1994). In Besson et al. (2006) constraints are placed on the number of missing items as well as on the number of (supporting) transactions that do not contain an item in the set. Hence it is related to the tile-finding approach in Gionis et al. (2004). However, it uses an enumeration search scheme that traverses sub-lattices of items and transactions, thus ensuring a complete search, while Gionis et al. (2004) relies on a heuristic scheme.

2.2 Density-based approaches

Rather than fixing a maximum number of missing items, density-based approaches allow a certain fraction of the items in a set to be missing from the transactions, thus requiring the corresponding binary matrix tile to have a minimum density. This means that for larger item sets more items are allowed to be missing than for smaller item sets. As a consequence, the measure is no longer anti-monotone if the density requirement is to be fulfilled by each individual transaction. To overcome this, Yang et al. (2001) require only that the average density over all supporting transactions exceed a user-specified threshold. As an alternative, Seppänen and Mannila (2004) define a recursive measure for the density of an item set.

2.3 Cost-based approaches

In error- or density-based approaches all transactions that satisfy the constraints contribute equally to the support of an item set, regardless of how many items of the set they contain. In contrast to this, cost-based approaches define the contribution of transactions in proportion to the number of missing items. In Wang et al. (2005) and Borgelt and Wang (2009) this is achieved by means of user-provided item-specific costs or penalties, with which missing items can be inserted. These costs are generally combined with each other and with the initial transaction weight of 1 with the help of a t-norm (triangular norm; see Klement et al. 2000 for a comprehensive treatment). In addition, a minimum weight for a transaction can be specified, by which the number of insertions can be limited.
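The following sketch merely illustrates this weighting scheme (it is not the RElim or SaM implementation); the product t-norm, the per-item penalty table, and the minimum transaction weight are assumptions chosen for the example.

```python
# Cost-based support contribution of a single transaction: it starts with
# weight 1, every item of the set it lacks is "inserted" at an item-specific
# penalty, and the penalties are combined with the product t-norm T(a,b) = a*b.
def transaction_weight(item_set, transaction, penalty, min_weight=0.1):
    weight = 1.0
    for item in item_set:
        if item not in transaction:
            weight *= penalty[item]      # combine penalties (product t-norm)
            if weight < min_weight:      # limit the number of insertions
                return 0.0
    return weight

# Hypothetical penalties: inserting item 'c' reduces the weight by a factor 0.5.
penalty = {'a': 0.25, 'b': 0.25, 'c': 0.5}
print(transaction_weight({'a', 'b', 'c'}, {'a', 'b'}, penalty))   # -> 0.5
```

The (extended) support of an item set is then simply the sum of these transaction weights over all transactions.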

Note that the cost-based approaches can be made to contain the error-based approaches as limiting or extreme cases, since one may set the cost (or penalty) of inserting an item into a transaction in such a way that the transaction weight is not reduced. In this case, limiting the number of insertions obviously has the same effect as allowing for a maximum number of missing items.

The first approach presented in this paper falls into the category of cost-based approaches, since it reduces the support contribution of transactions that do not contain all items of a considered item set. How much the contribution is reduced and how many missing items are allowed can be controlled directly by a user. However, it treats all items the same, while the cost-based approaches reviewed above allow for item-specific penalties. Its advantages are that, depending on the data set, it can be faster, admits more sophisticated support/evaluation functions, and allows for a simple filtering of pseudo (or spurious) item sets.

In pseudo (or spurious) item sets a subset of the items is strongly correlated in many transactions—possibly even a perfect item set (all subset items are contained in all supporting transactions). In such a case the remaining items may not occur in any (or only in very few) of the (fault-tolerantly) supporting transactions, but despite the ensuing reduction of the weight of all transactions, the item set support can still exceed the user-specified threshold (see Fig. 1 on the right for an illustration of a pseudo (or spurious) item set and note the regular pattern of missing items compared with the middle diagram). Obviously, such item sets are not useful and should be discarded, which is easy in our algorithm, but difficult in the cost-based approaches reviewed above.

The second approach we present in this paper is not easily classified into the above categories, because it does not rely on item set enumeration, but rather on a heuristic search scheme, which does not ensure a complete result. Since it decides by statistical tests which item sets are interesting/relevant, it does not use (explicit) transaction costs, nor does it require strictly limited errors or a minimum density. It is most closely related to the tile-finding approach of Gionis et al. (2004), but significantly improves the item ordering criterion and thus can apply a much simpler method to actually identify the significant item sets.

As a final comment we remark that a closely related setting is the case of uncertain transactional data, where each item is endowed with a transaction-specific weight or probability. This weight is meant to indicate the degree or chance with which the item is actually a member of the transaction. Approaches to this certainly related, but nevertheless fundamentally different, problem, which we do not consider here, can be found, for example, in Chui et al. (2007), Leung et al. (2007), Aggarwal et al. (2009) and Calders et al. (2010).

3 Subset size occurrence distribution

The basic idea of our first algorithm is to compute, for each visited item set, how many transactions contain subsets with \( 1, 2, \ldots, k\) items, where k is the size of the considered item set. We call this the subset size occurrence distribution of the item set, as it states how often subsets of different sizes occur. Note that we do not distinguish different subsets of the same size, and thus treat all items the same: it is only relevant whether (and, if so, how many) items are missing, but not which specific items are missing. This distinguishes this algorithm from those presented in Wang et al. (2005) and Borgelt and Wang (2009), which allow for item-specific insertion costs (or penalties).

The subset size occurrence distribution is evaluated by a function that combines, in a weighted manner, the entries that refer to subsets of a user-specified minimum size, which is stated relative to the size of the item set itself and thus corresponds to a maximum number of missing items. Item sets that reach a user-specified minimum value for the evaluation measure are reported.

Computing the subset size occurrence distribution is surprisingly easy with the help of an intermediate array that records for each transaction how many of the items in the currently considered set are contained in it. In the search, which is a standard depth-first search in the subset lattice that can also be seen as a divide-and-conquer approach (see, for example, Borgelt and Wang 2009 for a formal description), this intermediate array is updated each time an item is added to or removed from the current item set. The counter update is most conveniently carried out with the help of transaction identifier lists. That is, our algorithm uses a vertical database representation and thus is closely related to the well-known Eclat algorithm (Zaki et al. 1997). The updated fields of the item counter array then give rise to updates of the subset size occurrence distribution, which records, for each subset size, how many transactions contain at least as many items of the current item set.

Pseudo-code of the (simplified) recursive search procedure is shown in Fig. 2. Together with the recursion the main while-loop implements the depth-first/divide-and-conquer search by first including an item in the current set (first subproblem—handled by the recursive call) and then excluding it (second subproblem—handled by skipping the item in the while-loop).

Fig. 2 Simplified pseudo-code of the recursive search procedure

The for-loop at the beginning of the outer while-loop first increments the item counters for each transaction containing the current item n (note that items are coded by consecutive integers starting at 0), which thus is added to the current item set. Then the subset size occurrence distribution is updated by drawing on the new values of the updated item counters. Note that one could also remove a transaction from the counter for its original subset size (that is, for its item count before the current item is added), so that the distribution array elements represent the number of transactions that contain exactly the number of items given by their indices. This could be achieved with an additional instruction \(\mathbf{dec}(dist[cnts[t[i]]])\) as the first statement of the for-loop. However, this operation is costlier than forming differences between neighboring elements in the evaluation function, which yields the same values (see Fig. 4, to be discussed later).

As an illustration, Fig. 3 shows an example of the update. The top row shows the list of transaction identifiers for the current item n (held in the pseudo-code in the local variable t), which is traversed to select the item counters that have to be incremented. The second row shows these item counters, with old and unchanged counter values shown in black and updated values in blue. Using the new (blue) values as indices into the subset size distribution array, this distribution is updated. Again, old and unchanged values are shown in black and new values in blue. Note that dist[0] always holds the total number of transactions.

Fig. 3 Updating the subset size occurrence distribution with the help of an item counter array, which records the number of contained items per transaction

An important property of this update operation is that it is reversible. By traversing the transaction identifiers again, the increments can be retracted, thus restoring the original subset size occurrence distribution (before the current item n was added). This is exploited in the for-loop at the end of the outer while-loop in Fig. 2, which restores the distribution, by first decrementing the subset size occurrence counter and then the item counter for the transaction (that is, the steps are reversed w.r.t. the update in the first for-loop).

Between the for-loops the subset size occurrence distribution is evaluated and if the evaluation result reaches a user-specified threshold, the extended item set is actually constructed and reported. Afterwards, supersets of this item set are processed recursively and finally the current item is removed again. This is in perfect analogy to standard frequent item set algorithms like Eclat or FP-growth, which employ the same depth-first/divide-and-conquer scheme.
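To make this procedure concrete, the following Python sketch mirrors the structure of Fig. 2 (the actual implementation is in C; the names tid_lists, cnts, and dist follow the description above, the evaluation function is a plug-in, and no pruning is performed for simplicity, although with an anti-monotone measure the recursion could be skipped when the evaluation falls below the threshold).

```python
def recurse(items, prefix, tid_lists, cnts, dist, evaluate, min_eval, report):
    while items:
        n = items[0]                        # current item
        for t in tid_lists[n]:              # add item n: update the counters ...
            cnts[t] += 1
            dist[cnts[t]] += 1              # ... and the occurrence distribution
        e = evaluate(dist, len(prefix) + 1)
        if e >= min_eval:                   # construct and report the item set
            report(prefix + [n], e)
        recurse(items[1:], prefix + [n], tid_lists, cnts, dist,
                evaluate, min_eval, report)  # first subproblem: include item n
        for t in tid_lists[n]:              # retract the update (reverse order)
            dist[cnts[t]] -= 1
            cnts[t] -= 1
        items = items[1:]                   # second subproblem: exclude item n

def standard_support(dist, k):              # placeholder evaluation function
    return dist[k] if k < len(dist) else 0

# Tiny example: 3 items over 4 transactions, minimum evaluation 2.
transactions = [{0, 1}, {0, 1, 2}, {1, 2}, {0, 2}]
n_items = 3
tid_lists = {i: [t for t, tr in enumerate(transactions) if i in tr]
             for i in range(n_items)}
cnts = [0] * len(transactions)
dist = [0] * (n_items + 1)
dist[0] = len(transactions)                 # dist[0]: total number of transactions
recurse(list(range(n_items)), [], tid_lists, cnts, dist,
        standard_support, 2, lambda s, e: print(s, e))
```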

The advantage of our algorithm is that the evaluation function has access to fairly rich information about the occurrences of subsets of the current item set. While standard frequent item set mining algorithms only compute (and evaluate) dist[k] (which always contains the standard support) and the JIM algorithm (Segond and Borgelt 2011) computes and evaluates only dist[k], dist[1] (number of transactions that contain at least one item in the set), and dist[0] (total number of transactions), our algorithm knows (or can easily compute as a simple difference) how many transactions miss 1, 2, 3, etc. items. Of course, this additional information comes at a price, namely a higher processing time, but in return one obtains the possibility to compute much more sophisticated item set evaluations.

Note, however, that the asymptotic time complexity of the search (in terms of the O notation) is not worsened (that is, the higher processing time is basically a constant factor), since all frequent item set mining algorithms using a depth-first search or divide-and-conquer scheme have, in the worst case, a time complexity that is exponential in the number of items (or linear in the number of found item sets), as well as linear in the number of transactions.

A very simple example of such an evaluation function is shown in Fig. 4: it weights the numbers of transactions in proportion to the number of missing items. The weights can be specified by a user and are stored in a global weights array. We assume that \(wgts[0] = 1\) and \(wgts[i] \ge wgts[i+1].\) With this function fault-tolerant item sets can be found in a cost-based manner, where the costs are represented by the weights array. Note, however, that it can also be used to find item sets with an error-based scheme by setting \(wgts[0] = \cdots = wgts[r] = 1\) for a user-chosen maximum number r of missing items and \(wgts[i] = 0\) for all \(i > r\). With such a “crisp” weighting scheme all transactions contribute equally, provided they lack no more than r items of the considered set.
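Expressed in Python, such an evaluation function may be sketched as follows (Fig. 4 shows the pseudo-code actually used; the bounds handling below is a simplification). Since dist[i] holds the number of transactions containing at least i items of the current k-item set, the difference dist[k-j] - dist[k-j+1] is the number of transactions missing exactly j items.

```python
def extended_support(dist, k, wgts):
    e = 0.0
    for j, w in enumerate(wgts):       # j = number of missing items
        if k - j < 1:                  # do not weight transactions lacking all items
            break
        at_least = dist[k - j]
        at_most  = dist[k - j + 1] if k - j + 1 < len(dist) else 0
        e += w * (at_least - at_most)  # transactions missing exactly j items
    return e

# Example: k = 3 items, 10 transactions, one missing item weighted with 0.5.
dist = [10, 9, 6, 4]                   # dist[3] = 4: transactions with all 3 items
print(extended_support(dist, 3, [1.0, 0.5]))   # 4 + 0.5 * (6 - 4) = 5.0
```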

Fig. 4 Pseudo-code of a simple evaluation function

An obvious alternative to the simple weighting function of Fig. 4 is to divide the final value of e by dist[1] in order to obtain an extended Jaccard measure—an approach that is inspired by the JIM algorithm (Segond and Borgelt 2011). In principle, all measures listed in Segond and Borgelt (2011) can be generalized in this way, by simply replacing the standard support (all items are contained) by the extended support computed in the function shown in Fig. 4, thus providing a variety of measures.
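In code, this extension is a one-line addition to the sketch above (again only an illustration):

```python
def extended_jaccard(ext_support, dist):
    # dist[1]: number of transactions containing at least one item of the set
    return ext_support / dist[1] if dist[1] > 0 else 0.0
```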

Note that the extended support computed by the above function as well as the extended Jaccard measure that can be derived from it are obviously anti-monotone, since each element of the subset size occurrence distribution is anti-monotone (if elements are paired from the number of items in the respective sets downwards, as demonstrated in Fig. 4), while dist[1] is clearly monotone. This ensures the correctness of our algorithm in the sense that it is guaranteed to find all item sets satisfying the minimum evaluation criterion.

4 Removing pseudo/spurious item sets

As already mentioned earlier, pseudo (or spurious) item sets can result if there exists a subset of items that is strongly correlated and supported by many transactions. In such a case adding an item to this set may not reduce the support enough to let it fall below the user-specified threshold, even if this item is not contained in any of the transactions containing the correlated items. As an illustration consider the right diagram in Fig. 1: the third item is contained in only one of the eight transactions. However, the total number of missing items in this binary matrix (and thus the extended support) is the same as in the middle diagram, which we would consider as a representation of an acceptable fault-tolerant item set, since each item occurs in a sufficiently large fraction of the supporting transactions. (Of course, what counts as “sufficient” is a matter of choice and thus must be specified by a user.)

In order to cull such pseudo (or spurious) item sets from the output, we added to our algorithm a check whether all items of the set occur in a sufficiently large fraction of the supporting transactions. This check can be carried out in two forms: either the user specifies a minimum fraction of the support of an item set that must be produced from transactions containing the item (in this case the reduced weights of transactions with missing items are considered) or he/she specifies a minimum fraction of the number of supporting transactions that must contain the item (in this case all transactions have the same unit weight).

Both checks can fairly easily be carried out with the help of the vertical transaction representation (transaction identifier lists), the intermediate/auxiliary item counter array (with one counter per transaction), and the subset size occurrence distribution: one simply traverses the transaction identifier list for each item in the item set to check and computes the number of supporting transactions that contain the tested item (or the support contribution derived from these transactions). The result is then compared with the total number of supporting transactions (which is available in dist[m], where m is the number of weights—see Fig. 4) or the extended support (the result of the evaluation function shown in Fig. 4). If the result exceeds a user-specified threshold (given as a fraction or a percentage) for all items in the set, the item set is accepted; otherwise, it is discarded (from the output; but the set is still processed recursively, because these conditions are not anti-monotone and thus cannot be used for pruning).
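For illustration, the second (unweighted) form of this check may be sketched as follows; it reuses the transaction identifier lists and the item counter array from above, r denotes the maximum number of missing items, and the weighted form would accumulate the reduced transaction weights instead of counting transactions.

```python
def all_items_sufficiently_supported(item_set, tid_lists, cnts, k, r,
                                     supporting, min_fraction):
    """Check that every item of the k-item set occurs in at least
    min_fraction of the `supporting` transactions; a transaction supports
    the set if it lacks at most r of its items, i.e. if cnts[t] >= k - r."""
    for item in item_set:
        contained = sum(1 for t in tid_lists[item] if cnts[t] >= k - r)
        if contained < min_fraction * supporting:
            return False        # this item occurs in too few supporters
    return True
```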

In addition, it is often beneficial to filter the output for closed item sets (no superset has the same support/evaluation) or maximal item sets (no superset has a support/evaluation exceeding the user-specified threshold). In principle, this can be achieved with the same methods that are used in standard frequent item set mining. In our algorithm we consider closedness or maximality only w.r.t. the standard support (all items contained), but in principle, it could also be implemented w.r.t. the more sophisticated measures. Note, however, that this notion of closedness differs from the notion introduced and used in Boulicaut et al. (2003) and Pensa et al. (2006), which is based on δ-free item sets and is a mathematically more sophisticated approach. In principle, though, a check whether a found item set is closed w.r.t. this notion could be added to our algorithm, but we have not followed this path yet.

5 Experimental evaluation

We implemented the described item set mining approach as a C program, called SODIM (Subset size Occurrence Distribution based Item set Mining), which was essentially derived from an Eclat implementation (which provided the initial setup of the transaction identifier lists). We implemented all measures listed in Segond and Borgelt (2011), even though for these measures (in their original form) the JIM algorithm is better suited, because they do not require subset occurrence values beyond dist[k], dist[1], and dist[0]. However, we also implemented the extended support and the extended Jaccard measure [as well as generalizations of all other measures described in Segond and Borgelt (2011)], which JIM cannot compute. We also added optional culling of pseudo (or spurious) item sets, thus providing possibilities far surpassing the JIM implementation. This SODIM implementation has been made publicly available under the GNU Lesser (Library) General Public License.

In a first set of experiments we tested our implementation on data that were artificially generated with a program that was developed to simulate parallel neuronal spike trains (see also Sect. 10 below). We created a transaction database with 100 items and 10,000 transactions, in which each item occurs in a transaction with 5% probability (independent items, so co-occurrences are entirely random). Into this database we injected six groups of co-occurring items, which ranged in size from 6 to 10 items and which partially overlapped (some items were contained in two groups). For each group we injected between 20 and 30 co-occurrences (that is, in 20–30 transactions the items of the group actually co-occur). In order to compensate for the additional item occurrences due to this, we reduced (for the items in the groups) the occurrence probabilities in the remaining transactions (that is, the transactions in which they do not co-occur) accordingly, so that all items shared the same individual expected frequency. In addition, we removed from each co-occurrence of a group of items one of its items, thus creating the noisy instances of item sets we try to find with the SODIM algorithm. Note that due to this deletion scheme none of the transactions contained all items of a given group. As a consequence, no standard frequent item set mining algorithm is able to detect the groups, regardless of the used minimum support threshold.
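A strongly simplified sketch of this generation scheme is shown below; it injects only a single group and omits the compensation of the occurrence probabilities, so it is meant merely to illustrate the structure of the data.

```python
import random

random.seed(1)
n_items, n_trans, p_bg = 100, 10000, 0.05
group = list(range(10))                        # one injected group of 10 items
coincidences = set(random.sample(range(n_trans), 25))

transactions = []
for t in range(n_trans):
    items = {i for i in range(n_items) if random.random() < p_bg}
    if t in coincidences:                      # inject the group, but remove
        items |= set(group)                    # one of its items at random
        items.discard(random.choice(group))
    transactions.append(items)
```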

We then mined this database with SODIM, using a minimum standard support (all items contained) of 0, a minimum extended support of 10 (with a weight of 0.5 for transactions with one missing item), and a minimum fraction of supporting transactions containing each item of 75%. In addition, we restricted the output to maximal item sets (based on standard support) in order to suppress the output of subsets of the injected groups. This experiment was repeated several times with different databases generated in the way described above. We observed that the injected groups were always perfectly detected, while false positive results (usually with four items) were only rarely produced.

In a second set of experiments we compared SODIM with the two other cost-based methods reviewed in Sect. 2, namely RElim (Wang et al. 2005) and SaM (Borgelt and Wang 2009). As a test data set we chose the well-known BMS-Webview-1 data, which describes a web click stream from a leg-care company that no longer exists. This data set has been used in the KDD cup 2000 (Kohavi et al. 2000) as well as in many other comparisons of frequent item set mining algorithms. By properly parameterizing RElim and SaM (namely by choosing the same insertion penalty for all items and specifying a corresponding transaction weight threshold to limit the number of insertions), these methods can be made to find exactly the same item sets. We chose two insertion penalties (RElim and SaM) or downweighting factors for missing items (SODIM), namely 0.5 and 0.25, and tested with one and two insertions (RElim and SaM) or missing items (SODIM).

The results, which were obtained on an Intel Core 2 Quad Q9650 (3 GHz) computer with 8 GB main memory running Ubuntu Linux 10.04 (64 bit) and gcc version 4.4.3, are shown in Fig. 5. Clearly, SODIM outperforms both SaM and RElim by a large margin, with the exception of the lowest support value for one insertion and a penalty of 0.5, where SODIM is slightly slower than both SaM and RElim. It should be noted, though, that this does not render SaM and RElim useless for fault-tolerant item set mining, because they offer options that SODIM does not, namely the possibility to define item-specific insertion penalties. (SODIM treats all items the same.) On the other hand, SODIM allows for more sophisticated evaluation measures and the removal of pseudo (or spurious) item sets. Hence all three algorithms are useful.

Fig. 5 Execution times on the BMS-Webview-1 data set. Light colors refer to an insertion penalty factor of 0.25, dark colors to an insertion penalty factor of 0.5 (color figure online)

6 Application to concept detection

To demonstrate the practical usefulness of our method, we also applied it to the 2008/2009 Wikipedia Selection for schools, which is a subset of the English Wikipedia with about 5,500 articles and more than 200,000 hyperlinks. We used a subset of this data set that does not contain articles belonging to the subjects “Geography”, “Countries”, or “History”, resulting in about 3,600 articles and more than 65,000 hyperlinks. The excluded subjects do not affect the chemical topic we focus on in our experiment, but contain articles that reference many articles or that are referenced by many articles (such as United States with 2,230 references). Including the mentioned subject areas would lead to an explosion of the number of discovered item sets and thus would make it much more difficult to demonstrate the effects we are interested in.

The 2008/2009 Wikipedia Selection for schools describes 118 chemical elements. However, there are 157 articles that reference the Chemical element article or are referenced by it, so that simply collecting the referenced or referencing articles does not yield a good extensional representation of this concept. Searching for references to the Chemical element article thus results not only in articles describing chemical elements but also in other articles including Albert Einstein, Extraterrestrial Life, and Universe. Furthermore, there are 17 chemical elements (e.g. palladium) that do not reference the Chemical element article.

In order to better filter articles that provide information about chemical elements, one may try to extend the query with the titles of articles that are frequently co-referenced with the Chemical element article, but are more specific than a reference to/from this article alone. To find such co-references, we apply our SODIM algorithm: we converted each article into a transaction, such that each referenced article is an item in the transaction of the referring article. This resulted in a transaction database with 3,621 transactions.
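The conversion itself is straightforward; the sketch below uses a small hypothetical links mapping in place of the parsed hyperlink structure.

```python
# Each article becomes a transaction whose items are the titles of the
# articles it references (the `links` mapping is a hypothetical stand-in
# for the parsed Wikipedia selection).
links = {
    "Hydrogen": ["Chemical element", "Oxygen", "Electron", "Melting point"],
    "Albert Einstein": ["Chemical element", "Physics"],
}
transactions = [set(refs) for refs in links.values()]
```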

We then ran our SODIM algorithm with a minimum item set size of 5 and a minimum support (all items contained) of 25 in order to find suitable co-references. 29 of the 81 found item sets contain the item Chemical element and thus are candidates for the sets of co-referenced terms we are interested in. From these 29 item sets we chose the following item set for the subsequent experiments: {Oxygen, Electron, Hydrogen, Melting point, Chemical element}. Considering the semantics of the terms in this set, we can expect it to provide a better characterization of the extension of the concept of a chemical element.

In order to illustrate this, we retrieved from our selection of articles those containing the chosen item set, allowing for a varying number of missing items (from this set), which produces different selections of articles that are related to chemical elements. The results are shown in Table 1. The first column contains the allowed number of missing items (out of the 5 items in the item set stated above) and the second column the number of articles (transactions) that are retrieved under this condition. The third column states how many of these articles are actually about chemical elements and the fourth column how many other articles were retrieved (hence column 3 plus column 4 always equals column 2). Most interesting is the last column, which contains the number of discovered chemical elements that do not reference the Chemical element article.
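The retrieval criterion underlying Table 1 can be sketched as follows: an article is returned if its transaction contains at least 5 − x of the five items, where x is the allowed number of missing items (a simplified sketch, not the actual retrieval code).

```python
def retrieve(transactions, query, allowed_missing):
    query = set(query)
    need = len(query) - allowed_missing        # minimum number of contained items
    return [i for i, items in enumerate(transactions)
            if len(query & items) >= need]
```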

Table 1 Results for different numbers of missing items

Obviously, the selected item set is very specific in selecting articles about chemical elements, because if it has to be contained as a whole (no missing items), only one article is retrieved that is not about a chemical element. However, its recall properties are not particularly good, since it retrieves only 24 out of the 118 articles about chemical elements. The more missing items are allowed, the better the recall gets, though, of course, the specificity goes down. However, this is compensated by the fact that some articles about chemical elements are retrieved that do not reference the Chemical element article, and hence this method provides a better recall than a simple retrieval based on a reference to the Chemical element article (at the price of a somewhat lower specificity).

As a consequence, we believe that we can reasonably claim that finding approximate item sets with the SODIM algorithm can help to detect new concepts (we used a known concept only to have a standard to compare to) and to complete missing links and references for existing concepts.

7 Measuring item cover (dis)similarity

The approach described in Segond and Borgelt (2011), which we already referred to above, introduced the idea to evaluate item sets by the similarity of the covers of the contained items, where an item cover is the set of identifiers of transactions that contain the item. Under certain conditions (which can reasonably be assumed to hold in the area of parallel spike train analysis we study below), even a mere pairwise analysis can sometimes reveal the existence of strongly correlated sets of items. As an example, Fig. 6 shows an item cover distance matrix for 100 items and a database with 10,000 transactions, which was generated with the same program already used for the experiments reported in Sect. 5. Into this data set a single strongly correlated item set of size 20 was injected. Since the darker fields (apart from the diagonal, which represents the distance of items to themselves and thus necessarily represents perfectly similar, namely identical, item covers) exhibit a clear regular structure, the existence of such a group of correlated items is immediately apparent and the members of the group can easily be identified.

Fig. 6 Distance/similarity matrix of 100 items (item covers) computed with the Dice measure (see Table 3) with an injected item set of size 20 (the darker the gray, the lower the pairwise distance). The data set underlying this distance matrix is depicted as a dot display in the middle diagram of Fig. 8

In order to actually measure the (dis)similarity of (pairs of) item covers, one can employ any similarity or distance measure for sets or binary vectors (since the set of identifiers of supporting transactions can also be represented by a binary vector ranging over all transactions in the database, with entries of value 1 indicating which are the supporting transactions). Recent extensive overviews of such measures include Cha et al. (2006) and Choi et al. (2010); a selection that can reasonably be generalized to more than two sets or binary vectors can be found in Segond and Borgelt (2011). Technically, all of these measures are computed from 2 × 2 contingency tables like the one shown in Table 2. The fields of such a contingency table count how many transactions contain both items (\(n_{11}\)), only the first item (\(n_{10}\)), only the second (\(n_{01}\)), or neither (\(n_{00}\)). The row and column marginals are simply \(n_{i.} = n_{i0} + n_{i1}\) and \(n_{.i} = n_{0i} + n_{1i}\) for \(i \in \{ 0, 1 \}\), while \(n_{..} = n_{0.} + n_{1.} = n_{.0} + n_{.1}\) is the total number of transactions in the database to analyze.

Table 2 A contingency table for two items A and B
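As an illustration, the following sketch computes such a contingency table from two item covers and derives from it the Dice distance of Table 3, here written in its common form \((n_{10}+n_{01})/(2n_{11}+n_{10}+n_{01})\); the example covers are hypothetical.

```python
def dice_distance(cover_a, cover_b, n_trans):
    n11 = len(cover_a & cover_b)        # transactions containing both items
    n10 = len(cover_a - cover_b)        # transactions containing only the first
    n01 = len(cover_b - cover_a)        # transactions containing only the second
    n00 = n_trans - n11 - n10 - n01     # neither item (unused by Dice, but
                                        # needed by other measures of Table 3)
    if n11 + n10 + n01 == 0:
        return 1.0                      # both covers empty: maximal distance
    return (n10 + n01) / (2 * n11 + n10 + n01)

print(dice_distance({0, 1, 2, 5}, {1, 2, 3, 5}, 10))   # -> 2 / 8 = 0.25
```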

In principle, we could use any of the measures listed in Cha et al. (2006) and Choi et al. (2010) for our algorithm. However, since considering all of these is clearly infeasible, we decided, after some experimentation, on the subset shown in Table 3. We believe that this subset, though small, still reasonably represents the abundance of available measures, covering several different fundamental ideas, and emphasizing different aspects of the similarity of binary vectors.

Table 3 Some distance measures used for comparing two item covers

8 Finding item sets with noise clustering

In order to assess how well the distance measures listed in Table 3 are able to distinguish between vectors that contain only random noise and vectors that contain possibly relevant correlations, we evaluated them by an outlier detection method that is based on noise clustering. In this evaluation we assume that all items not belonging to a relevant item set are outliers, so that after the removal of outliers only relevant items remain. This approach can be interpreted in two ways: first, it can be seen as a preprocessing method that focuses the search towards relevant items and reduces the computational costs of the actual item set finding step by culling the item base on which it has to be executed. This is particularly important if the subsequent step, which actually identifies the item sets, is an enumeration approach, as this has, in the worst case, computational costs that are exponential in the number of items. Second, if there are only a few relevant sets of items to be detected and their structure is sufficiently clear (as can reasonably be expected, at least under certain conditions, in the area of spike train analysis), the method may already yield the desired item sets.

A usable noise-based outlier detection method has been proposed in Rehm et al. (2007). The algorithm introduces, in a manner similar to Davé (1991), a noise cloud, which has the same distance to every data point (here: item cover). A data point is assigned to the noise cloud if and only if there is no other data point in the data set that is closer to it than the noise cloud. At the beginning of the algorithm the distance to the noise cloud is 0 for all data points and thus all points belong to the noise cloud. Then the distance to the noise cloud is slowly increased. As a consequence, more and more data points fall out of the noise cloud (and may then be assigned to an actual cluster of items if clusters are formed, rather than mere outlier detection and elimination being carried out).
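A minimal sketch of this sweep, assuming a precomputed symmetric matrix D of pairwise item cover distances (for instance, Dice distances), looks as follows:

```python
import numpy as np

def non_noise_fraction(D, noise_dist):
    """Fraction of items that are not assigned to the noise cloud for a
    given noise distance: an item leaves the noise cloud as soon as some
    other item is closer to it than the noise cloud."""
    D = np.array(D, dtype=float)           # work on a copy
    np.fill_diagonal(D, np.inf)            # ignore the distance to itself
    nearest = D.min(axis=1)                # nearest-neighbour distances
    return float(np.mean(nearest < noise_dist))

# Sweeping the noise distance over [0, 1] yields curves like those in Fig. 7:
# for nd in np.linspace(0.0, 1.0, 101): print(nd, non_noise_fraction(D, nd))
```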

Plotting the number of points not belonging to the noise cloud over the distance to the noise cloud leads to diagrams like those shown in Fig. 7, which cover different data properties. The default parameters with which the underlying data sets were generated (using a neurobiological spike train simulator) are as stated in the caption, while deviating settings are stated above each diagram. Note that by varying the copy probability r we generate occurrences of item sets that miss some of the items that should be present, since it is our goal to develop a method that can find approximate (or fault-tolerant or fuzzy) item sets. Note also that all diagrams in the upper row are based on data with a copy probability of 1.0, thus generating perfect item sets. In the lower row, however, different smaller values of the copy probability were tested.

Fig. 7 Fraction of items not assigned to the noise cloud (vertical axis) plotted over the distance from the noise cloud (horizontal axis; note that all distance measures have range [0,1]) for different distance measures. The default parameters of the underlying data sets are: n = 100 items, t = 10,000 transactions, p = 0.02 probability of an item occurring in a transaction, m = 1–20 group of items potentially occurring together, c = 0.005 probability of a coincident occurrence event for the group(s) of items, r = 1 probability with which an item is actually copied from the coincident occurrence process. Deviations from these parameters are stated above the diagrams

Analyzing the steps in such curves allows us to draw some conclusions about possible clusters (relevant item sets) in the data. Item covers that have a lot of transactions in common have smaller distances for a properly chosen metric and therefore fall out of the noise cluster earlier than items that only have a few transactions in common. Hence, by clustering items together that have low pairwise distances, we can find relevant item sets, while the noise cloud captures items that are not part of any relevant item sets.

The fairly wide plateaus in the diagrams of Fig. 7 (at least for most of the distance measures) indicate that this method is quite able to identify items belonging to a relevant set by relying on pairwise distances. Even if the copy probability drops to a value as low as 0.5 (only half of the items of the correlated set are, on average, contained in an occurrence of that set) the difference between the covers of items belonging to a relevant set and those not belonging to one remains clearly discernible. Judging from the width of the plateaus, and thus the clarity of detection, we can conclude that among the tested measures \(d_{\mathrm{Dice}}\) yields the best results for a copy probability of 1. For a copy probability \(<1\), \(d_{\mathrm{Dice}}\) still yields very good results, but \(d_{\mathrm{Yule}}\) appears to be somewhat more robust, as it is less heavily affected by the imperfect data.

Unfortunately, a purely outlier-focused application of this approach raises a problem: even if all items that most likely belong to relevant item sets can be detected, the structure of these sets (which items belong to which set(s)) remains unknown, since data points are only assigned to the noise cluster or not. Figure 7 demonstrates this quite well: for the last diagram in the first row two sets of 20 items each (both with the same parameters) were injected into a total of 100 items, but there is only one step of 40 items visible in the diagram, rather than two steps of 20 items. The reason for this is, of course, that the size and parameters of the injected item sets were the same, so that they behave the same w.r.t. the noise cloud, even though their pairwise distance structure certainly indicates that not all of these items belong to one set.

In order to separate such item sets, one may consider applying a standard clustering algorithm to the non-outlier items to which the above algorithm has reduced the data set, or applying the original version of the noise-clustering algorithm of Davé (1991), with a distance for the noise cluster that can be selected by evaluating the plateaus of the curves shown in Fig. 7. However, such an approach still has the disadvantage that the result is based solely on pairwise distances and does not actually check for a true higher-order correlation of the items. In principle, the items could still be correlated only in smaller, overlapping subsets, which can lead to the same pairwise distances.

9 Sorting with non-linear mappings

To avoid having to test all subsets of items in order to detect true higher-order correlations, we developed an algorithm that is inspired by the tile-finding algorithm of Gionis et al. (2004). The basic idea of this algorithm is that relevant item sets form “lines” in a dot-display of the transactional data if they are properly sorted. An example is shown in Fig. 8: while the data set that is depicted in the top diagram contains only random noise (independent items), the diagram in the middle displays data into which a group of 20 correlated items was injected (this is the same data set from which Fig. 6 was computed). However, since the items of this set are randomly interspersed with independent items, there are no visual clues by which one can distinguish it reliably from the top diagram. If, however, the correlated items are relocated to the bottom of the diagram, as shown in the bottom diagram, clear lines become visible, by which the correlated set of items can be identified. If one also reorders the transactions, so that those in which the items of the correlated group occur together are placed on the very left, the item set becomes visible as a tile in the lower left corner.

Fig. 8 Dot-displays of two transaction databases (horizontal: transactions, vertical: items) generated with a neurobiological spike train simulator. The top diagram shows independent items, while the middle and bottom diagrams contain a group of 20 correlated items. In the middle diagram, these items are randomly interspersed with independent items, while in the bottom diagram they are sorted into the bottom rows

Based on this fundamental idea, the main steps of our algorithm consist in finding a proper reordering of the items and then applying a sequence of statistical tests in order to actually identify, in a statistically sound way, significant item sets. Note that in this respect the advantage of reordering the items is that no complete item set enumeration is needed, but a simple linear traversal of the reordered items suffices. This is important in order to reduce the number of necessary statistical tests and thus to reduce the number of false-positive results (and generally to mitigate the problem of multiple testing and the ensuing loss of control over significance).

Gionis et al. (2004) proposed the use of a concept from linear algebra to find an appropriate sorting of the items. The idea is to compute the symmetric matrix \(L_S = R_S - S\), where \(S = (s_{ij})\) is the similarity matrix of the item covers and \(R_S = (r_{ii})\) is a diagonal matrix with \(r_{ii} = \sum_j s_{ij}\). The elements \(s_{ij}\) of the matrix \(S\) may be computed, for example, as \(1 - d_{ij}\), where \(d_{ij}\) is the distance of the covers of the items \(i\) and \(j\), for which all measures shown in Table 3 can be used. In principle, however, all similarity and distance measures listed in Choi et al. (2010) may be used, at least if properly adjusted. From this matrix \(L_S\) the so-called Fiedler vector is computed, which is the eigenvector corresponding to the smallest non-zero eigenvalue. The items are then reordered according to the coordinate values that are assigned to them in the Fiedler vector.

In Gionis et al. (2004) it is argued that this is a good choice, because the Fiedler vector minimizes the stress function \(x^{\top}L_{S}x = \sum_{i<j}{s_{ij}\cdot(x_{i}-x_{j})^{2}}\) w.r.t. the constraints \(x^{\top}e = 0, \) where \(e = (1, \ldots, 1), \) and \(x^{\top}x = 1\) (as can be shown fairly easily). However, experiments we carried out with this approach showed that even when there was only one set of correlated items, reordering them according to the Fiedler vector did not sort the members of this set properly together, thus leading to highly unsatisfactory results. We therefore looked for alternatives, and since we want to place similar item covers (or item covers having a small distance) close to each other, the Sammon (1969) projection suggests itself. This algorithm maps data points from a high-dimensional space onto a low-dimensional space, usually a plane, with the goal to preserve the pairwise distances between the data points as accurately as possible. Formally, it tries to minimize the error function

$$ E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} -d_{ij}^*)^2}{d_{ij}}, $$

where \(d_{ij}\) is the distance between the \(i\)th and the \(j\)th data point in the original space, while \(d_{ij}^*\) is the distance between the projections of these data points in the target space. This error function is minimized by an iterative scheme, which may be derived from a gradient descent. The initial projections of the data points may either be initialized randomly or (preferably) computed by a principal component analysis that discards all but two dimensions (or whatever the chosen dimensionality of the target space is).

The core idea of our algorithm is to apply the Sammon projection to find a mapping of the item covers to one dimension (that is, to simple real values), so that the pairwise distances in the target dimension mimic, as well as possible, the distances of the item covers. Thus each item is mapped to a real value, according to which the items can then be reordered. Our experiments (see below) revealed that this approach yields a much better reordering of the items than the Fiedler vector and is able to reliably place items of the same set next to each other.
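A minimal sketch of such a one-dimensional mapping is shown below; for simplicity it uses plain gradient descent with a small random initialization instead of Sammon's original second-order update and a principal component initialization.

```python
import numpy as np

def sammon_1d(D, iterations=500, lr=0.1, eps=1e-9, seed=0):
    """Map items with pairwise distance matrix D to real values y such that
    the 1-D distances |y_i - y_j| approximate D as well as possible."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    c = D[np.triu_indices(n, 1)].sum() + eps     # normalization constant
    y = np.random.default_rng(seed).normal(scale=1e-2, size=n)
    for _ in range(iterations):
        diff = y[:, None] - y[None, :]           # y_i - y_j
        dstar = np.abs(diff) + eps               # projected distances
        factor = (D - dstar) / ((D + eps) * dstar)
        np.fill_diagonal(factor, 0.0)
        grad = (-2.0 / c) * (factor * diff).sum(axis=1)
        y -= lr * grad                           # gradient descent step
    return y

# order = np.argsort(sammon_1d(D))  # items reordered by their 1-D positions
```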

In a second step the reordered item list is traversed linearly, starting at the end of the list at which the more strongly correlated item pair is located. A new item is added to the currently formed set whenever a statistical independence test (for instance, a chi-square test) indicates that the item is correlated with the already collected items. For this test the new item can either be compared with the preceding item only, thus still relying exclusively on pairwise tests, or some higher-order test involving all already collected items may be performed. After a relevant item set has been discovered in this way, the contained items are removed and the process of mapping/resorting the items and traversing the resorted item list is restarted in order to find another item set. The process stops if either all items have been removed or neither the item pair at the start nor the one at the end of the item list shows significant correlation.
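The following sketch shows the pairwise variant of a single such traversal (using SciPy's chi-square test on the 2 × 2 contingency table of two item covers); the significance level, the restriction to positive association via the odds ratio, and the omission of the restart logic are simplifications and not the exact procedure used in our experiments.

```python
from scipy.stats import chi2_contingency

def correlated(cover_a, cover_b, n_trans, alpha=0.01):
    n11 = len(cover_a & cover_b)
    n10 = len(cover_a - cover_b)
    n01 = len(cover_b - cover_a)
    n00 = n_trans - n11 - n10 - n01
    _, p, _, _ = chi2_contingency([[n11, n10], [n01, n00]])
    return p < alpha and n11 * n00 > n10 * n01   # significant and positive

def extract_set(order, covers, n_trans, min_size=3):
    found = [order[0]]
    for prev, item in zip(order, order[1:]):     # compare with preceding item
        if correlated(covers[prev], covers[item], n_trans):
            found.append(item)
        else:
            break
    return found if len(found) >= min_size else []
```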

Note that the time complexity (as well as the space complexity) of this heuristic scheme is mainly governed by the size of the distance matrix and thus quadratic in the number of items. An additional factor is the number of iterations needed to compute Sammon’s non-linear mapping, which depends on specific features of the data set. The complexity of extracting the item sets from the reordered item list depends on the exact testing scheme used, but is certainly bounded by the square of the number of items. This shows that this scheme, at the price of not being exhaustive, is much more efficient than enumeration approaches, which cannot avoid an exponential complexity in the worst case.

10 Application to spike train analysis

As already mentioned in the introduction, we developed this algorithm for an application in neurobiology, namely the analysis of parallel spike trains. Finding ensembles in neural spike trains has been a vital task in neurobiology ever since Hebb (1949) suggested ensembles of neurons as the building blocks of information processing in the brain. However, with recent advancements in multi-electrode technology, which provides means to record 100 and more spike trains simultaneously, classical ensemble detection methods became infeasible due to a combinatorial explosion and a lack of reliable statistics. The algorithm described in the preceding sections tries to remedy this situation by allowing us to find relevant neuron ensembles with a limited number of statistical tests.

The data sets are obtained by analyzing the wave form of the recorded electrical potential, which allows one to detect sharp increases in the potential (so-called spikes) as well as to separate signals from multiple neurons recorded by a single electrode (Lewicki 1998; Buzsáki 2004). After this step, which is called spike sorting, a single spike train consists of a list of the exact times at which spikes have been recorded. Due to the relatively high time resolution, it is advisable to perform some time binning in order to cope with inherent jitter and to simplify the subsequent ensemble detection. That is, the continuous measurement times are discretized by assigning a time index to each spike. Typical bin sizes range from 1 to 10 ms (since the duration of an action potential is only about 1 ms), so that a 10-s recording can give rise to up to 10,000 time bins (transactions).
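A sketch of this discretization step, with hypothetical spike times (in seconds) and a 3 ms bin size:

```python
bin_size = 0.003                             # 3 ms time bins
spike_trains = {                             # hypothetical spike times in seconds
    "neuron 1": [0.0012, 0.0048, 0.0051],
    "neuron 2": [0.0049, 0.0121],
}
# Each bin index acts as a transaction identifier, so every neuron's item
# cover is simply the set of bins in which it spiked.
covers = {n: {int(t / bin_size) for t in times}
          for n, times in spike_trains.items()}
print(covers)   # {'neuron 1': {0, 1}, 'neuron 2': {1, 4}}
```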

In order to test the quality of the algorithm, we conducted several experiments on data sets that were simulated with the spike train generator already mentioned earlier (see also Berger et al. 2010). The reason why we relied on simulated data is that we need to determine how reliably the method can detect ensembles that we inject into the data, before we can apply it to real data, where we do not know whether and, if so, which ensembles are present. The tests were carried out with several different parameter sets, specifying a different number of ensembles as well as different firing rates, copy probabilities, etc. The data generation proceeds by generating independent parallel spike trains with the chosen background parameters, as well as a mother process for the coincident spiking activity, from which the ensemble neurons then copy with a specified copy probability.

The generated data sets were then analyzed with our algorithm and it was counted how many of the injected ensembles were detected. An ensemble is considered found if and only if every neuron belonging to that ensemble has been successfully identified. An ensemble is considered partially found if only a subset of its neurons has been identified by the algorithm, no matter how small this subset may be. To prevent overly small subsets from appearing in the result, the minimum ensemble size to be reported was set to three. Each configuration was tested several times to avoid misleading conclusions due to random fluctuations. The results of a first batch of experiments are shown in Table 4. Our algorithm never reported any neurons belonging to an ensemble that was not actually present in the data (no false positives). That our algorithm suppresses false positives so successfully is most likely due to the constraint that an ensemble must comprise at least three neurons to be reported.

Table 4 Experimental results for the ensemble detection method based on reordering the items/neurons with the Sammon projection

In a second batch of experiments (almost) all spike train parameters were fixed except the copy probability for the coincidences (whereas the first batch of experiments was conducted with a copy probability of 1.0). Table 5 shows the results of these tests, with copy probabilities from 0.4 to 0.85. As could be foreseen, fewer complete ensembles are detected the lower the copy probability is. This is not surprising, because with a copy probability of, say, 0.5 only 50% of the ensemble neurons participate on average in synchronous activity and thus coincidences simply disappear in the background noise. On the other hand, the rate of partially detected ensembles stays stable for a wide range of values, indicating that only a few neurons are overlooked, because they do not participate in enough coincidences. Even for a copy probability as low as 0.4, the partial detection rate stays at almost 80%. This means that even with (on average) 60% of the coincidence data missing, 80% of the ensembles were detected at least partially.

Table 5 Experimental results for the ensemble detection method based on reordering the items/neurons with the Sammon projection, different copy probabilities

11 Conclusions and future work

In this paper we presented two new algorithms for mining approximate frequent item sets, the first of which exploits subset size occurrence distributions. It efficiently computes these distributions while enumerating item sets in the usual depth-first manner. As evaluation measures we suggested a simple extended support, by which transactions containing only some of the items of a given set can still contribute to the support of this set, as well as an extension of the generalized Jaccard index that is derived from the extended support. Since the algorithm records, in an intermediate array, for each transaction how many items of the currently considered set it contains, we could also add a simple and efficient check in order to cull pseudo (or spurious) item sets from the output. We demonstrated the practical usefulness of our algorithm by applying it, combined with filtering for maximal item sets, to the 2008/2009 Wikipedia Selection for schools, where it helped to detect the concept of a chemical element despite the limited standardization of pages on such substances.

Our second algorithm uses the Sammon projection to map item covers to real values, so that the differences of these values reflect, as well as possible, the distances of the item covers. It then uses the resulting values to reorder the items and extracts relevant item sets by linearly traversing the items and testing for significant item sets. Based on the results shown in Sect. 10 we conclude that this approach produces very good results when applied to the task of detecting neuron ensembles. This holds even if the quality of the data is not ideal, regardless of whether this is due to a lossy data accumulation or missing spikes due to the underlying (biological) process. The tests performed have shown that even with (on average) 60% of the coincident spikes missing, nearly 80% of the ensembles could be detected at least partially.

We are currently trying to extend our first algorithm to incorporate item weights (weighted or uncertain transactional data, see Sect. 2), in order to obtain a method that can mine fault-tolerant item sets from uncertain or weighted data. A main problem of such an extension is that the item weights have to be combined over the items of a considered set (for instance, with the help of a t-norm). This naturally introduces a tendency for the weight of a transaction to go down even if the next added item is contained, simply because the added item is contained with a weight less than 1. If we now follow the scheme of downweighting transactions that are missing an item with a user-specified factor, we have to make sure that a transaction that contains an item (though with a low weight) does not receive a lower weight than a transaction that does not contain the item (because the downweighting factor is relatively high).

As an improvement of the second algorithm, we are exploring ways of removing the time binning and working directly on a continuous time scale. The main problem here is to find proper distance or similarity measures that capture coincident activity of the neurons even if the spike locations are affected by jitter.