Cluster-based information retrieval using pattern mining

This paper addresses the problem of responding to user queries by fetching the most relevant object from a clustered set of objects. It addresses the common drawbacks of cluster-based approaches and targets fast, high-quality information retrieval. For this purpose, a novel cluster-based information retrieval approach is proposed, named Cluster-based Retrieval using Pattern Mining (CRPM). This approach integrates various clustering and pattern mining algorithms. First, it generates clusters of objects that contain similar objects. Three clustering algorithms based on k-means, DBSCAN (Density-based spatial clustering of applications with noise), and Spectral are suggested to minimize the number of shared terms among the clusters of objects. Second, frequent and high-utility pattern mining algorithms are performed on each cluster to extract the pattern bases. Third, the clusters of objects are ranked for every query. In this context, two ranking strategies are proposed: i) Score Pattern Computing (SPC), which calculates a score representing the similarity between a user query and a cluster; and ii) Weighted Terms in Clusters (WTC), which calculates a weight for every term and uses the relevant terms to compute the score between a user query and each cluster. Irrelevant information derived from the pattern bases is also used to deal with unexpected user queries. To evaluate the proposed approach, extensive experiments were carried out on two use cases: the documents and tweets corpus. The results showed that the designed approach outperformed traditional and cluster-based information retrieval approaches in terms of the quality of the returned objects while being very competitive in terms of runtime.


Introduction
Data mining [1,2] is an interdisciplinary field that deals with the extraction of information from a large set of data and its transformation into an easily interpretable structure for further use. Information retrieval (IR) is the task of retrieving the information that is relevant to a user query (represented by a set of terms) from a collection of objects [3]. Several variants of the IR problem have been considered in the literature. For instance, document information retrieval (DIR) [4] is the first IR problem that was dealt with. In this problem, the objects are documents and the terms are the keywords therein. Hashtag retrieval (HR) [5] is another IR problem, in which the objects are tweets and the terms are hashtags. Solutions to the IR problem use a similarity search approach, which has polynomial complexity and requires high computational time in real-world scenarios. For instance, if we consider the Football corpus containing 3,000,000 tweets, 90,660 hashtags, and 1,000,000 user queries, then the number of possible matches is as much as 27 × 10^16. One way to address this problem is to use clustering techniques [1,6,7,8], and many cluster-based retrieval approaches have been investigated [9][10][11][12].
The key idea in all these approaches is to group the objects of a collection into several clusters such that similar objects are grouped in the same cluster, and then to perform the search only on the clusters deemed relevant to a given user query. In general, cluster-based retrieval approaches can be classified into two categories: i) query-dependent and ii) query-independent. In query-dependent methods, given a query, an initial list of objects is first retrieved from the entire collection. This list is clustered using some clustering technique, and the created clusters are ranked with respect to the query. It should be noted that query-dependent clusters can also be used to enrich document representations [13]. In query-independent approaches, clusters are created offline from all the objects in the corpus independently of the queries, and then, given a user query, the best cluster is selected. Although this approach is fast, it is restricted to answering conjunctive queries. The existing cluster-based approaches are, in general, much faster than traditional approaches when applied to large collections, but they often retrieve objects of lower quality. The reason for this is that inefficient ranking procedures are used, which rank the clusters for a query using only information about centroids and nearest neighbors. This paper investigates a pattern mining model for solving cluster-based IR problems.

Motivation
Consider the five objects illustrated in Table 1. The second column represents the set of keywords with their frequencies in each object, obtained after preprocessing the objects. For instance, (Data,4) in the first row means that there are four occurrences of the term "Data" in the first object. At first glance, the keywords "Data", "Mining", and "Knowledge" appear together in the first three objects, which represent 60% of the observations, but the three keywords appear with different frequencies. The keywords "Data" and "Mining" are observed with high frequencies (at least 2) in all cases, whereas the keyword "Knowledge" is observed with low frequency (1 in all cases). Studying the correlation of the relevant patterns in the set of keywords may enhance information retrieval performance. If the terms "Data" and "Mining" are assumed to be relevant, is the pattern {"Data", "Mining", "Knowledge"} also relevant, given that the term "Knowledge" appears only once in every case? Is the term "Engineering" relevant? It does appear four times in the fourth object; however, it appears in only 20% of the whole set of objects. Moreover, it is judicious to deal with the first three objects, which concern data mining, separately from the other objects. Several questions arise in this context. How can objects be efficiently split into groups (clusters)? How can the relevant patterns be extracted with different frequencies for each cluster? How can the relevant patterns be distinguished from the other patterns? Finally, how can the relevant patterns of each cluster be used to efficiently respond to user queries?

Contribution
Attempting to answer the above-mentioned questions, this paper proposes a new approach, called Cluster-based Retrieval using Pattern Mining (CRPM). To the best of our knowledge, this is the first work that considers frequent and high-utility pattern mining in cluster-based information retrieval problems. The major contributions of this paper can be summarized as follows:
1. Three different algorithms (k-means, DBSCAN, and Spectral) are proposed to split the objects database into clusters such that similar objects are grouped in the same cluster. The aim is to minimize the number of shared terms among the clusters of objects.
2. Two transformation approaches (Boolean and weighted) are proposed to adapt the pattern mining algorithms to the search for relevant patterns in each cluster of objects. The Boolean approach transforms the set of objects into a transaction database without considering the frequency of the terms in the objects, whereas the weighted approach considers the frequency of the terms in the transformation process.
3. Two pattern mining algorithms are adapted to search for the relevant patterns of each cluster of objects. The first algorithm adapts the FP-Growth algorithm [14], which uses the Boolean transformation to discover frequent patterns for each cluster of objects. The second algorithm adapts UP-Growth [15], which uses the weighted transformation to discover high-utility patterns for each cluster of objects. A pattern-base construction is applied to each cluster to store relevant patterns and irrelevant objects. The relevant patterns of each cluster of objects are stored in the pattern base, whereas the irrelevant objects are derived from the pattern base of each cluster to handle unexpected user queries.
4. Two novel ranking strategies are presented: i) Weighted Terms in Clusters (WTC) and ii) Score Pattern Computing (SPC). These enable ranking clusters of objects for a query using the discovered frequent and high-utility patterns. The WTC strategy ranks clusters based on the weights of each term in the user query and the relevant patterns of each cluster. The SPC strategy ranks clusters based on the relevance of the patterns to the user query.
5. To demonstrate the usefulness of the proposed solution, intensive experiments were carried out using two case studies: i) document information retrieval and ii) hashtag retrieval. The results showed that the designed approach outperformed traditional and cluster-based information retrieval approaches in the quality of the returned objects while being very competitive in terms of runtime.

Outline
The remainder of this paper is organized as follows: Section 2 presents the main concepts of the information retrieval process. Section 3 reviews related work, including traditional approaches for information retrieval, cluster-based retrieval, and pattern mining approaches. Section 4 introduces the proposed approach and its main components. Section 5 presents the experimental study and its results. Finally, Section 6 draws conclusions and discusses opportunities for future work.

Background
Information retrieval is the process of finding objects that are relevant to the user query. Some formal definitions of the concepts used throughout the paper are given below. The importance of a term in a given object is determined by the Term Frequency-Inverse Document Frequency (TF-IDF), which is defined below. Definition 2 (TF-IDF) The TF-IDF of the term T_i in the object j is calculated as the product of the term frequency of T_i in j and the inverse document frequency of T_i over the whole collection. It should be noted that f_{T_i,j} is the frequency of T_i in j.
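As an illustration only, the following minimal sketch computes TF-IDF weights for a toy collection. It assumes one common variant (term frequency normalized by the object length, and the logarithm of the ratio between the number of objects and the number of objects containing the term); this variant, the function name tf_idf, and the toy objects are assumptions made here and are not necessarily the exact formulation used in the paper.

```python
import math
from collections import Counter

def tf_idf(objects):
    """Return one {term: weight} dict per object.

    Assumed variant: TF = f / (number of term occurrences in the object),
    IDF = log(m / number of objects containing the term).
    """
    m = len(objects)
    counts = [Counter(obj) for obj in objects]
    # Number of objects in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (f / total) * math.log(m / doc_freq[t]) for t, f in c.items()}
        )
    return weights

# Toy objects (hypothetical), each given as a list of preprocessed terms.
objects = [
    ["data", "data", "mining", "knowledge"],
    ["data", "mining", "mining"],
    ["software", "engineering"],
]
w = tf_idf(objects)
print(w[0])
```

Under this variant, a term that occurs in every object receives a weight of zero, which reflects the intuition that such a term does not help to discriminate between objects.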
The relevance of the objects to the given query using the ranking function is defined in the following.

Definition 3 (Ranking Function) Consider a function f that assigns a score in R^+ to every pair formed by an object i and a query Q_j ∈ Q; the result is denoted f(i, Q_j). The ranking function Rank_f ranks the objects for each given query Q_j according to the scores obtained by f.
Traditional IR solutions need to scan all the objects for every user query. This process is highly time-consuming, particularly for a large number of objects and queries. To deal with this problem, cluster-based retrieval solutions have been widely studied in the last decade [16][17][18].
Definition 4 (Cluster-Based Retrieval) Consider a set of k clusters G = {G_1, G_2, ..., G_k}, where each G_i is represented by its set of |G_i| objects, and consider a set of queries Q = {Q_1, Q_2, ..., Q_l}. Cluster-based retrieval aims at retrieving one or more clusters in G in response to every query in Q. The task is to match the query against clusters of objects instead of individual objects and to rank the clusters based on their similarity to the query.
Solutions to cluster-based retrieval aim at reducing the runtime of the information retrieval process. Instead of processing the whole object database, only the clusters relevant to the user query are explored.

Earlier IR methods
Although several models have been suggested, the cosine similarity function [19] is the most used ranking function in the literature. i) In the Boolean model [20], both objects and queries are represented with Boolean operators, and the ranking function is the intersection between every object and the given query. The result of the ranking is the set of objects that maximizes this intersection. ii) In the vectorial model [21], both objects and queries are represented by numeric vectors. Generally, the numeric values of each vector are determined by the TF-IDF values (see Def. 2 for more details). iii) In the probabilistic model [22], the probabilities of each term in both queries and objects are computed. Several works have been developed in these directions for solving the basic IR problems. Wang et al. [23] proposed a graph representation describing the relevant features by defining the co-occurrence and the literal meaning of objects. Luo et al. [24] investigated a data-driven approach that uses structural information as relevant features in an ad hoc scenario. To improve the accuracy of the resulting hashtags, Bansal et al. [25] proposed a semantic approach. The set of hashtags is first segmented, and then each group is linked to Wikipedia to enrich the semantic search. Selvalakshmi et al. [26] proposed a new semantic information retrieval system for enhancing the relevancy score. This system integrates a new fuzzy-rough-set-based feature selection algorithm and a latent Dirichlet allocation based semantic IR algorithm. Yadav [27] proposed a medical image retrieval system. The visually relevant features of the input images are first derived using image descriptors, and different weights are then allocated to each feature to retrieve the images relevant to the image query. Sheetrit et al. [28] explored passage-based information to improve document retrieval effectiveness. They investigated the use of learning-to-rank-based document retrieval methods that utilize a ranking of passages produced in response to the query. Deghan et al. [29] proposed a diversification strategy to improve the IR process. A probabilistic model was investigated to address the vocabulary gap between the queries and the documents. Ji et al. [30] proposed a biomedical information retrieval system for healthcare decision-making that visualizes the neural document embedding as a configurable document map and enables reasoning for different user queries.
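The Boolean and vectorial models described above can be illustrated with a minimal sketch. Objects and queries are represented here as sparse {term: weight} dictionaries; the function names (cosine, boolean_overlap, rank) and the toy weights are assumptions introduced for illustration and do not come from the cited works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def boolean_overlap(obj, query):
    """Boolean-model score: number of terms shared by the object and the query."""
    return len(set(obj) & set(query))

def rank(objects, query, score):
    """Indices of the objects sorted by decreasing score for the query."""
    return sorted(range(len(objects)),
                  key=lambda i: score(objects[i], query),
                  reverse=True)

# Toy weighted objects (hypothetical); the weights could be TF-IDF values.
objs = [
    {"data": 1.0, "mining": 1.0},
    {"data": 1.0, "engineering": 2.0},
    {"software": 1.0},
]
query = {"data": 1.0, "mining": 1.0}

print(rank(objs, query, cosine))
print(rank(objs, query, boolean_overlap))
```

Both scoring functions scan every object for every query, which is the cost that cluster-based retrieval tries to avoid.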

Cluster-based IR
Raiber et al. [16] presented a Markov random field model to rank document clusters. A hyper-graph composed of objects and queries is first built, and then the model can be used to estimate the probability that a cluster is relevant to a given query. However, this approach is inefficient when handling multiple queries due to its complexity. Cai et al. [31] proposed applying a ranking function with a composed model to transform each term into a K-dimensional vector. Every dimension is measured by considering the rank distribution of the term in the discovered clusters. Levi et al. [32] proposed a cluster-based approach to retrieve relevant objects. Their approach considers objects that are the nearest-neighbors of many other objects to be more likely to be relevant. It then calculates the overlap between two clusters as the ratio between the number of objects shared by the clusters to the number of objects in each cluster. Naini et al. [17] proposed the IC-GLS approach for cluster-based information retrieval. This method first groups documents from a collection using the k-means algorithm and then finds diversified and heterogeneous documents by applying a similarity measure. Although this strategy allows better exploration of the document space, the quality of the returned documents is low for homogeneous queries. Jin et al. [11] designed a hybrid indexing method for cluster-based retrieval. After grouping documents into clusters, an index structure is built and a representative document is selected for each cluster. Bhopale et al. [18] integrated swarm intelligence and clustering. The collection of documents is first decomposed into several groups using a bio-inspired K-Flock clustering algorithm. A cosine similarity based probabilistic model is then used to retrieve query-specific documents from clusters based on the matching scores between the queries and the knowledge extracted from the clusters. Sheetrit et al. 
[33] proposed the use of the focused retrieval algorithm, which ranks the documents' passages by their presumed relevance to a query. A learning-to-rank approach was implemented for transforming the cluster ranking into a passage ranking. Tam et al. [34] proposed an end-to-end approach for knowledge-grounded response generation in dialog system technology challenges. The k-means algorithm was adopted to dynamically group similar partial hypotheses at each decoding step under a fixed beam budget. Moreover, a language model was investigated to prune meaningless responses.

Pattern mining
Frequent pattern mining (FPM) [35][36][37] is a common and fundamental part of knowledge discovery in data mining. It has been generalized to many kinds of patterns, such as frequent sequential patterns [38], frequent episodes [39], and frequent subgraphs [40]. The goal of FPM is to discover all the patterns whose support is no lower than a given minimum support threshold. A pattern that satisfies this threshold is called a frequent pattern; otherwise, it is called an infrequent pattern. Studies of FPM seldom consider databases with weighted terms, and none of them consider the utility feature. Utility pattern mining (UPM) considers both the statistical significance and the profit significance, whereas FPM aims at discovering the interesting patterns that frequently co-occur in databases while giving all of them the same significance.
However, in practice, these frequent patterns do not show the business value and impact. In contrast, UPM aims at identifying the useful patterns that appear together and also bring high profits to the merchants [41]. In UPM, managers can investigate the historical databases and extract the set of patterns that have high combined utilities. Such problems cannot be tackled by the support/frequency-based FPM framework.
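The difference between support and utility can be illustrated with a minimal sketch. It enumerates frequent itemsets by brute force rather than with FP-Growth or UP-Growth, it assumes an external utility of 1 for every item, and the toy transactions and function names are hypothetical.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support):
    """Brute-force enumeration of itemsets with relative support >= min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= set(t)) / n
            if support >= min_support:
                result[frozenset(cand)] = support
                found = True
        if not found:
            break  # anti-monotonicity: no larger itemset can be frequent
    return result

def utility(pattern, transaction):
    """Utility of a pattern in one {item: quantity} transaction (external utility = 1)."""
    if all(i in transaction for i in pattern):
        return sum(transaction[i] for i in pattern)
    return 0

# Toy quantitative transactions (hypothetical): item -> number of occurrences.
db = [
    {"data": 4, "mining": 2, "knowledge": 1},
    {"data": 2, "mining": 3, "knowledge": 1},
    {"data": 3, "mining": 2, "knowledge": 1},
    {"software": 2, "engineering": 4},
    {"software": 1},
]

freq = frequent_patterns(db, min_support=0.6)
u_dm = sum(utility({"data", "mining"}, t) for t in db)
u_k = sum(utility({"knowledge"}, t) for t in db)
print(len(freq), u_dm, u_k)
```

In this toy database, {data, mining} and {knowledge} have the same support, but their total utilities differ, which is the kind of distinction that a purely support-based framework cannot express.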
Numerous studies have incorporated pattern-mining techniques to solve the IR problem. Fung et al. [42] used frequent itemsets to construct a hierarchical tree, which represents the collection of documents. Yu et al. [43] dynamically generated different topics of the collected documents using only the closest frequent itemsets. Their approach uses an intelligent structure that allows the hierarchical construction of the links between each k-itemset and the (k-1)-itemsets. Zhong et al. [44] improved the comprehension of the user's request using a pattern-mining algorithm. The taxonomy of the patterns is discovered by applying a closed-pattern-based algorithm to the training set. This technique reduces the noise between the user's request and the set of collected documents. Zingla et al. [45] combined external hashtag resources and association rule mining for retrieving the most relevant texts from microblogs. Association-rule extraction is first applied to the microblog text collection to generate the candidates. The original query is then transformed using the candidates and external knowledge sources. The score between the query and the set of candidates is finally determined using an explicit semantic analysis measure. Belhadi et al. [46] incorporated pattern-mining approaches to improve the accuracy of retrieving the relevant information and to speed up the search. In their approach, the set of tweets is first transformed into a set of transactions by considering two different strategies (trivial and temporal). After that, the set of relevant patterns is discovered and then used as a knowledge-based system for finding the relevant tweets for users' queries through a similarity search process.

Discussion
From this short literature review, solutions to the IR problem can be divided into three categories: i) Solutions that explore the whole collection of objects. These solutions provide good quality results but require high computation time. ii) Solutions that first divide the objects into different clusters and then rank the resulting clusters. These solutions only explore the clusters relevant to the query, and thus they are faster than the first category. However, they yield low-quality responses because they consider only centroid and neighborhood information in the ranking process. iii) Solutions that explore pattern mining in the search process. These solutions are accurate but also time-consuming, notably because they search for the relevant patterns in the whole collection of objects. Some hybrid methods that combine clustering and pattern mining have been developed in the literature for document clustering [47,48]. This is completely different from the approach followed in this paper, where clustering and pattern-mining techniques are incorporated into the ranking and searching steps to improve both the quality and the runtime.

CRPM: Cluster-based IR using pattern mining
This section presents the proposed CRPM approach, which integrates both clustering and pattern mining in solving the information retrieval problem. Figure 1 shows an overview of the CRPM approach. It consists of two main steps: i) Pre-processing first groups the whole set of objects into clusters of similar objects and then discovers the relevant patterns and deduces the irrelevant objects from each cluster. This includes data collection, clustering, pattern mining, and pattern bases construction. This step is run only once and can be considered to be a pre-processing step for the CRPM algorithm. ii) Query processing fetches the objects that are relevant to the user query using the two components created in the previous step (the discovered patterns and the irrelevant objects). This step benefits from the knowledge extracted in the previous step. Many queries can be handled by considering the most relevant clusters using an exact search, the less relevant clusters using an approximate search, and the irrelevant objects. A detailed explanation of each step is given in the following subsections.
Fig. 1 The CRPM framework

Pre-processing
This includes four main stages:
1. Data collection. This stage creates the corpus of objects to be retrieved from documents, tweets, and so on. Natural language processing (NLP) [49] may be incorporated to refine the extraction results by removing stop words and special characters and by unifying dates, Uniform Resource Locators (URLs), letter cases (upper/lower), and so on. Additional filtering may be used for particular data forms by using a special API, for example, the Twitter Java API in the case of the hashtag retrieval problem (e.g., removing unnecessary hashtags from user posts). In summary, the collection step involves two stages: cleaning and filtering. The cleaning stage consists of removing extra spaces, expanding abbreviations, stemming, and removing stop words, whereas the filtering stage selects the set of terms, ignoring the irrelevant terms. For instance, #BLOGGER and #blogger represent the same hashtag written in different styles. These hashtags are unified into the same hashtag, #blogger.
2. Clustering. The set of objects is grouped into a set of clusters of similar objects G = {G_1, G_2, ..., G_k} by using decomposition methods. The goal of the decomposition techniques is to minimize the number of terms shared between the clusters, where T(G_i) denotes the set of terms of the cluster G_i. One way to solve this problem is to use state-of-the-art clustering algorithms [50], such as k-means [51], spectral clustering [52], and DBSCAN [53].
3. Pattern mining.
Mining frequent patterns. This uses the Boolean transformation to organize the set of objects in a Boolean transaction database. Let m be the number of objects. Every object is represented by one transaction, in which item t is set to 1 if the term T_t belongs to the object; otherwise, it is set to 0. All the frequent patterns in every cluster G_i ∈ G are extracted using the FP-Growth algorithm [14]. The algorithm generates all the possible patterns that satisfy the minimum support constraint. This process is repeated for all the clusters in G.
Mining high-utility patterns. This uses the weighted transformation. Every object in a cluster G_i is used to create a transaction in the transactional database D^i, such that every term becomes an item. If a term T^i_{j,r} belongs to the j-th object of G_i, then the corresponding item I^i_{j,r} is added to the transaction D^i_j, which represents that object. Furthermore, the internal utility of I^i_{j,r} is set to the weight ω^i_{j,r} of the term T^i_{j,r} in that object, where ω^i_{j,r} is defined as the number of occurrences of the term in the object. The external utility of all the items is set to 1, which indicates that all the terms have the same importance in all the clusters of objects. All the high-utility patterns in every cluster G_i ∈ G are extracted using the UP-Growth algorithm [15]. The algorithm generates all the possible high-utility patterns that satisfy the minimum utility constraint. This process is repeated for all the clusters in G.

4. Pattern bases construction. A pattern base PB^i is designed for each cluster G_i. Every PB^i is composed of two parts: i) PB_R^i contains the set of relevant (frequent or high-utility) patterns obtained by applying the mining process to the set of objects; and ii) PB_{\bar{R}}^i contains the set of irrelevant information derived from PB_R^i and the set of objects. It should be noted that the proposed approach uses the irrelevant patterns for user queries that do not match the discovered relevant patterns. In other words, the irrelevant patterns are explored if and only if the satisfaction rate of the search is low. More details about using the irrelevant patterns are given in Sec. 3.2.2.
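The Boolean and weighted transformations used by the mining stages above can be sketched as follows. This is a minimal illustration under the stated conventions (internal utility equal to the number of occurrences, external utility equal to 1); the function names and the toy cluster are assumptions introduced here, and the resulting databases would still have to be passed to a frequent or high-utility pattern miner.

```python
from collections import Counter

def boolean_transaction(obj_terms, vocabulary):
    """1 if the term occurs in the object, else 0 (term frequency is ignored)."""
    present = set(obj_terms)
    return [1 if t in present else 0 for t in vocabulary]

def weighted_transaction(obj_terms):
    """Map each term to its internal utility, i.e. its number of occurrences.

    The external utility of every item is taken to be 1, so it is not stored.
    """
    return dict(Counter(obj_terms))

# Toy cluster (hypothetical): each object is a list of preprocessed terms.
cluster = [
    ["data", "data", "mining"],
    ["data", "knowledge"],
]
vocabulary = sorted({t for obj in cluster for t in obj})

boolean_db = [boolean_transaction(o, vocabulary) for o in cluster]
weighted_db = [weighted_transaction(o) for o in cluster]

print(vocabulary)
print(boolean_db)
print(weighted_db)
```

The Boolean database loses the fact that "data" occurs twice in the first object, whereas the weighted database keeps it as an internal utility.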

Query processing
This step aims at finding the relevant objects for each user query Q_l. Instead of scanning all the objects, only the set of patterns in KS is used. It is performed in two stages.
1. Ranking. Every pattern p_j^i has a weight value ω_j^i. The weights are equal to 1 when mining frequent patterns, and they are equal to the utility value u_j^i when mining high-utility patterns. The user request is a set of terms Q_l = {t_1, t_2, ..., t_r}, where r is the number of distinct terms in Q_l. The output is a ranking of the set of clusters G with respect to the user query Q_l. Two strategies are developed to rank the clusters: WTC and SPC. The WTC strategy assigns a weight to every term in the set of patterns of every cluster, and then the terms are ranked in decreasing order of weight. The score of every cluster for the query Q_l is calculated using these weights, and it is used to rank the clusters. The SPC strategy computes the score of each cluster G_i for a query Q_l using the patterns in G_i. For instance, in the context of the document information retrieval problem, consider that the user query is Q_l = {Data, Mining} and that the following high-utility patterns have been found in two clusters G_1 and G_2: PB_R^1 = {({Data Clusters}, 2), ({Data Structures}, 1), ({Data Model}, 1)}, and PB_R^2 = {({Data Mining}, 2), ({Data Clusters}, 2)}. The WTC gives WTC(G_1, Q_l) = 5 and WTC(G_2, Q_l) = 6, while the SPC gives SPC(G_1, Q_l) = 4 and SPC(G_2, Q_l) = 6. In this example, the cluster ranking for the query Q_l is G_2 followed by G_1 for both strategies. As shown, the second cluster is considered to be more relevant than the first one for this request. This is because the terms data mining and data clusters are frequent patterns in the documents of the second cluster.
2. Searching. The clustering process may generate similar clusters, that is, objects may be close to multiple clusters. When considering the cluster-based retrieval approach, only the cluster of objects similar to the user query is retrieved, which may cause some relevant objects to be missed. To deal with this issue, the searching step benefits from the ranking results and explores the clusters of objects according to the ranking function WTC or SPC. The set of ranked clusters according to the user query Q_l is denoted by G_l. The search starts by exploring the objects of the first cluster in G_l. The satisfaction rate is then computed from the number of relevant objects that satisfy the user. The algorithm stops if the satisfaction rate reaches the minimum satisfaction rate; otherwise, the same process is repeated for the second cluster in G_l, and so on until all the clusters are explored. In the case where the satisfaction rate remains below the minimum satisfaction rate, the irrelevant objects in PB_{\bar{R}} are explored according to the ranked clusters in G_l. It should be noted that the minimum satisfaction rate is the threshold that represents the relevance rate suggested by the user. A similarity measure is calculated for each selected object and the query Q_l.
Algorithm 1 presents the pseudo-code of CRPM. The following variables are considered as input: the set of objects with their terms, the set of user queries, the minimum support and minimum utility thresholds for the pattern mining process, a variable indicating the type of transformation of the input database, and the user satisfaction rate. The set of relevant objects of each user query is considered as output. In pre-processing, the set of objects is grouped into clusters of similar objects in Line 4. From Lines 5 to 15, the pattern base is generated using both the Boolean and the weighted transformation. In query processing, the clusters are ranked for each user query from Lines 17 to 20. The searching process uses the ranked clusters to find the relevant objects for each user query in Line 21. The set of relevant objects of all the user queries is returned in Line 23.
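The ranking and searching stages can be illustrated with a minimal sketch. The SPC score below is one reading that reproduces the values of the worked example (each pattern contributes its weight multiplied by the number of query terms it contains); it is an assumption rather than the paper's stated formula. The satisfaction rate is simplified to a count of relevant objects, the is_relevant callback and the object identifiers are hypothetical, and the fallback to the irrelevant objects is not covered.

```python
def spc_score(pattern_base, query):
    """Sum over patterns of weight * (number of query terms in the pattern)."""
    q = set(query)
    return sum(weight * len(q & set(pattern)) for pattern, weight in pattern_base)

def rank_clusters(pattern_bases, query):
    """Cluster identifiers sorted by decreasing SPC score."""
    return sorted(pattern_bases,
                  key=lambda g: spc_score(pattern_bases[g], query),
                  reverse=True)

def search(ranked, cluster_objects, is_relevant, min_satisfaction):
    """Explore clusters in rank order until enough relevant objects are found."""
    retrieved = []
    for g in ranked:
        retrieved.extend(o for o in cluster_objects[g] if is_relevant(o))
        if len(retrieved) >= min_satisfaction:
            break  # the user is satisfied; lower-ranked clusters are skipped
    return retrieved

# Pattern bases of the worked example: (pattern, weight) pairs per cluster.
pattern_bases = {
    "G1": [(("Data", "Clusters"), 2), (("Data", "Structures"), 1), (("Data", "Model"), 1)],
    "G2": [(("Data", "Mining"), 2), (("Data", "Clusters"), 2)],
}
query = ["Data", "Mining"]

ranked = rank_clusters(pattern_bases, query)
print(spc_score(pattern_bases["G1"], query), spc_score(pattern_bases["G2"], query))
print(ranked)

# Hypothetical cluster contents and relevance judgments.
cluster_objects = {"G1": ["d1", "d2"], "G2": ["d3", "d4", "d5"]}
relevant = {"d2", "d3", "d5"}
found = search(ranked, cluster_objects, lambda o: o in relevant, min_satisfaction=2)
print(found)
```

With a threshold of two relevant objects, only the top-ranked cluster is explored in this toy case; a higher threshold would make the search continue with the next cluster.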
In terms of complexity, pre-processing is the most time-consuming task because it includes several loops and several scans of the database. In contrast, query processing contains only two loops and needs to scan only the pattern base of each cluster relevant to the user query and, in the worst case, the set of irrelevant objects PB_{\bar{R}}^*. Pre-processing is performed only once, independently of the number of user queries |Q|. The cost of query processing in the worst-case scenario is |Q| × k × |PB_R^*| × |PB_{\bar{R}}^*|. Traditional retrieval algorithms need |Q| × m × |T| operations, where m is the number of objects, and k × |PB_R^*| × |PB_{\bar{R}}^*| ≪ m × |T| for real-world cases.

Performance evaluation
Extensive experiments were carried out to evaluate the performance of the proposed approach (CRPM) using benchmark IR collections. Two case studies are presented in this section, DIR and HR. To evaluate the retrieved objects, the mean average precision (MAP) and the F-measure were used. These are widely used metrics for the evaluation of IR systems. They are defined as follows:
1. F-measure. It combines the precision and recall measures. It is given by F-measure = (2 × Precision × Recall) / (Precision + Recall), where Recall = |RRO| / |RO| is the ratio of the number of retrieved relevant objects (RRO) to the total number of relevant objects (RO), whereas Precision = |RRO| / |REO| is the ratio of the number of RROs to the total number of retrieved objects (REO).
2. MAP. It is computed from the values Precision_i, where Precision_i is the precision at rank i, that is, only the first i ranked objects are considered while the remaining objects are ignored.
Table 2 presents the data used in these experiments, which are categorized into two groups according to the problem dealt with (DIR and HR). These databases vary from small to large and from sparse to dense. Thus, some databases contain a high number of objects, some contain a high number of terms, and some others contain both a high number of objects and a high number of terms.
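For illustration, the two metrics can be computed as in the following minimal sketch. It uses the standard definitions (F-measure as the harmonic mean of precision and recall, and average precision as the sum of the precision values at the ranks of the relevant objects divided by the number of relevant objects); the paper's exact MAP normalization is assumed rather than confirmed, and the toy ranking is hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = list(retrieved), set(relevant)
    rro = sum(1 for o in retrieved if o in relevant)  # retrieved relevant objects
    precision = rro / len(retrieved) if retrieved else 0.0
    recall = rro / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def average_precision(ranked, relevant):
    """Sum of Precision_i at the ranks of relevant objects, divided by |relevant|."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for i, o in enumerate(ranked, start=1):
        if o in relevant:
            hits += 1
            total += hits / i  # precision over the first i ranked objects
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy ranking (hypothetical): "e" is relevant but never retrieved.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

p, r, f = precision_recall_f(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, f, ap)
```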

Parameter settings
Several simulations were performed to select the best parameters in the mining process, which were set as follows:
1. Clustering Algorithm. Three algorithms were used: DBSCAN, k-means, and spectral clustering. We varied the number of neighborhoods for DBSCAN, and the number of clusters for k-means and spectral clustering was 1 and 10, respectively. The best scenarios are summarized in Table 3. It should be noted that the number of clusters suggested to users is the best value returned in terms of F-measure with the execution time fixed to 10 minutes.
2. Pattern Mining Algorithm. Two tasks were used: mining frequent patterns and mining high-utility patterns. We varied the minimum support of the frequent pattern mining task and the relative minimum utility of the high-utility pattern mining task from 0.1 to 1.0. The best scenarios are summarized in Table 3.
3. Searching Step. Two strategies were used: WTC and SPC. The best scenarios are summarized in Table 3.
In the remaining experiments, the best parameters that are given in Table 3 were used.
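The parameter selection described above amounts to a sweep over candidate values that keeps the one with the best F-measure. A minimal sketch follows, assuming a hypothetical `evaluate` callable that runs the pipeline for one setting and returns its F-measure. Treating the 10-minute limit as a per-run budget is an interpretation made for illustration, not a detail confirmed by the text.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def select_best(values: Iterable[float],
                evaluate: Callable[[float], float],
                time_budget_s: float = 600.0) -> Optional[Tuple[float, float]]:
    """Return (value, F-measure) of the best setting whose run fits the time budget."""
    best: Optional[Tuple[float, float]] = None
    for value in values:
        start = time.perf_counter()
        score = evaluate(value)  # hypothetical: runs the pipeline, returns the F-measure
        elapsed = time.perf_counter() - start
        if elapsed > time_budget_s:
            continue  # discard settings that exceed the time limit
        if best is None or score > best[1]:
            best = (value, score)
    return best


# Candidate grids matching the ranges reported above.
cluster_counts = range(1, 11)                      # 1 to 10 clusters
min_supports = [i / 10 for i in range(1, 11)]      # 0.1 to 1.0
```

The same helper can be reused for the minimum support and minimum utility thresholds by passing `min_supports` instead of `cluster_counts`.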

CRPM versus State-of-the-Art DIR: Accuracy
This experiment used CACM, TREC, Webdocs, and Wikilinks. Table 4 gives the results of a comparison with Pattern Term Mining (PTM) [44], Clustering Greedy Local Search (C-GLS) [17], and Probability Latent Semantic Analysis (PLSA) [54]. The results reveal that for medium collections, such as CACM and TREC50K, the three approaches (PTM, C-GLS, and PLSA) outperformed CRPM. However, for big collections, such as TREC with more than 100,000 documents, Webdocs, and Wikilinks, CRPM outperformed the three other approaches. These results confirm the benefits of using data mining techniques to explore collections of documents. A Z-test was carried out on the results of CRPM and the state-of-the-art algorithms (reported in Table 4) using the document corpora. The test was set up as follows:
1. The F-measure and the mean average precision were viewed as normal variables.
2. Each document corpus was divided into 10 partitions such that each partition contained 10% of the whole corpus. Every partition represented an observation, and 80 different observations were generated.
3. The result of each partition was considered as a sample.
Six estimators ($E_1$ through $E_6$) were used in the analysis. The first three estimators were designated for the F-measure performance, and the last three estimators were designated for the MAP performance. A detailed description of these estimators is given below. The normality of the algorithms was first checked with the Shapiro-Wilk test, with the significance level ($\alpha$) set to 1%. The results of the test indicated that the null hypothesis $H_0$ (the algorithms follow a normal distribution) cannot be rejected. Hence, there is no significant evidence of non-normality, and the algorithms are treated as following a normal distribution. A Z-test was then used with $\alpha$ = 5% to compare the algorithms. XLSTAT showed that $E_1$ and $E_4$ gave higher values than the other estimators, which means that CRPM was statistically better than the other algorithms in terms of the F-measure and MAP.

CRPM versus State-of-the-Art DIR: Runtime
Figure 2(a) compares the runtime of CRPM with the state-of-the-art DIR algorithms on different corpora using one user query as input. The results reveal that CRPM had a slightly higher time overhead than the other approaches. This is explained by the fact that CRPM needs more time in the pre-processing step, in which both the clustering and pattern mining processes are performed. Moreover, the ranking step in CRPM is more complex than in the existing cluster-based DIR algorithms; for example, C-GLS uses only information about centroids. To confirm that the time overhead is due mainly to the pre-processing step (which matters only for initialization and for handling the first query), another experiment was carried out (the results are reported in Fig. 2(b)).
The largest corpus in the previous collection (Wikilinks) was considered, which contained 40,000,000 documents and 3,000,000 terms. By varying the number of user queries from 1 to 1,000, the runtime of CRPM stabilized at 20,000 s after about 100 queries, whereas the runtime of the other approaches exceeded 45,000 s. This confirms that CRPM was more than twice as fast as the baseline approaches when serving a set of successive queries.

CRPM versus State-of-the-Art HR: Accuracy
Table 5 compares the quality of the tweets retrieved by CRPM with that of the baseline approaches (Hashtagger+ [55], ATR-Vis [56], and SAX* [57]). The results reveal that for medium tweet collections, such as Sewol ferry, Wikipedia2, and Wikipedia3, the baseline approaches outperformed CRPM. However, for large tweet collections, such as Football, TREC2011, and Nelson Mandela, CRPM outperformed all the approaches. This is explained by the fact that in the case of medium and large tweet collections, each group of users' tweets shared relevant hashtags, which helps the search process. In the case of the Sewol ferry corpus, the number of hashtags was too small compared to the number of users' tweets. By performing clustering on this corpus, a low number of shared hashtags was determined within each cluster of tweets. As a result, a low number of relevant patterns was discovered from the clusters, which reduced the accuracy of the proposed approach. In conclusion, CRPM performed better on rich tweet collections in which a high number of hashtags was observed. A Z-test was conducted on the results of CRPM and the state-of-the-art algorithms reported in Table 5 using the tweet corpora. The test was set up as follows:
1. The F-measure and the mean average precision of each algorithm were viewed as normal variables.
2. Each tweet corpus was divided into 10 partitions such that each partition contained 10% of the whole corpus. Every partition represented an observation, which generated 80 different observations.
3. The result of each partition was considered as a sample.
Six estimators ($E_1$ through $E_6$) were used in the analysis. The first three estimators were designated for the F-measure performance, and the last three estimators were designated for the MAP performance. A detailed description of these estimators is given as follows: First, the normality of the four algorithms was checked using the Shapiro-Wilk test, which is available in the XLSTAT tool. To this end, the null hypothesis $H_0$ and the alternative hypothesis $H_\alpha$ were defined as follows:
$H_0$: The algorithms follow a normal distribution.
$H_\alpha$: The algorithms do not follow a normal distribution.
The significance level ($\alpha$) was set to 1%. The results of the Shapiro-Wilk test indicated that $H_0$ could not be rejected. Hence, the algorithms are treated as following a normal distribution. Afterward, a Z-test was used with $\alpha$ = 5% to compare the algorithms. XLSTAT showed that $E_1$ and $E_4$ gave higher values than the other estimators, which means that CRPM was statistically better than the other algorithms in terms of the F-measure and MAP.
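The partition-and-compare procedure used in both case studies can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the six estimators are not reproduced, and the one-sided two-sample form of the Z-test is an assumption, since the text does not state which variant XLSTAT was configured to run.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import List, Sequence, Tuple


def partition(corpus: Sequence, n_parts: int = 10) -> List[Sequence]:
    """Split a corpus into n_parts contiguous slices of (almost) equal size."""
    size, rem = divmod(len(corpus), n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        end = start + size + (1 if p < rem else 0)  # spread the remainder over the first slices
        parts.append(corpus[start:end])
        start = end
    return parts


def two_sample_z_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """One-sided test of H0: mean(a) <= mean(b). Returns (z, p_value)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)  # upper-tail probability under the standard normal
    return z, p_value
```

In use, each algorithm would be evaluated on every partition, the per-partition F-measure (or MAP) values would form the samples `a` and `b`, and the difference would be declared significant when `p_value` falls below the chosen $\alpha$ (5% above).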

CRPM versus State-of-the-Art HR: Runtime
Similarly to the first case study, these results confirm that CRPM required more time to serve one query (Fig. 3(a)) but considerably outperformed all the baselines when serving a series of queries (Fig. 3(b)). This is for the same reasons that were described in detail earlier (overhead due to pre-processing). This confirms that the runtime of CRPM is largely independent of the number of user queries: it requires more time in the pre-processing step but explores fewer relevant patterns in the searching process.
Table note: Bold entries signify the best F-measure and Mean Average Precision of the compared solutions.
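The trade-off between a one-time pre-processing cost and a cheaper per-query cost can be made concrete with a simple linear cost model. The model and all numbers in the usage note are hypothetical and are not measurements from Fig. 2 or Fig. 3.

```python
from math import floor
from typing import Optional


def total_time(preprocess_s: float, per_query_s: float, n_queries: int) -> float:
    """Total runtime under an assumed linear model: fixed setup plus per-query work."""
    return preprocess_s + per_query_s * n_queries


def break_even_queries(pre_a: float, per_a: float,
                       pre_b: float, per_b: float) -> Optional[int]:
    """Smallest number of queries at which approach A is strictly faster than B.

    Returns None when A is not cheaper per query, so it never catches up.
    """
    if per_a >= per_b:
        return None
    return max(0, floor((pre_a - pre_b) / (per_b - per_a)) + 1)
```

For example, with a hypothetical 100 s of pre-processing and 1 s per query for approach A, against no pre-processing and 5 s per query for approach B, A becomes strictly faster from the 26th query onward.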

Comparisons on big data
In this experiment, CRPM was compared with PDRM [18] and JPD-LDR [28] using the two big collections: Wikilinks for information retrieval and Football for hashtag retrieval. PDRM integrates swarm intelligence and clustering techniques to solve the information retrieval problem, whereas JPD-LDR integrates deep learning and decomposition techniques to satisfy user queries. Figure 4 shows the runtime, MAP, and F-measure of CRPM, PDRM, and JPD-LDR on the Wikilinks and Football corpora. As shown, the runtime of all the approaches increased with the number of clusters, as did the MAP and the F-measure, which converged after about 10 clusters. Moreover, CRPM outperformed PDRM and JPD-LDR both in terms of computational time and the quality of the returned objects (MAP and F-measure). However, this result reveals that the proposed solution is still very sensitive to the number of clusters. Automatically fixing the number of clusters remains a challenging issue for this study, and using several runs to find the best number of clusters is not efficient. To address this, one possible direction is to learn the best number of clusters from some useful properties of the training corpus, such as the number of objects, the number of terms, and the distribution of terms. This can help to automatically estimate the best number of clusters for a new corpus.

Discussion
The lessons obtained from the application of pattern mining to cluster-based IR are summarized in this section.
In this work, different clustering methods were used to group similar objects into clusters. The choice of the best clustering algorithm for a real scenario depends on the shape of the data. If the data contain dense regions, as in Webdocs and Wikilinks, then the DBSCAN algorithm is more suitable for finding the optimal clusters while both minimizing the number of shared terms among clusters and maximizing the number of terms within each cluster. If the data are heterogeneous and cover a large space, then the k-means algorithm is suitable, as was the case with the CACM, TREC, and Sewol ferry datasets. Spectral clustering, in turn, is suitable for heterogeneous data with high-density regions, as was the case with Nelson Mandela and Wikipedia. Furthermore, our experimental evaluation indicates that frequent and high-utility pattern mining can find interesting patterns in each cluster of objects. The pattern mining process helps discover relevant patterns that can be used to rank and select the most relevant cluster(s) for a user query. The frequent patterns provide information about the number of occurrences of terms in the objects of every cluster, whereas high-utility patterns represent the frequency of terms in every object for every cluster. The experimental results demonstrate that this approach outperformed the state-of-the-art information retrieval approaches in terms of the quality of the retrieved objects. They also show that the approach required a relatively long time in the pre-processing step. Nevertheless, this is not a shortcoming, given that thorough pre-processing enables flexibility regarding the number of user queries and thus reduces the processing time of future queries. In real scenarios, pre-processing is performed only for the first query, and its results are then used to serve a set of queries as long as there is no significant change in the databases.
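The distinction drawn above between frequent patterns (in how many objects of a cluster a set of terms occurs) and high-utility patterns (how often the terms occur inside those objects) can be illustrated on a toy cluster. The sketch below is a simplification under stated assumptions: the data are invented, the brute-force enumeration stands in for a real mining algorithm, the utility is taken to be the summed within-object term counts, and `cluster_score` is a hypothetical stand-in for a pattern-based ranking score, not the SPC or WTC formula.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

# Toy cluster (invented): each object maps a term to its count in that object.
cluster: List[Dict[str, int]] = [
    {"data": 3, "mining": 1},
    {"data": 1, "mining": 2, "pattern": 1},
    {"pattern": 4},
]


def support(pattern: Iterable[str], objects: List[Dict[str, int]]) -> float:
    """Fraction of objects that contain every term of the pattern."""
    pattern = tuple(pattern)
    return sum(1 for obj in objects if all(t in obj for t in pattern)) / len(objects)


def utility(pattern: Iterable[str], objects: List[Dict[str, int]]) -> int:
    """Assumed utility: summed term counts over objects containing the whole pattern."""
    pattern = tuple(pattern)
    return sum(sum(obj[t] for t in pattern)
               for obj in objects if all(t in obj for t in pattern))


def frequent_patterns(objects: List[Dict[str, int]], min_support: float,
                      max_len: int = 2) -> Dict[Tuple[str, ...], float]:
    """Brute-force enumeration for toy data; real miners prune the search space."""
    terms = sorted({t for obj in objects for t in obj})
    found = {}
    for k in range(1, max_len + 1):
        for pattern in combinations(terms, k):
            s = support(pattern, objects)
            if s >= min_support:
                found[pattern] = s
    return found


def cluster_score(query: Iterable[str], patterns: Dict[Tuple[str, ...], float]) -> float:
    """Hypothetical score: sum of supports of the patterns fully covered by the query."""
    q = set(query)
    return sum(s for p, s in patterns.items() if set(p) <= q)
```

On this toy cluster, the pattern `("data", "mining")` has a support of 2/3 and a utility of 7. Ranking several clusters for a query would then consist of computing such a score against each cluster's pattern base and keeping the highest-scoring cluster(s).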
This work is a typical example of the application of pattern mining techniques to IR. The literature calls for this type of research, particularly in the era of big data, because increasingly large amounts of data are becoming available in different domains, such as constraint programming [58], business intelligence [59], computational intelligence [60], the Internet of Things (IoT), and smart environments [61]. To the best of our knowledge, this is the first work to consider the use of frequent and high-utility pattern mining in cluster-based IR when dealing with large collections of objects. In general, porting a pure data-mining technique into a specific application domain requires methodological refinement and adaptation. In our context, this adaptation was implemented using frequent and high-utility pattern mining. This approach aligns with an emerging trend in search engine design that shifts the intelligence required for identifying useful patterns in a massive and heterogeneous collection of objects toward proactively suggesting areas of interest for further investigation. For instance, this approach could be adopted to retrieve several types of information, such as documents, hashtags, images, and videos. In addition, it would be interesting to investigate how high-performance computing can improve the runtime performance of such an approach.

Conclusion
A novel cluster-based information retrieval approach was proposed in this paper, which benefits from frequent and high-utility pattern mining to extract useful patterns from the object collection. In this approach, a pre-processing step is first performed to find frequent and high-utility patterns in each cluster of objects. To rank clusters according to the user's request, two strategies were proposed: i) WTC and ii) SPC. Extensive experiments were carried out on benchmark document and tweet collections to assess the performance of the designed approach. The results showed that the proposed approach benefited from the extracted patterns, which considerably improved the quality of the returned objects. The proposed approach was compared with several state-of-the-art information retrieval approaches on benchmark datasets. The results indicated that the proposed approach outperformed the other approaches in terms of the quality of the returned objects and that it was competitive in terms of runtime, particularly when dealing with many user queries.
In future work, it would be interesting to generalize the proposed approach to other types of objects, such as images and videos. Moreover, the integration of other frequent and high-utility pattern-based approaches into the proposed framework can be further explored. Another promising direction would be to discover other types of knowledge, such as maximal patterns, rare patterns, and closed patterns, that could be used to improve accuracy. Other data mining and machine learning techniques, such as deep learning, could also be used to group objects and find relevant patterns in a collection of objects. Finally, a parallel version of the proposed approach could be designed that relies on high-performance computing tools, such as MapReduce and Spark, to improve the mining performance.
Funding Open Access funding provided by SINTEF AS.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.