Abstract
Issues related to association mining have received attention, especially those aiming to discover interesting patterns and to facilitate the search for them. A promising approach in this context is the application of clustering in the pre-processing step. In this paper, eleven metrics are proposed to provide an assessment procedure that supports the evaluation of this kind of approach. The metrics were proposed on the basis of a subjective evaluation. They are important since they provide criteria to: (a) analyze the methodologies, (b) identify their positive and negative aspects, (c) carry out comparisons among them and, therefore, (d) help users select the most suitable solution for their problems. In addition, the metrics lead users to think about aspects related to their problems and provide a flexible way to solve them. Experiments were carried out to show how the metrics can be used and to illustrate their usefulness.
Notes
- 1.
In this work, it is assumed that a pattern is interesting if it is relevant and/or useful to the user – rules having high support and/or high confidence are not necessarily interesting to the user.
- 2.
Any other criteria could be adopted to select the \(h\)-top interesting rules.
- 3.
In this work, the labeling method is assumed to be the one presented in [22].
- 4.
- 5.
In this work, each dendrogram obtained by Ward was cut considering each one of the values of \(k\).
- 6.
- 7.
Rule set obtained through a traditional process.
- 8.
Rule set obtained from partitioned data.
References
Wu, X., Kumar, V.: The Top Ten Algorithms in Data Mining. Chapman & Hall/CRC, Boca Raton (2009)
Dadaser-Celik, F., Celik, M., Dokuz, A.S.: Associations between stream flow and climatic variables at Kizilirmak river basin in Turkey. Glob. NEST J. 14(3), 354–361 (2012)
Xiao, G.: Association rules algorithm in bank risk assessment. In: Lee, J. (ed.) Advanced Electrical and Electronics Engineering. LNEE, vol. 87, pp. 675–681. Springer, Heidelberg (2011)
Nuwangi, S.M., Oruthotaarachchi, C.R., Tilakaratna, J.M.P.P., Caldera, H.A.: Usage of association rules and classification techniques in knowledge extraction of diabetes. In: Proceedings of the 6th International Conference on Advanced Information Management and Service, pp. 372–377 (2010)
Rajasekar, U., Weng, Q.: Application of association rule mining for exploring the relationship between urban land surface temperature and biophysical/social parameters. Photogram. Eng. Remote Sens. 75(3), 385–396 (2009)
Changguo, Y., Nianzhong, W., Tailei, W., Qin, Z., Xiaorong, Z.: The research on the application of association rules mining algorithm in network intrusion detection. In: Proceedings of the 1st International Workshop on Education Technology and Computer Science, vol. 2, pp. 849–852 (2009)
Koh, Y.S., Pears, R.: Rare association rule mining via transaction clustering. In: 7th Australasian Data Mining Conference. CRPIT, vol. 87, pp. 87–94 (2008)
Maquee, A., Shojaie, A.A., Mosaddar, D.: Clustering and association rules in analyzing the efficiency of maintenance system of an urban bus network. Int. J. Syst. Assur. Eng. Manage. 3(3), 175–183 (2012)
Farajian, M.A., Mohammadi, S.: Mining the banking customer behavior using clustering and association rules methods. Int. J. Ind. Eng. Prod. Res. 21(4), 239–245 (2010)
Fan, L.: Research on classification mining method of frequent itemset. J. Convergence Inf. Technol. 5(8), 71–77 (2010)
Plasse, M., Niang, N., Saporta, G., Villeminot, A., Leblond, L.: Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set. Comput. Stat. Data Anal. 52(1), 596–613 (2007)
de Carvalho, V.O., dos Santos, F.F., Rezende, S.O.: Metrics to support the evaluation of association rule clustering. In: Bellatreche, L., Mohania, M.K. (eds.) DaWaK 2013. LNCS, vol. 8057, pp. 248–259. Springer, Heidelberg (2013)
Aggarwal, C.C., Procopiuc, C., Yu, P.S.: Finding localized associations in market basket data. IEEE Trans. Knowl. Data Eng. 14(1), 51–62 (2002)
Wang, K., Xu, C., Liu, B.: Clustering transactions using large items. In: 8th International Conference on Information and Knowledge Management, pp. 483–490 (1999)
Yun, C.-H., Chuang, K.-T., Chen, M.-S.: An efficient clustering algorithm for market basket data based on small large ratios. In: 25th International Computer Software and Applications Conference on Invigorating Software Development, pp. 505–510 (2001)
Wang, J., Karypis, G.: Summary: efficiently summarizing transactions for clustering. In: 4th IEEE International Conference on Data Mining, pp. 241–248 (2004)
Yang, L.: Pruning and visualizing generalized association rules in parallel coordinates. IEEE Trans. Knowl. Data Eng. 17(1), 60–70 (2005)
D’Enza, A.I., Palumbo, F., Greenacre, M.: Exploratory data analysis leading towards the most interesting binary association rules. In: 11th Symposium on Applied Stochastic Models and Data Analysis, pp. 256–265 (2005)
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recogn. 46(1), 243–256 (2013)
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2/3), 107–145 (2001)
Carvalho, V.O., Biondi, D.S., Santos, F.F., Rezende, S.O.: Labeling methods for association rule clustering. In: Proceedings of the 14th International Conference on Enterprise Information Systems, pp. 105–109 (2012)
Padua, R., Carvalho, V.O., Serapião, A.B.S.: Labeling association rule clustering through a genetic algorithm approach. In: Proceedings of the 17th East European Conference on Advances in Databases and Information Systems, pp. 45–52 (2013)
Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right objective measure for association analysis. Inf. Syst. 29(4), 293–313 (2004)
Xu, R., Wunsch, D.: Clustering. Computational Intelligence. IEEE Press/Wiley, New York (2008)
Carvalho, V.O., Santos, F.F., Rezende, S.O., Padua, R.: PAR-COM: a new methodology for post-processing association rules. Lect. Notes Bus. Inf. Process. 102, 66–80 (2012)
Carvalho, V.O., Santos, F.F., Rezende, S.O.: Post-processing association rules with clustering and objective measures. In: Proceedings of 13th International Conference on Enterprise Information Systems, vol. 1, pp. 54–63 (2011)
Acknowledgments
We wish to thank Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) (process numbers: 2010/07879-0 and 2011/19850-9) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) (process number DS-6345378/D) for the financial support. We also thank the reviewers for their valuable contributions.
Appendix: Questionnaire
Introduction. Many issues related to association rule mining have received attention in recent years, especially those aiming to discover the interesting patterns of the domain and to facilitate the search for them. One approach related to this issue is the application of clustering in the pre-processing step. In this case, as shown in the figure below, the data are initially grouped into \(n\) groups (\(GD_1\), \(GD_2\), ..., \(GD_n\)). From this initial clustering, the rules are then extracted within each group (cluster), yielding \(n\) groups of rules (\(GR_1\), \(GR_2\), ..., \(GR_n\)). The aim is to obtain potentially interesting rules that would not be extracted from unpartitioned data sets because they lack sufficient support, without overloading the user with a large number of patterns. To discover these same patterns from unpartitioned data sets, the user would have to set the minimum support to a low value, causing a rapid increase in the number of rules. Instead, the data are first split and the rules are extracted within each group, so that each group expresses its own associations without interference from the other groups, which contain different association patterns. Distinct methodologies have been proposed to enable this process. Each methodology uses a different combination of clustering algorithms and similarity measures to obtain the groups of rules.
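The process described above can be illustrated with a minimal sketch. It is not the implementation used by any of the methodologies: the clustering step is replaced by two hand-made groups, rules are restricted to single-item antecedents and consequents, and the names mine_rules and mine_partitioned, the toy data and the thresholds are all assumptions made for illustration.

```python
from itertools import combinations


def mine_rules(transactions, min_sup, min_conf):
    """Return rules (a, b), read "a => b", that meet both thresholds.

    Simplification (assumption): only single-item antecedents and
    consequents are considered, so this is not a full Apriori miner.
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            item_count[i] = item_count.get(i, 0) + 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = set()
    for (a, b), c in pair_count.items():
        if c / n < min_sup:          # support of the pair {a, b}
            continue
        for x, y in ((a, b), (b, a)):
            if c / item_count[x] >= min_conf:   # confidence of x => y
                rules.add((x, y))
    return rules


def mine_partitioned(groups, min_sup, min_conf):
    """Mine each data group GD_i on its own, giving one rule group GR_i each."""
    return [mine_rules(g, min_sup, min_conf) for g in groups]


# Toy data: two hand-made groups standing in for the output of a clustering step.
gd1 = [["bread", "milk"]] * 8
gd2 = [["wine", "cheese"]] * 2

rs_t = mine_rules(gd1 + gd2, min_sup=0.5, min_conf=0.8)       # unpartitioned data
gr = mine_partitioned([gd1, gd2], min_sup=0.5, min_conf=0.8)  # GR_1, GR_2
rs_p = set().union(*gr)

print(sorted(rs_p - rs_t))  # rules that only the partitioned process finds
```

In this toy data the wine and cheese pair occurs in only 2 of the 10 transactions, so it falls below the minimum support on the unpartitioned data but is frequent inside its own group.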
It is in this context that this evaluation should be done. Some scenarios that can occur in this scope are shown below; your contribution is requested for a better understanding of the problem. In all cases, it is assumed that two rule sets are available in order to evaluate the presented scenarios: one extracted through the traditional process, RsT (see note 7), and one extracted through clustering (the process described above), RsP (see note 8). The examples presented below are merely illustrations of the scenarios and, therefore, should not be evaluated considering the knowledge they express. Based on this evaluation, the aim is to propose an assessment procedure to support the analysis of the existing methodologies.
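As a rough illustration of how the two rule sets can be compared, the sketch below separates rules found in both sets, rules found only in RsP, and rules extracted in more than one group of RsP. It does not reproduce the metrics proposed in the chapter; the function name compare_rule_sets, the string encoding of rules and the example rules are assumptions.

```python
from collections import Counter


def compare_rule_sets(rs_t, rule_groups):
    """Split rules by where they occur.

    rs_t        -- set of rules from the traditional process (RsT)
    rule_groups -- list of rule sets, one per cluster; their union is RsP
    """
    # Number of groups in which each rule of RsP was extracted.
    counts = Counter(r for group in rule_groups for r in set(group))
    rs_p = set(counts)
    repeated = {r for r, c in counts.items() if c > 1}
    return {
        "in_both_once": (rs_p & rs_t) - repeated,
        "in_both_repeated": rs_p & rs_t & repeated,
        "only_rsp_once": (rs_p - rs_t) - repeated,
        "only_rsp_repeated": (rs_p - rs_t) & repeated,
        "only_rst": rs_t - rs_p,
    }


# Hypothetical rules, encoded as plain strings.
rs_t = {"a=>b", "c=>d", "e=>f"}
groups = [{"a=>b", "g=>h"}, {"c=>d", "g=>h", "a=>b"}, {"i=>j"}]

report = compare_rule_sets(rs_t, groups)
for name, rules in report.items():
    print(name, sorted(rules))
```

Keeping the once and repeated cases apart mirrors the distinction the first two questions ask respondents to consider.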
Scenarios
-
1.
In your opinion, observing “Scenario-A” (Table 6), how do you rate the occurrence in RsP of rules obtained in RsT (cases in green and orange)? Both cases, green and orange, represent rules obtained in both sets, but the rules in orange are extracted more than once in RsP across the groups. If you need to distinguish the green cases from the orange cases, please indicate it.
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
2.
In your opinion, observing “Scenario-A” (Table 6), how do you rate the non-occurrence in RsT of rules obtained in RsP (cases in purple and red)? Both cases, purple and red, represent rules obtained only in RsP, but the rules in red are extracted more than once in RsP across the groups. If you need to distinguish the purple cases from the red cases, please indicate it.
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
For questions “3” to “6”, consider that, for each rule set, RsP and RsT, only the subset containing the \(n\) most interesting rules of the domain is shown. These subsets can be identified, for example, automatically, based on a set of objective measures, assuming that objective measures are suitable for finding the most interesting knowledge of a given domain.
-
3.
In your opinion, observing “Scenario-B” (Table 7), how do you rate the non-occurrence in RsT of some (or none) of the \(n\) most interesting rules in RsP (cases in blue)? Notice that the blue rules belong only to the RsP set.
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
4.
In your opinion, observing “Scenario-B” (Table 7), how do you rate the reverse scenario, that is, the non-occurrence in RsP of some (or none) of the \(n\) most interesting rules in RsT (cases in orange)? Notice that the orange rules belong only to the RsT set.
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
5.
In your opinion, observing “Scenario-B” (Table 7), how do you rate the existing intersection between the \(n\) most interesting rules in RsP and the \(n\) most interesting rules in RsT (cases in red)?
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
6.
In your opinion, how would you rate the \(n\) most interesting rules in RsP being spread over a small number of clusters?
( ) desirable ( ) indifferent ( ) not desirable
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
7.
In your opinion, should the number of rules extracted through clustering, compared to the traditional process, be:
( ) low ( ) average ( ) high
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
8.
In your opinion, only in relation to RsP, do you consider that the clustering process should, as a consequence, enable each cluster to express a distinct topic of the domain?
( ) yes ( ) indifferent ( ) no
a. Do you think it is important to consider this scenario in an assessment procedure to be used in the presented context?
( ) yes ( ) no
b. Would you like to make any comment about the scenario (advantage, disadvantage, etc.)?
-
9.
Can you identify other scenario(s), not previously explored, that can be relevant to the presented context? Give an example of the scenario(s) that you identified.
a. Do you think it is important to consider this (these) scenario(s) in an assessment procedure to be used in the presented context?
( ) yes ( ) no
-
10.
If you want to leave any comment/observation, please do so below.
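The situations raised in questions 3 to 6 can be made concrete with a short sketch. It only illustrates the set comparisons involved and is not the set of metrics proposed in the chapter. The interestingness scores, the helper names top_n and top_n_report, and the choice of a single score per rule in RsP (a rule may be extracted in several groups) are assumptions.

```python
def top_n(scores, n):
    """Return the n highest-scoring rules; ties are broken by rule text."""
    ranked = sorted(scores, key=lambda r: (-scores[r], r))
    return set(ranked[:n])


def top_n_report(scores_t, scores_p, rule_groups, n):
    """Compare the n most interesting rules of RsT and RsP.

    scores_t, scores_p -- dicts mapping each rule of RsT / RsP to a score
                          from some objective measure (assumed given)
    rule_groups        -- list of rule sets, one per cluster of RsP
    """
    top_t = top_n(scores_t, n)
    top_p = top_n(scores_p, n)
    return {
        # Question 3: top rules of RsP that do not occur in RsT at all.
        "top_rsp_absent_from_rst": top_p - set(scores_t),
        # Question 4: top rules of RsT that do not occur in RsP at all.
        "top_rst_absent_from_rsp": top_t - set(scores_p),
        # Question 5: rules that are among the top n of both sets.
        "shared_top": top_p & top_t,
        # Question 6: indices of the clusters that hold top rules of RsP.
        "clusters_with_top_rsp": [
            i for i, g in enumerate(rule_groups) if top_p & set(g)
        ],
    }


# Hypothetical scores and groups.
scores_t = {"a=>b": 0.9, "c=>d": 0.8, "e=>f": 0.4}
scores_p = {"a=>b": 0.95, "g=>h": 0.9, "c=>d": 0.3, "i=>j": 0.2}
groups = [{"a=>b", "g=>h"}, {"c=>d"}, {"i=>j"}]

summary = top_n_report(scores_t, scores_p, groups, n=2)
for name, value in summary.items():
    print(name, sorted(value))
```

In this example both top rules of RsP fall in the first cluster, which is the kind of concentration question 6 asks about.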
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
de Carvalho, V.O., dos Santos, F.F., Rezende, S.O. (2015). Metrics for Association Rule Clustering Assessment. In: Hameurlain, A., Küng, J., Wagner, R., Bellatreche, L., Mohania, M. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XVII. Lecture Notes in Computer Science, vol 8970. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46335-2_5
DOI: https://doi.org/10.1007/978-3-662-46335-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46334-5
Online ISBN: 978-3-662-46335-2
eBook Packages: Computer Science (R0)