Soft constraints for pattern mining

Journal of Intelligent Information Systems

Abstract

Constraint-based pattern discovery is at the core of numerous data mining tasks. Patterns are extracted with respect to a given set of constraints (frequency, closedness, size, etc.). In practice, many constraints require threshold values whose choice is often arbitrary. The difficulty is compounded when several thresholds are required and must be combined. Moreover, patterns that barely miss a threshold are not extracted, even though they may be relevant. This paper advocates introducing softness into the pattern discovery process. Using Constraint Programming, we propose efficient methods to relax threshold constraints as well as the constraints involved in pattern types such as top-k patterns and skypatterns. We show the relevance and efficiency of our approach through a case study in chemoinformatics on discovering toxicophores.
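
To make the threshold problem concrete, the following minimal sketch contrasts a hard minimum-frequency constraint with a relaxed one on a toy transaction database. It is an illustration only, not the Constraint Programming method of the paper: the data, the function names, the brute-force enumeration, and the violation measure (shortfall below the threshold, normalised by the number of transactions) are all assumptions made for this example.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]


def frequency(pattern, transactions):
    """Count the transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def violation(pattern, transactions, min_freq):
    """Assumed violation measure: shortfall below min_freq, divided by
    the number of transactions so that it lies in [0, 1]."""
    shortfall = max(0, min_freq - frequency(pattern, transactions))
    return shortfall / len(transactions)


def mine(transactions, min_freq, tolerance=0.0):
    """Brute-force enumeration of all non-empty itemsets whose violation
    does not exceed the tolerance (tolerance 0.0 is the hard constraint)."""
    items = sorted(set().union(*transactions))
    result = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            if violation(pattern, transactions, min_freq) <= tolerance:
                result.append(pattern)
    return result


hard = mine(TRANSACTIONS, min_freq=4)
soft = mine(TRANSACTIONS, min_freq=4, tolerance=0.25)
print(len(hard), len(soft))
```

With a threshold of 4, the hard constraint keeps only the three single items, while a tolerance of 0.25 also admits the three pairs, each of which has frequency 3 and so misses the threshold by a single transaction.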

Notes

  1. For the frequency measure, \(max_m = |\mathcal{T}|\); for the size measure, \(max_m = |\mathcal{I}|\).

  2. More information on the implementation of the above constraint-based pattern mining task using Constraint Programming techniques is given in Guns et al. (2011) and Khiari et al. (2010).

  3. http://www.gecode.org/

  4. The closed constraint is used to reduce pattern redundancy. Indeed, closed skypatterns make up an exact condensed representation of the whole set of skypatterns (Soulet et al. 2011).

  5. Lethal concentration of a substance required to kill half the members of a tested population after a specified test duration.

  6. A fragment denominates a connected part of a chemical structure containing at least one chemical bond.

  7. European Chemicals Bureau http://ecb.jrc.ec.europa.eu/documentation/ now http://echa.europa.eu/.

  8. A chemical Ch contains an item A if Ch supports A, and A is a frequent subgraph of \(\mathcal{T}\).

  9. The rigidity of a subgraph is equal to \(2e / (v(v - 1))\), where e (resp. v) is the number of its edges (resp. vertices).

  10. Ratio of the number of solutions containing a toxicophore to the total number of solutions.

  11. SMILES code is a line notation for describing the structure of chemical molecules: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html.
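
As a small illustration of the rigidity measure in note 9, the sketch below computes it from edge and vertex counts. It assumes the formula is read as 2e divided by v(v − 1), i.e. the fraction of possible vertex pairs that are actually connected; the function name and the example fragments are hypothetical and not taken from the paper.

```python
def rigidity(num_edges, num_vertices):
    """Rigidity of a subgraph, read here as 2e / (v(v - 1)).

    A value of 1.0 means every pair of vertices is joined by an edge.
    """
    if num_vertices < 2:
        raise ValueError("rigidity needs at least two vertices")
    return 2 * num_edges / (num_vertices * (num_vertices - 1))


# Hypothetical fragments, given as (edges, vertices).
triangle = rigidity(3, 3)      # three atoms, all pairwise bonded
chain = rigidity(3, 4)         # four atoms in a line
six_ring = rigidity(6, 6)      # six atoms in a single ring
print(triangle, chain, six_ring)
```

Under this reading, a fully connected fragment reaches the maximum value of 1, and the value drops as the fragment becomes more chain-like or ring-like.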

References

  • Bajorath, J., & Auer, J. (2006). Emerging chemical patterns: a new methodology for molecular classification and compound selection. Journal of Chemical Information and Modeling, 46, 2502–2514.

  • Bistarelli, S., & Bonchi, F. (2007). Soft constraint based pattern mining. Data and Knowledge Engineering, 62(1), 118–137.

  • Börzsönyi, S., Kossmann, D., Stocker, K. (2001). The skyline operator. In Proceedings of the 17th International Conference on Data Engineering (ICDE’01) (pp. 421–430). IEEE Computer Society.

  • De Raedt, L., Guns, T., Nijssen, S. (2008). Constraint programming for itemset mining. In KDD’08 (pp. 204–212). ACM.

  • De Raedt, L., & Zimmermann, A. (2007). Constraint-based pattern set mining. In Proceedings of the 7th SIAM international conference on data mining. Minneapolis, MN: SIAM.

  • Garofalakis, M.N., Rastogi, R., Shim, K. (1999). SPIRIT: Sequential pattern mining with regular expression constraints. In Proceedings of the 25th International Conference on Very Large Data Bases (pp. 223–234).

  • Gavanelli, M. (2002). An algorithm for multi-criteria optimization in CSPs. In F. van Harmelen (Ed.), ECAI (pp. 136–140). IOS Press.

  • Guns, T., Nijssen, S., De Raedt, L. (2011). Itemset mining: a constraint programming perspective. Artificial Intelligence, 175(12–13), 1951–1983.

  • Hüllermeier, E. (2005). Fuzzy methods in machine learning and data mining: status and prospects. Fuzzy Sets and Systems, 156(3), 387–406.

  • Jin, W., Han, J., Ester, M. (2004). Mining thick skylines over large databases. In PKDD’04 (pp. 255–266).

  • Ke, Y., Cheng, J., Yu, J.X. (2009). Top-k correlative graph mining. In SDM (pp. 1038–1049).

  • Khiari, M., Boizumault, P., Crémilleux, B. (2010). Constraint programming for mining n-ary patterns. In CP’10. LNCS (Vol. 6308, pp. 552–567). Springer.

  • Kung, H.T., Luccio, F., Preparata, F.P. (1975). On finding the maxima of a set of vectors. Journal of the ACM, 22(4), 469–476. doi:10.1145/321906.321910.

  • Lin, X., Yuan, Y., Zhang, Q., Zhang, Y. (2007). Selecting stars: The k most representative skyline operator. In ICDE 2007 (pp. 86–95). IEEE Computer Society Press.

  • Mannila, H., & Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3), 241–258.

  • Matousek, J. (1991). Computing dominances in \(E^n\). Information Processing Letters, 38(5), 277–278.

  • Ng, R.T., Lakshmanan, V.S., Han, J., Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. In Proceedings of ACM SIGMOD’98 (pp. 13–24). ACM Press.

  • Novak, P.K., Lavrac, N., Webb, G.I. (2009). Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10, 377–403.

  • Papadias, D., Tao, Y., Fu, G., Seeger, B. (2005). Progressive skyline computation in database systems. ACM Transactions on Database Systems, 30(1), 41–82.

  • Papadias, D., Yiu, M.L., Mamoulis, N., Tao, Y. (2008). Nearest neighbor queries in network databases. In Encyclopedia of GIS (pp. 772–776).

  • Petit, T., Régin, J., Bessière, C., Puget, J. (2000). An original constraint based approach for solving over constrained problems. In CP’2000. LNCS (Vol. 1894, pp. 543–548). Springer.

  • Poezevara, G., Cuissart, B., Crémilleux, B. (2011). Extracting and summarizing the frequent emerging graph patterns from a dataset of graphs. Journal of Intelligent Information Systems, 37(3), 333–353.

  • Soulet, A., Raïssi, C., Plantevit, M., Crémilleux, B. (2011). Mining dominant patterns in the sky. In 11th IEEE Int. Conf. on Data Mining series (ICDM 2011) (pp. 655–664).

  • Steuer, R.E. (1992). Multiple criteria optimization: Theory, computation and application. Radio e Svyaz, Moscow (504 pp) (in Russian)

  • Tan, K.L., Eng, P.K., Ooi, B.C. (2001). Efficient progressive skyline computation. In VLDB (pp. 301–310).

  • Ugarte, W., Boizumault, P., Loudni, S., Crémilleux, B. (2012). Soft threshold constraints for pattern mining. In J.G. Ganascia, P. Lenca, J.M. Petit (Eds.), Discovery science. Lecture notes in computer science (Vol. 7569, pp. 313–327). Springer.

  • Wang, J., Han, J., Lu, Y., Tzvetkov, P. (2005). TFP: an efficient algorithm for mining top-k frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17(5), 652–664.

Author information

Corresponding author

Correspondence to Willy Ugarte.

About this article

Cite this article

Ugarte, W., Boizumault, P., Loudni, S. et al. Soft constraints for pattern mining. J Intell Inf Syst 44, 193–221 (2015). https://doi.org/10.1007/s10844-013-0281-4

  • DOI: https://doi.org/10.1007/s10844-013-0281-4
