Abstract
Partially ordered feature sets appear naturally in many classification settings with structured input instances, for example, when the data instances are graphs and a feature tests whether a specific substructure occurs in the instance. Since such features are partially ordered according to an “is substructure of” relation, the information in those datasets is stored in an intrinsically redundant form. We investigate how this redundancy affects the capacity control behavior of linear classification methods. From a theoretical perspective, it can be shown that the capacity of this hypothesis class does not decrease for worst case distributions. However, if the data generating distribution assigns lower probabilities to instances in the lower levels of the hierarchy induced by the partial order, the capacity of the hypothesis class can be bounded by a smaller term. For itemset, subsequence and subtree features in particular, the capacity is finite even when an infinite number of features is present. We validate these results empirically on three graph datasets and show that the limited capacity of linear classifiers on such data makes underfitting rather than overfitting the more prominent capacity control problem. To avoid underfitting, we propose using more general substructure classes with “elastic edges” and we demonstrate how such broad feature classes can be used with large datasets.
Chapter PDF
Similar content being viewed by others
References
Bartlett, P.L., Mendelson, S.: Rademacher and gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, 463–482 (2003)
Bringmann, B., Zimmermann, A., De Raedt, L., Nijssen, S.: Don’t be afraid of simpler patterns. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 55–66. Springer, Heidelberg (2006)
Deshpande, M., Kuramochi, M., Karypis, G.: Frequent sub-structure-based approaches for classifying chemical compounds. In: IEEE International Conference on Data Mining, p. 35 (2003)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition (Stochastic Modelling and Applied Probability). Springer, New York (1996)
Fang, H., Tong, W., Shi, L.M., Blair, R., Perkins, R., Branham, W., Hass, B.S., Xie, Q., Dial, S.L., Moland, C.L., Sheehan, D.M.: Structure-activity relationships for a large diverse set of natural, synthetic, and environmental estrogens. Chemical Research in Toxicology 14(3), 280–294 (2001)
Li, H., Yap, C.W., Ung, C.Y., Xue, Y., Cao, Z.W., Chen, Y.Z.: Effect of selection of molecular descriptors on the prediction of blood-brain barrier penetrating and nonpenetrating agents by statistical learning methods. Journal of Chemical Information and Modeling 45(5), 1376–1384 (2005)
Otter, R.: The number of trees. The Annals of Mathematics 49(3), 583–599 (1948)
Rückert, U., Kramer, S.: Optimizing feature sets for structured data. In: Kok, J.N., Koronacki, J., Lopez de Mántaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 716–723. Springer, Heidelberg (2007)
Teicher, A.B. (ed.): The NCI Human Tumor Cell Line (60-Cell) Screen, 2nd edn., pp. 41–62. Humana Press, Totowa (1997)
Wale, N., Watson, I.A., Karypis, G.: Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl. Inf. Syst. 14(3), 347–375 (2008)
Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In: ICDM 2002: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), Washington, DC, USA, p. 721. IEEE Computer Society, Los Alamitos (2002)
Yoshida, F., Topliss, J.: QSAR model for drug human oral bioavailability. J. Med. Chem. 43, 2575–2585 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rückert, U. (2009). Capacity Control for Partially Ordered Feature Sets. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science(), vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_21
Download citation
DOI: https://doi.org/10.1007/978-3-642-04174-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7
eBook Packages: Computer ScienceComputer Science (R0)