Capacity Control for Partially Ordered Feature Sets

Rückert, Ulrich

doi:10.1007/978-3-642-04174-7_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5782)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Partially ordered feature sets appear naturally in many classification settings with structured input instances, for example, when the data instances are graphs and a feature tests whether a specific substructure occurs in the instance. Since such features are partially ordered according to an “is substructure of” relation, the information in those datasets is stored in an intrinsically redundant form. We investigate how this redundancy affects the capacity control behavior of linear classification methods. From a theoretical perspective, it can be shown that the capacity of this hypothesis class does not decrease for worst case distributions. However, if the data generating distribution assigns lower probabilities to instances in the lower levels of the hierarchy induced by the partial order, the capacity of the hypothesis class can be bounded by a smaller term. For itemset, subsequence and subtree features in particular, the capacity is finite even when an infinite number of features is present. We validate these results empirically on three graph datasets and show that the limited capacity of linear classifiers on such data makes underfitting rather than overfitting the more prominent capacity control problem. To avoid underfitting, we propose using more general substructure classes with “elastic edges” and we demonstrate how such broad feature classes can be used with large datasets.
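
The redundancy described above can be illustrated with a toy sketch for itemset features. This is not code from the paper: the item names, the helper functions `itemset_features`, `feature_vector` and `implied_pairs`, and the example instance are all hypothetical. It only shows that when features are ordered by the subset relation, a specific feature firing on an instance forces every more general feature to fire as well.

```python
from itertools import combinations


def itemset_features(items, max_size):
    """Enumerate all non-empty itemsets up to max_size as frozensets."""
    return [frozenset(c)
            for k in range(1, max_size + 1)
            for c in combinations(sorted(items), k)]


def feature_vector(instance, features):
    """Binary vector: 1 if the feature itemset is a subset of the instance."""
    return [int(f <= instance) for f in features]


def implied_pairs(features):
    """Pairs (general, specific) where general is a proper subset of specific."""
    return [(a, b) for a in features for b in features if a < b]


# Hypothetical toy data: three items and one instance containing two of them.
items = {"a", "b", "c"}
features = itemset_features(items, max_size=3)
instance = frozenset({"a", "b"})
x = dict(zip(features, feature_vector(instance, features)))

# Redundancy check: whenever a specific feature fires,
# every more general feature fires too.
for general, specific in implied_pairs(features):
    assert x[specific] <= x[general]
```

In this sketch the value of a feature is never larger than the value of any of its sub-itemsets, so part of each feature vector is determined by the rest. That is the kind of intrinsic redundancy whose effect on capacity the abstract discusses.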

Author information

Authors and Affiliations

International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA, 94704
Ulrich Rückert

Editor information

Editors and Affiliations

NICTA, Locked Bag 8001, Canberra, 2601, Australia and Helsinki Institute of IT, Finland
Wray Buntine
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Marko Grobelnik & Dunja Mladenić
The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., WC1E 6BT, London, UK
John Shawe-Taylor

Cite this paper

Rückert, U. (2009). Capacity Control for Partially Ordered Feature Sets. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science, vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_21

DOI: https://doi.org/10.1007/978-3-642-04174-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7
eBook Packages: Computer Science, Computer Science (R0)
