Abstract
Using string kernels, languages can be represented as hyperplanes in a high-dimensional feature space. We discuss the language-theoretic properties of this formalism, with particular reference to the implicit feature maps defined by string kernels, considering its expressive power, its closure properties and its relationship to other formalisms. We present a new family of grammatical inference algorithms based on this idea. We show that some mildly context-sensitive languages can be represented in this way and that they can be learned efficiently using kernel PCA. We demonstrate the effectiveness of this approach experimentally on some standard examples of context-sensitive languages, using small synthetic data sets.
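As an illustrative sketch (not the paper's implementation): with an explicit subsequence-count feature map over a small alphabet, the idea of a language as an affine subspace, learned from positive examples by PCA, fits in a few lines. Plain PCA on explicit features stands in here for kernel PCA with an implicit feature map; the feature map, the example language a^n b^n, and the threshold are all illustrative assumptions.

```python
import itertools
import numpy as np

def count_subseq(w, s):
    # Dynamic-programming count of occurrences of s as a
    # (not necessarily contiguous) subsequence of w.
    counts = [1] + [0] * len(s)
    for c in w:
        for i in range(len(s) - 1, -1, -1):
            if s[i] == c:
                counts[i + 1] += counts[i]
    return counts[-1]

def phi(w, alphabet="ab", k=2):
    # Explicit feature map: counts of every subsequence of length <= k.
    return np.array([count_subseq(w, "".join(sub))
                     for length in range(1, k + 1)
                     for sub in itertools.product(alphabet, repeat=length)],
                    dtype=float)

# Positive examples of the (non-regular) language a^n b^n.
train = ["ab", "aabb", "aaabbb", "aaaabbbb"]
X = np.stack([phi(w) for w in train])
mu = X.mean(axis=0)

# Principal subspace of the centered data: linear PCA in feature space,
# equivalent to kernel PCA when the feature map is given explicitly.
_, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
rank = int((S > 1e-8).sum())
V = Vt[:rank]  # orthonormal basis of the learned affine subspace

def residual(w):
    # Distance from phi(w) to the learned affine subspace;
    # (near-)zero means w is accepted as a member of the language.
    v = phi(w) - mu
    return np.linalg.norm(v - V.T @ (V @ v))

assert residual("aaaaabbbbb") < 1e-6   # unseen member: accepted
assert residual("abab") > 1e-3         # non-member: rejected
```

Because every feature is a quadratic polynomial in n for strings a^n b^n, the training images span a two-dimensional affine plane, and any a^m b^m lands on it exactly; any string containing the subsequence "ba" is pushed off the plane.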
Editor: Nicolò Cesa-Bianchi.
Clark, A., Costa Florêncio, C. & Watkins, C. Languages as hyperplanes: grammatical inference with string kernels. Mach Learn 82, 351–373 (2011). https://doi.org/10.1007/s10994-010-5218-3
Keywords
- Kernel methods
- Grammatical inference