Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages

  • Conference paper
Grammatical Inference: Theoretical Results and Applications (ICGI 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6339)

Abstract

In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with a practically unlimited n-gram size n. Combined with synchronous back-off, this approach allows us to distinguish between alternative sequences using large contexts. We also show that models of this kind can be built with additional information for each symbol, such as part-of-speech tags and dependency information.
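To make the idea concrete, the following is a minimal sketch of how a suffix array over a token sequence can return the count of an n-gram of any length. It is not the authors' implementation: the function names `build_suffix_array` and `count_ngram` are hypothetical, the construction is a naive sort rather than an enhanced suffix array, and the toy corpus is invented for illustration.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted by suffix content.

    Naive construction (sorting slices); real systems would use a
    linear-time or enhanced suffix array construction instead.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count occurrences of `ngram` (any length) using two binary searches."""
    ngram = list(ngram)
    n = len(ngram)

    # Lower bound: first suffix whose first n tokens are >= ngram.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < ngram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first n tokens are > ngram.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))
print(count_ngram(corpus, sa, ["the", "cat", "sat"]))
```

Because every n-gram is a prefix of some suffix, the same index answers queries for any n without storing a separate table per n-gram size.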

The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.
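One possible reading of synchronous back-off is sketched below, under the assumption that all alternatives are scored with the same n-gram size k, and that k is reduced for all of them together until at least one alternative has been seen in the training data. The abstract does not spell out the procedure, so this is an illustration rather than the authors' method; `synchronous_backoff`, its helpers, and the toy corpus are hypothetical.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by suffix content.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    # Count suffixes that start with `ngram`, using two binary searches.
    ngram, n = list(ngram), len(ngram)

    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + n]
            if prefix < ngram or (strict and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def synchronous_backoff(tokens, sa, context, alternatives):
    """Return (k, counts) for the largest k at which any alternative occurs.

    Each alternative is appended to the last k-1 context tokens; all
    alternatives are always compared at the same k (assumed behaviour).
    """
    for k in range(len(context) + 1, 0, -1):
        ctx = context[len(context) - (k - 1):]
        counts = {alt: count_ngram(tokens, sa, ctx + [alt])
                  for alt in alternatives}
        if any(counts.values()):
            return k, counts
    return 0, {alt: 0 for alt in alternatives}


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ran on the road".split()
sa = build_suffix_array(corpus)
print(synchronous_backoff(corpus, sa, ["sat", "on", "the"], ["mat", "road"]))
print(synchronous_backoff(corpus, sa, ["dog", "on", "the"], ["mat", "road"]))
```

In this sketch the returned k plays the role of the largest feasible k: the first query is decided at k = 4, while the second has to back off to k = 3 because neither 4-gram was observed.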

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stehouwer, H., van Zaanen, M. (2010). Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages. In: Sempere, J.M., García, P. (eds) Grammatical Inference: Theoretical Results and Applications. ICGI 2010. Lecture Notes in Computer Science, vol 6339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15488-1_32

  • DOI: https://doi.org/10.1007/978-3-642-15488-1_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15487-4

  • Online ISBN: 978-3-642-15488-1

  • eBook Packages: Computer Science, Computer Science (R0)
