A New Linguistic Engine for NooJ: Parsing Context-Sensitive Grammars with Finite-State Machines

  • Conference paper
Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications (NooJ 2017)

Abstract

NooJ is a linguistic development environment that allows linguists to construct large linguistic resources of the four types in the Chomsky hierarchy. NooJ uses a bottom-up, “cascade” approach to apply these linguistic resources sequentially: each parsing operation accesses a Text Annotation Structure and enriches it by adding or removing linguistic annotations. We discuss the drawbacks of this approach and present a new approach that requires all NooJ linguistic resources to be represented by a single type of finite-state machine. To do so, we must solve theoretical problems such as “how to handle Context-Sensitive Grammars with finite-state machines”, as well as engineering problems such as “how to compose sets of large dictionaries and grammars into a single finite-state machine”. Our first experiments show that, although composing large finite-state machines is extremely costly in theory, the fact that the linguistic resources in a typical NooJ cascade depend heavily on each other keeps the size of all intermediary machines manageable. Once the final finite-state machine has been compiled and loaded in memory (e.g. on a web server), it can be used to parse large texts in linear time.
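The linear-time claim can be illustrated with a minimal sketch of a deterministic finite-state transducer applied in a single left-to-right pass. This is not NooJ's engine: the function `run_transducer`, the table `TOY_NP`, and the bracket-style outputs are hypothetical names chosen for illustration, and the toy pattern (an optional determiner, any number of adjectives, then a noun) stands in for a compiled linguistic resource.

```python
from typing import Dict, List, Optional, Set, Tuple

# (state, input symbol) -> (next state, output symbol)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def run_transducer(
    transitions: Transitions,
    finals: Set[int],
    start: int,
    symbols: List[str],
) -> Optional[List[str]]:
    """Apply a deterministic transducer to a symbol sequence.

    Each input symbol triggers exactly one table lookup, so the running
    time is proportional to len(symbols). Returns the output sequence,
    or None if the input is rejected.
    """
    state = start
    outputs: List[str] = []
    for symbol in symbols:
        step = transitions.get((state, symbol))
        if step is None:  # no transition defined: reject
            return None
        state, out = step
        outputs.append(out)
    return outputs if state in finals else None


# Hypothetical toy machine: optional DET, any number of ADJ, then NOUN.
# Outputs wrap the recognized sequence in an "[NP ... ]" annotation.
TOY_NP: Transitions = {
    (0, "DET"): (1, "[NP DET"),
    (0, "ADJ"): (1, "[NP ADJ"),
    (0, "NOUN"): (2, "[NP NOUN]"),
    (1, "ADJ"): (1, "ADJ"),
    (1, "NOUN"): (2, "NOUN]"),
}

annotated = run_transducer(TOY_NP, {2}, 0, ["DET", "ADJ", "ADJ", "NOUN"])
rejected = run_transducer(TOY_NP, {2}, 0, ["NOUN", "DET"])
```

Because the table is deterministic, the cost per input symbol is one dictionary lookup regardless of how many states the machine has; the expensive part is building the machine, not running it.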

Notes

  1. The time it takes to parse a text is proportional to its length n, i.e. O(n).

  2. See for instance (Kasami 1965).

  3. See for instance XLFG, which is a parser for the LFG formalism.

  4. I.e. one cannot even predict whether a Turing machine will parse any text in finite time. Most linguists doubt that we would need the power of a Turing machine to describe real-world natural languages. (Silberztein 2016a) argued that the typical examples of phenomena that would require unrestricted grammars are “extra-linguistic” in nature (e.g. anaphora resolution).

  5. See (Linden et al. 2010), (Schmid 2005) and (Karttunen et al. 1997).

  6. See (Gazdar 1988).

  7. See (Kaplan and Bresnan 1982) and (Dalrymple 1995).

  8. See (Silberztein 2016a). NooJ is a free, open-source linguistic development environment available at www.nooj-association.org and supported and distributed by the European Metashare platform.

  9. (Silberztein 2016b) shows how NooJ produces several million transformational variants for the simple sentence “Joe loves Lea”.

  10. Translations are performed just like transformations, the only difference being that the translated lexemes are obtained via a lookup of a multilingual dictionary.

  11. The input/output result produced by the corresponding RA Finite-State Machine is underlined.

  12. In NooJ CFG grammars, β is a regular expression built on terminal and non-terminal symbols and <E> (the empty string), e.g.: NP = (<DET> | <E>) <ADJ>* <NOUN>.

  13. This is the definition of left context-sensitive grammars. In right context-sensitive grammars, the non-terminal symbol of the left-hand side is followed by the context, i.e. production rules look like: Aγ → δγ. The equivalence of left and right context-sensitive grammars was established by (Penttonen 1974). Another, more general definition is that context-sensitive grammars contain rules such as γAγ’ → γδγ’. (Kuroda 1964) proves that all these grammars have the same power of description.

  14. This is the case for the grammar of Fig. 3, for which the parsing process produces w-1 intermediary solutions for a word form of length w, because there are w-1 ways to split such a word form into two non-empty affixes.
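The w-1 count in the last note can be checked with a short sketch. The function name `two_way_splits` is hypothetical and is not part of NooJ; it simply enumerates every way to cut a word form into two non-empty parts.

```python
from typing import List, Tuple


def two_way_splits(word: str) -> List[Tuple[str, str]]:
    """Return every split of `word` into two non-empty parts.

    A word of length w has w - 1 internal cut positions, so the
    result has w - 1 entries (and is empty when w < 2).
    """
    return [(word[:i], word[i:]) for i in range(1, len(word))]


splits = two_way_splits("cats")
# [('c', 'ats'), ('ca', 'ts'), ('cat', 's')]
```

For a word form of four letters, this yields three candidate splits, each of which a grammar of this kind would have to examine as an intermediary solution.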

References

  • Chomsky, N.: Three models for the description of language. In: IRE Transactions on Information Theory IT-2, pp. 113–124 (1956). Reprinted in Readings in Mathematical Psychology, vol. 2, pp. 105–124. Wiley, New York (1965)

  • Daciuk, J., Mihov, S., Watson, B.W., Watson, R.E.: Incremental construction of minimal acyclic finite-state automata. Comput. Linguist. 26(1), 3–16 (2000)

  • Dalrymple, M., Kaplan, R., Maxwell, J., et al.: Formal Issues in Lexical-Functional Grammar. CSLI Publications, Stanford (1995)

  • Gazdar, G.: Applicability of indexed grammars to natural languages. In: Reyle, U., Rohrer, C. (eds.) Natural Language Parsing and Linguistic Theories. Studies in Linguistics and Philosophy, vol. 35, pp. 69–94. D. Reidel Publishing Company, Dordrecht (1988)

  • Kaplan, R., Bresnan, J.: Lexical-functional grammar: a formal system for grammatical representation. In: Bresnan, J. (ed.) The Mental Representation of Grammatical Relations, pp. 173–281. MIT Press, Cambridge (1982)

  • Kasami, T.: An efficient recognition and syntax-analysis algorithm for context-free languages. Technical report, AFCRL-65–758 (1965)

  • Linden, K., Silfverberg, M., Pirinen, T.: HFST tools for morphology: an efficient open-source package for construction of morphological analysers. University of Helsinki, Finland (2010)

  • Seljan, S., Vučković, K., Dovedan, Z.: Sentence representation in context-sensitive grammars. In: Suvremena lingvistika, vol. 53–54, pp. 205–218. Hrvatsko filološko društvo (2002)

  • Kuroda, S.-Y.: Classes of languages and linear-bounded automata. Inf. Control 7(2), 207–223 (1964)

  • Penttonen, M.: One-sided and two-sided context in formal grammars. Inf. Control 25(4), 371–392 (1974)

  • Silberztein, M.: Formalizing Natural Languages: The NooJ Approach. Wiley-ISTE, London (2016a)

  • Silberztein, M.: Joe loves Lea: transformational analysis of direct transitive sentences. In: Okrut, T., Hetsevich, Y., Silberztein, M., Stanislavenka, H. (eds.) NooJ 2015. CCIS, vol. 607, pp. 55–65. Springer, Cham (2016b). https://doi.org/10.1007/978-3-319-42471-2_5

  • Schmid, H.: A programming language for finite-state transducers. In: Proceedings of the 5th International Workshop on Finite State Methods in Natural Language Processing (FSMNLP), Helsinki, Finland (2005)

  • Karttunen, L., Tamás, G., Kempe, A.: Xerox finite-state tool, Technical report, Xerox Research Centre Europe (1997)

Author information

Corresponding author

Correspondence to Max Silberztein.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Silberztein, M. (2018). A New Linguistic Engine for NooJ: Parsing Context-Sensitive Grammars with Finite-State Machines. In: Mbarki, S., Mourchid, M., Silberztein, M. (eds) Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications. NooJ 2017. Communications in Computer and Information Science, vol 811. Springer, Cham. https://doi.org/10.1007/978-3-319-73420-0_20

  • DOI: https://doi.org/10.1007/978-3-319-73420-0_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73419-4

  • Online ISBN: 978-3-319-73420-0

  • eBook Packages: Computer Science, Computer Science (R0)
