<tiger2/>: serialising the ISO SynAF syntactic object model

Abstract

This paper introduces <tiger2/>, an XML format developed to serialise the object model defined by the ISO Syntactic Annotation Framework (SynAF). Following widespread best practices, we adapt a popular XML format for syntactic annotation, TigerXML, and extend it with features that support a variety of syntactic phenomena, including constituent and dependency structures, binding, and different node types such as compounds or empty elements. We also define interfaces to other formats and standards, including the Morpho-syntactic Annotation Framework (MAF) and the ISOCat Data Category Registry. Finally, we present a case study of the German Treebank TueBa-D/Z, showing how constituent structures, topological fields and coreference annotation are handled in tandem.

Notes

  1. http://lirics.loria.fr/.

  2. <s> stands for any syntactically annotated segment; this need not be a sentence and can also be larger, as in a textual segment, or smaller, as in a phrase.

  3. Reserved attributes like @type carry the namespace tiger2 so that users can still define their own attribute @type.

  4. An anonymous reviewer asked whether <tiger2/> data could also be represented in a tuple store (e.g. in RDF); this is certainly possible. Note that not all conceivable graphs can be represented in the format: for example, n-to-m edges connecting multiple nodes at a time are not supported.
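
The point made in the last note can be illustrated with a minimal sketch. It assumes that each edge is binary (one source node, one label, one target node), so that every edge maps onto one subject-predicate-object tuple. The node identifiers, edge labels and function names below are hypothetical and are not taken from the <tiger2/> specification; the expansion of an n-to-m connection into several binary edges is only one possible workaround, not something the format itself prescribes.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Triple(NamedTuple):
    """One subject-predicate-object tuple, as a tuple store would hold it."""
    subject: str
    predicate: str
    obj: str


# A binary edge: (source node id, edge label, target node id).
Edge = Tuple[str, str, str]


def edges_to_triples(edges: Sequence[Edge]) -> List[Triple]:
    """Map each binary edge onto exactly one triple."""
    return [Triple(src, label, tgt) for src, label, tgt in edges]


def expand_n_to_m(sources: Sequence[str], label: str,
                  targets: Sequence[str]) -> List[Edge]:
    """Approximate an n-to-m connection by n * m binary edges.

    This is lossy: the information that the edges once formed a single
    connection is not preserved (an assumption of this sketch).
    """
    return [(s, label, t) for s in sources for t in targets]


# Hypothetical node ids and labels, for illustration only.
edges = [("nt1", "HD", "t2"), ("nt1", "DET", "t1")]
triples = edges_to_triples(edges)
expanded = expand_n_to_m(["a", "b"], "rel", ["c", "d", "e"])
```

Here `triples` holds one tuple per edge, while `expanded` shows that a connection between two source nodes and three target nodes has to be broken up into six separate binary edges before it can be stored in this way.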

Author information

Corresponding author

Correspondence to Laurent Romary.

About this article

Cite this article

Romary, L., Zeldes, A. & Zipser, F. <tiger2/>: serialising the ISO SynAF syntactic object model. Lang Resources & Evaluation 49, 1–18 (2015). https://doi.org/10.1007/s10579-014-9288-x

Keywords

  • syntactic annotation
  • XML format
  • corpus
  • corpora
  • Treebank
  • Tiger XML