Representing Standard Text Formulations as Directed Graphs

Part of the Lecture Notes in Computer Science book series (LNIP, volume 12917)


To ensure validity in legal texts such as contracts and case law, lawyers rely on standardised formulations that are carefully worded and also act as a kind of code whose meaning and function are known to all legal experts. By representing standardised text fragments as directed (acyclic) graphs, we can capture variations in time specifications, slight rephrasings, names, and places, as well as OCR errors. We show how such text fragments can be found by sentence clustering, pattern detection, and clustering of patterns. To test the proposed methods, we use two corpora of German contracts and court decisions, compiled specifically for this purpose. The entire process for representing standardised text fragments is nevertheless language-agnostic. We analyse and compare both corpora, give a quantitative and qualitative analysis of the text fragments found, and present a number of examples from both corpora.
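The idea of folding several variants of one formulation into a single directed acyclic graph can be illustrated with a small sketch. It is not the authors' implementation: it assumes whitespace tokenisation, treats the first sentence as a backbone, and aligns every further sentence to it with Python's standard difflib. The function name build_variant_graph, the START and END markers, and the example sentences are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

START, END = "<s>", "</s>"  # hypothetical boundary markers


def build_variant_graph(sentences):
    """Merge sentence variants into one directed acyclic word graph.

    Returns (tokens, edges): tokens maps node id -> word, edges maps
    (source id, target id) -> number of sentences using that edge.
    Assumption: the first sentence is the backbone; later sentences
    reuse backbone nodes where they match and branch where they differ.
    """
    backbone = [START] + sentences[0].split() + [END]
    tokens = dict(enumerate(backbone))
    edges = defaultdict(int)
    for node in range(len(backbone) - 1):
        edges[(node, node + 1)] += 1

    branch_nodes = {}  # identical deviations at the same place share nodes
    for sentence in sentences[1:]:
        variant = [START] + sentence.split() + [END]
        matcher = SequenceMatcher(a=backbone, b=variant, autojunk=False)
        path = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                path.extend(range(i1, i2))  # reuse backbone nodes
                continue
            segment = tuple(variant[j1:j2])  # empty for a pure deletion
            for offset, word in enumerate(segment):
                key = (i1, i2, segment, offset)
                if key not in branch_nodes:
                    branch_nodes[key] = len(tokens)
                    tokens[branch_nodes[key]] = word
                path.append(branch_nodes[key])
        for source, target in zip(path, path[1:]):
            edges[(source, target)] += 1
    return tokens, dict(edges)


variants = [
    "This agreement enters into force on 1 January 2020",
    "This agreement enters into force on 15 March 2021",
    "This contract enters into force on 1 January 2020",
]
tokens, edges = build_variant_graph(variants)
for (source, target), count in sorted(edges.items()):
    print(f"{tokens[source]} -> {tokens[target]} ({count})")
```

In this sketch the shared wording stays on one path with high edge counts, while the date and the noun variation appear as side branches. Because branch nodes are only ever placed between two backbone positions, the graph cannot contain a cycle. A realistic pipeline would first need the sentence clustering step to decide which sentences belong to the same formulation.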


  • Graph-based text representations
  • Legal writings
  • Standardised formulation

  • DOI: 10.1007/978-3-030-86159-9_34
  • Chapter length: 13 pages


  1.

    The sources for the documents compiled for both corpora will be published on our website: Likewise, we publish the developed methods and also the document collections on our project page.



Author information


Corresponding author

Correspondence to Frieda Josi.

A Appendices

A.1 Sources for Case Law Corpus

  1.

    Bundesgerichtshof (BGH) – Decisions from criminal law:

A.2 Sources for Contract Corpus

  1.

    Stadtverwaltung Hansestadt Hamburg – City administration of Hamburg:

  2.

    Stadtverwaltung Bremen – City administration of Bremen:, Keyword: Vertrag

  3.

    Cooperation contracts between universities and between universities and service providers: We searched specifically for contract files on university websites and added them to the Contract corpus.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Josi, F., Wartena, C., Heid, U. (2021). Representing Standard Text Formulations as Directed Graphs. In: Barney Smith, E.H., Pal, U. (eds) Document Analysis and Recognition – ICDAR 2021 Workshops. ICDAR 2021. Lecture Notes in Computer Science, vol 12917. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86158-2

  • Online ISBN: 978-3-030-86159-9

  • eBook Packages: Computer Science, Computer Science (R0)