Language-Independent Methods for Compiling Monolingual Lexical Data

  • Christian Biemann
  • Stefan Bordag
  • Gerhard Heyer
  • Uwe Quasthoff
  • Christian Wolff
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2945)


In this paper, we describe a flexible, portable and language-independent infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications for this language resource. A WWW interface allows for public access to most of the data and information extraction tools (
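As a rough illustration of what sentence-based collocation extraction can look like, the following minimal sketch splits text into sentences and counts how often two word forms occur in the same sentence. It is not the authors' implementation: the function names `split_sentences` and `sentence_cooccurrences`, the naive punctuation-based segmentation rule, and the use of raw counts in place of a statistical significance measure are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive segmentation (an assumption): split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_cooccurrences(sentences):
    """Count, for each unordered word pair, the number of sentences containing both words."""
    counts = Counter()
    for sentence in sentences:
        # Use each lowercased word form once per sentence, sorted so that
        # every pair has a single canonical (alphabetical) order.
        words = sorted(set(re.findall(r"\w+", sentence.lower())))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


text = "The cat sat. The cat slept. A dog sat."
counts = sentence_cooccurrences(split_sentences(text))
print(counts[("cat", "the")])  # number of sentences containing both "cat" and "the"
```

A realistic pipeline would replace the raw counts with a significance measure and would need language-specific handling of abbreviations during segmentation; neither is specified in the abstract, so both are left out here.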


Semantic Relation, Word Form, Query Expansion, Language Resource, Small World Property
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Christian Biemann (1)
  • Stefan Bordag (1)
  • Gerhard Heyer (1)
  • Uwe Quasthoff (1)
  • Christian Wolff (2)

  1. Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany
  2. University of Regensburg, PT 3.3.48, Regensburg
