Language-Independent Methods for Compiling Monolingual Lexical Data

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Abstract

In this paper, we describe a flexible, portable, and language-independent infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among these, the extraction and use of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications for this language resource. A WWW interface allows public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).
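
The idea of sentence-based word collocations can be illustrated with a minimal sketch: split raw text into sentences, then count how often two words occur in the same sentence. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `split_sentences` and `sentence_cooccurrences`, the naive punctuation-based segmentation rule, and the use of raw pair counts in place of whatever significance measure the actual system applies.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive segmentation (assumed rule): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_cooccurrences(sentences):
    """Count unordered word pairs that appear together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, deduplicate and sort so that each pair is counted once
        # per sentence and always stored in the same (alphabetical) order.
        words = sorted(set(re.findall(r"\w+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


# Toy input, invented for this sketch.
text = "The cat sat. The cat slept. A dog sat."
sentences = split_sentences(text)
pairs = sentence_cooccurrences(sentences)

for pair, count in pairs.most_common(3):
    print(pair, count)
```

Because the unit of co-occurrence is the whole sentence rather than a fixed-size window, the sketch needs no language-specific resources beyond a sentence splitter, which fits the language-independent goal stated in the abstract. A real corpus would need a more robust splitter (abbreviations, numbers) and a statistical filter on the raw counts.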

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Biemann, C., Bordag, S., Heyer, G., Quasthoff, U., Wolff, C. (2004). Language-Independent Methods for Compiling Monolingual Lexical Data. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_27

  • DOI: https://doi.org/10.1007/978-3-540-24630-5_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21006-1

  • Online ISBN: 978-3-540-24630-5

  • eBook Packages: Springer Book Archive
