Language-Independent Methods for Compiling Monolingual Lexical Data

Biemann, Christian; Bordag, Stefan; Heyer, Gerhard; Quasthoff, Uwe; Wolff, Christian

doi:10.1007/978-3-540-24630-5_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2945)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper we describe a flexible, portable and language-independent infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed using a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among these, the extraction and use of sentence-based word collocations is discussed in detail. Finally, we give an overview of different applications for this language resource. A WWW interface provides public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).
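As a rough illustration of what sentence-based word collocations involve, the sketch below splits text into sentences, counts how often two words share a sentence, and scores each pair. It is not the authors' implementation: the abstract specifies neither the segmentation algorithm nor the significance measure, so the regex splitter and the Dice coefficient are stand-ins, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased word tokens; \\w+ matches Unicode letters in Python 3."""
    return re.findall(r"\w+", sentence.lower())


def sentence_cooccurrences(text):
    """Count, per word and per unordered word pair, the sentences containing them."""
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words of a pair
    for sentence in split_sentences(text):
        # Each word counts once per sentence; sorting gives pairs a canonical order.
        words = sorted(set(tokenize(sentence)))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    return word_freq, pair_freq


def dice(pair, word_freq, pair_freq):
    """Dice coefficient as a stand-in association score (not the paper's measure)."""
    a, b = pair
    return 2 * pair_freq[pair] / (word_freq[a] + word_freq[b])


if __name__ == "__main__":
    sample = "The cat sat. The cat ran! A dog ran."
    wf, pf = sentence_cooccurrences(sample)
    # Rank pairs by score, strongest association first.
    for pair in sorted(pf, key=lambda p: dice(p, wf, pf), reverse=True):
        print(pair, round(dice(pair, wf, pf), 2))
```

Because the sentence is the only unit of context, the approach needs no language-specific resources beyond a sentence splitter and a tokenizer, which is what makes it portable across languages.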

Author information

Authors and Affiliations

Computer Science Institute, NLP Dept., Leipzig University, Augustusplatz 10/11, 04109, Leipzig, Germany
Christian Biemann, Stefan Bordag, Gerhard Heyer & Uwe Quasthoff
University of Regensburg PT 3.3.48, 93040, Regensburg
Christian Wolff

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

Cite this paper

Biemann, C., Bordag, S., Heyer, G., Quasthoff, U., Wolff, C. (2004). Language-Independent Methods for Compiling Monolingual Lexical Data. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_27

DOI: https://doi.org/10.1007/978-3-540-24630-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive
