Abstract
In this paper we describe a flexible, portable, and language-independent infrastructure for building large monolingual language corpora. The approach collects large amounts of monolingual text from various sources and processes the input with a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as the query types and tools available for information extraction; among these, the extraction and use of sentence-based word collocations is discussed in detail. Finally, we give an overview of applications for this language resource. A WWW interface provides public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de).
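As a rough illustration of what sentence-based word collocations involve, the following minimal sketch splits text into sentences and counts how many sentences contain each pair of words. It is not the paper's implementation: the abstract does not specify the segmentation rules or the significance measure, so the regex splitter, the tokenizer, and the Dice coefficient used for scoring are assumptions chosen only to keep the example short, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercased runs of word characters (assumption: no language-specific rules)."""
    return re.findall(r"\w+", sentence.lower())

def sentence_cooccurrences(text):
    """Return per-word and per-pair sentence frequencies.

    A pair is counted once per sentence in which both words occur;
    pairs are stored as alphabetically ordered tuples.
    """
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in split_sentences(text):
        words = sorted(set(tokenize(sentence)))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    return word_freq, pair_freq

def dice(pair, word_freq, pair_freq):
    """Dice coefficient as a stand-in association score (assumption)."""
    a, b = pair
    return 2 * pair_freq[pair] / (word_freq[a] + word_freq[b])

sample = "The cat sat on the mat. The dog sat on the rug. A cat chased the dog."
word_freq, pair_freq = sentence_cooccurrences(sample)
for pair, count in pair_freq.most_common(3):
    print(pair, count, round(dice(pair, word_freq, pair_freq), 2))
```

Counting each pair at most once per sentence keeps the statistics tied to sentence frequency rather than raw token frequency; a production system would likely replace the Dice score with a significance test suited to large corpora.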
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Biemann, C., Bordag, S., Heyer, G., Quasthoff, U., Wolff, C. (2004). Language-Independent Methods for Compiling Monolingual Lexical Data. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_27
DOI: https://doi.org/10.1007/978-3-540-24630-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive