Abstract
This chapter first analyzes the general relation between linguistic analysis and computational method, using automatic word form recognition as a familiar example. The example exhibits a number of properties that are methodologically characteristic of all components of grammar. We then present methods for investigating the frequency distribution of words in natural language.
Notes
- 1.
These considerations should be seen in light of the fact that it is tempting – and therefore common – to quickly write a little hack for a given task. A hack is an ad hoc piece of programming code which works, but has no clear algebraic or methodological basis. Hacks are a negative example of a smart solution (Sect. 2.3).
In computational linguistics, as in other fields, little hacks quickly grow into giant hacks. They may do the job in some narrowly defined application, but their value for linguistic or computational theory is limited at best. Furthermore, they are extremely costly in the long run because even their own authors encounter ever increasing problems in debugging and maintenance, not to mention upscaling within a given language or application to new languages.
- 2.
To be ensured by the theory of language.
- 3.
To be ensured by the principle of type transparency, Sect. 9.3.
- 4.
Especially valuable for improving the generality of the system was the handling of the Korean writing system Hangul. The various options for coding Hangul are described in Lee (1994).
- 5.
The use of Lisp in the 1988 version allowed for a quick implementation. It employed the same engine for word form recognition and syntax, and was tested on sizable fragments of English and German morphology and syntax. It had the disadvantage, however, that the rules of the respective grammars were defined as Lisp functions, for which reason the system lacked an abstract, declarative formulation.
The 1990 C version was the first to provide a declarative specification of the allo- and the combi rules based on regular expressions (RegExp) in a tabular ASCII format, interpreted automatically and compiled in C.
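The idea of specifying allo rules declaratively as a table of regular expressions can be illustrated with a minimal sketch. The rule table and both rules below are invented for illustration and are not the actual LA Morph rules; they merely show how a tabular, interpreted rule format separates the linguistic specification from the executing program.

```python
import re

# Hypothetical miniature of a tabular allo-rule format: each rule maps a
# regular expression over a base form to one of its allomorphs.
ALLO_RULES = [
    (r"(.*)y$", r"\1ie"),   # e.g. 'fly' -> 'flie' (as in 'flies')
    (r"(.*)e$", r"\1"),     # e.g. 'love' -> 'lov' (as in 'loving')
]

def allomorphs(base):
    """Return the base form plus all allomorphs derived by the rule table."""
    forms = [base]
    for pattern, template in ALLO_RULES:
        m = re.fullmatch(pattern, base)
        if m:
            forms.append(m.expand(template))
    return forms
```

Because the rules are data rather than functions, the same table can be interpreted by different programs, which is the point of a declarative formulation.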
The 1992 reimplementation aimed at improvements in the pattern matching and the trie structure. It did not get wider use, however, because it did not allow for the necessary interaction between the levels of surface and category, and had omitted an interface to syntax.
The 1994 reimplementation repaired the deficiencies of the 1992 version and was tested on sizable amounts of data in German, French, and Korean. These experiences resulted in many formal and computational innovations.
The 1995 implementation of the Malaga system took a new approach to the handling of pattern matching, using attribute-value structures. Malaga provides a uniform framework for simultaneous morphological, syntactic, and semantic parsing and generation.
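Attribute-value matching of the kind Malaga uses can be sketched minimally as follows; the attribute names and the `matches` function are hypothetical, chosen only to show how a rule pattern can match a category by attribute-value pairs rather than by list positions.

```python
# Hypothetical sketch of attribute-value matching: a rule pattern matches
# a category if every attribute the pattern specifies is present in the
# category with the same value.
def matches(pattern, category):
    """True if all attribute-value pairs of the pattern occur in the category."""
    return all(category.get(attr) == val for attr, val in pattern.items())

noun_sg = {"pos": "noun", "num": "sg", "gender": "fem"}
pattern = {"pos": "noun", "num": "sg"}
```

In contrast to list-based matching, the pattern may mention attributes in any order and may underspecify the category, as the empty pattern (matching everything) illustrates.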
- 6.
This is commented on by Jespersen (1921), pp. 341–346.
- 7.
The size of Kaeding’s (1897/1898) corpus was thus more than ten times larger and the number of types twice as large as that of the computer-based 1973 Limas-Korpus (see below).
- 8.
Named after Brown University in Rhode Island, where Francis was teaching at the time.
- 9.
The Lancaster-Oslo/Bergen corpus was compiled under the direction of Leech and Johansson. Cf. Hofland and Johansson (1980).
- 10.
See Hess et al. (1983).
- 11.
- 12.
The difficulties of constructing balanced and representative corpora are avoided in ‘opportunistic corpora,’ i.e., text collections which contain whatever is easily available. The idea is that the users themselves construct their own corpora by choosing from the opportunistic corpus specific amounts of specific kinds of texts needed for the purpose at hand. This requires, however, that the texts in the opportunistic corpus are preprocessed into a uniform format and classified with respect to their domain, origin, and various other parameters important to different users.
For example, a user must be able to automatically select a certain year of a newspaper, to exclude certain sections such as private ads, to select specific sections such as the sports page, the editorial, etc. To provide this kind of functionality in a fast-growing opportunistic corpus of various kinds of texts is labor-intensive – and therefore largely left undone.
With most of the corpus work left to the users, the resulting ‘private corpora’ are usually quite small. Also, if the users are content with whatever the opportunistic corpus makes available, the lexically more interesting domains like medicine, physics, law, etc., will be left unanalyzed.
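The metadata-driven selection described above can be sketched as follows; the field names and records are invented for illustration.

```python
# Invented miniature of an opportunistic corpus: each text carries metadata
# (domain, year, section) by which users assemble their private corpora.
texts = [
    {"domain": "newspaper", "year": 1995, "section": "sports",    "text": "..."},
    {"domain": "newspaper", "year": 1995, "section": "ads",       "text": "..."},
    {"domain": "newspaper", "year": 1996, "section": "editorial", "text": "..."},
    {"domain": "medicine",  "year": 1995, "section": "article",   "text": "..."},
]

def select(corpus, **criteria):
    """Keep only the texts whose metadata match all given criteria."""
    return [t for t in corpus
            if all(t.get(k) == v for k, v in criteria.items())]

sports_1995 = select(texts, domain="newspaper", year=1995, section="sports")
```

The labor-intensive part is not the selection itself but the uniform preprocessing and classification that make such metadata available in the first place.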
- 13.
The following type numbers refer to the surfaces of the word forms in the BNC. The numbers published by the authors of the BNC, in contrast, refer to tagged word forms (Sect. 15.5). According to the latter method of counting, the BNC contains 921 073 types.
The strictly surface-based ranking underlying the table in 15.4.2 was determined by Marco Zierl at the CL lab of Friedrich Alexander University Erlangen Nürnberg (CLUE).
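The difference between the two counting methods can be shown with a toy example; the data are invented, and only the counting principle (surface types vs. tagged types) reflects the text.

```python
# Toy illustration: counting types by surface alone vs. by tagged word
# form, i.e. (surface, tag) pairs. A surface occurring with several tags
# counts as one surface type but as several tagged types.
tokens = [("run", "VERB"), ("run", "NOUN"), ("run", "VERB"),
          ("the", "DET"), ("the", "DET")]

surface_types = {surface for surface, _ in tokens}  # {'run', 'the'}
tagged_types = set(tokens)                          # 3 distinct pairs
```

This is why tag-based counting yields a substantially higher type number for the BNC than surface-based counting.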
- 14.
In corpus linguistics, a language should be continuously documented. This may be done by building a reference corpus for a certain year, and then, for each following year, a monitor corpus showing the changes in the language. The reference corpus and the monitor corpora must be based on the same structure. They should consist of a central part for mainstream, average language, as in daily newspapers and other genres, complemented by domain-specific parts for medicine, law, physics, etc., which may be compiled from associated journals.
- 15.
From Greek, ‘said once.’
- 16.
Also known as the Estoup-Zipf-Mandelbrot Law. The earliest observation of the phenomenon was made by the Frenchman Estoup (1916), who – like Kaeding – worked on improving stenography. After doing a statistical analysis of human speech at Bell Labs, Condon (1928) also noted a constant relation between rank and frequency. Mandelbrot, famous for his work on fractals, contributed mathematical refinements of Zipf’s Law (Mandelbrot et al. 1957). Cf. Piotrovskij et al. (1985).
- 17.
- 18.
For simplicity, this example is based on the BNC data. Strictly speaking, however, Zipf’s law applies only to individual texts and not to corpora.
- 19.
The plot of log(frequency) (y axis) vs. log(rank) (x axis) approximates a straight line of slope −1.
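Assuming word frequencies roughly follow Zipf’s law, the slope can be estimated from a rank-frequency list by a least-squares fit over the log-log points. The toy text below is invented, so its slope only loosely approximates −1.

```python
import math
from collections import Counter

# Toy text: frequency ~ C / rank predicts a log-log slope near -1.
words = ("the of and the a of the in a the of and the the a "
         "to the of in and").split()

# Frequencies sorted into rank order (rank 1 = most frequent).
freqs = sorted(Counter(words).values(), reverse=True)
points = [(math.log(rank), math.log(f))
          for rank, f in enumerate(freqs, start=1)]

# Least-squares slope of the log-log points.
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
slope = (sum((x - mx) * (y - my) for x, y in points) /
         sum((x - mx) ** 2 for x, _ in points))
```

On a corpus of realistic size the fit is much tighter, though, as noted above, strictly speaking the law applies to individual texts rather than to corpora.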
- 20.
The consequences of the tagset choice on the results of the corpus analysis are mentioned in Greenbaum and Yibin (1994), p. 34.
- 21.
- 22.
Unfortunately, neither Leech (1995) nor Burnard (ed.) (1995) specify what exactly constitutes an error in tagging the BNC. A new project to improve the tagger was started in June 1995, however. Called ‘The British National Corpus Tag Enhancement Project’, its results were originally scheduled to be made available in September 1996. Instead, it was apparently superseded by the massive manual postediting effort published as the BNC XML edition (Burnard (ed.) 2007).
- 23.
It is for this reason that a surface may be classified in several different ways, depending on its various environments in the corpus.
References
Bergenholtz, H. (1989) “Korpusproblematik in der Computerlinguistik. Konstruktionsprinzipien und Repräsentativität,” in H. Steger (ed.)
Beutel, B. (1997) Malaga 4.0, CLUE-Manual, CL Lab, Friedrich Alexander Universität Erlangen Nürnberg
Biber, D. (1993) “Representativeness in Corpus Design,” Literary and Linguistic Computing 8.4
Bröker, N. (1991) Optimierung von LA-Morph im Rahmen einer Anwendung auf das Deutsche, CLUE-betreute Diplomarbeit der Informatik, Friedrich Alexander Universität Erlangen Nürnberg
Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer (1991) “Word Sense Disambiguation Using Statistical Methods,” Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991, 264–270
Brown, P., V. Della Pietra, et al. (1992) “Class-based n-gram Models of Natural Language,” Paper read at the Pisa Conference on European Corpus Resources
Burnard, L. (ed.) (1995) Users Reference Guide British National Corpus Version 1.0, Oxford: Oxford University Computing Services
Burnard, L. (ed.) (2007) Reference Guide for the British National Corpus (XML Edition), Oxford: Oxford University Computing Services
Church, K., and R.L. Mercer (1993) “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” Computational Linguistics 19.1:1–24
Condon, E.U. (1928) “Statistics of Vocabulary,” Science 68.1733:300
DeRose, S. (1988) “Grammatical Category Disambiguation by Statistical Optimization,” Computational Linguistics 14.1:31–39
Estoup, J.B. (1916) Gammes Sténographiques, 4th edn. Paris: Institut Sténographique
Francis, W.N. (1980) “A Tagged Corpus: Problems and Prospects,” in S. Greenbaum, G. Leech, and J. Svartvik (eds.), 192–209
Garside, R., G. Leech, and G. Sampson (1987) The Computational Analysis of English, London: Longman
Greenbaum, S., and N. Yibin (1994) “Tagging the British ICE Corpus: English Word Classes,” in N. Oostdijk and P. de Haan (eds.)
Hess, K., J. Brustkern, and W. Lenders (1983) Maschinenlesbare deutsche Wörterbücher, Tübingen: Max Niemeyer Verlag
Hofland, K., and S. Johansson (1980) Word Frequencies in British and American English, London: Longman
Jespersen, O. (1921) Language. Its Nature, Development, and Origin, reprinted in the Norton Library 1964, New York: W.W. Norton and Company
Joos, M. (1936) “Review of G.K. Zipf: The Psycho-biology of Language,” Language 12:3
Kaeding, W. (1897/1898) Häufigkeitswörterbuch der deutschen Sprache, Steglitz
Kučera, H., and W.N. Francis (1967) Computational Analysis of Present-Day English, Providence: Brown University Press
Lee, K.-Y. (1994) “Hangul, the Korean Writing System, and Its Computational Treatment,” LDV-Forum 11.2:26–43
Leech, G., R. Garside, and E. Atwell (1983) “The Automatic Grammatical Tagging of the LOB Corpus,” ICAME Journal 7:13–33
Leech, G. (1995) “A Brief User’s Guide to the Grammatical Tagging of the British National Corpus,” http://www.natcorp.ox.ac.uk/docs/gramtag.html, accessed September 7, 2013
Lorenz, O., and G. Schüller (1994) “Präsentation von LA-MORPH,” LDV-Forum 11.1:39–51, reprinted in R. Hausser (ed.), 1996
Mandelbrot, B., L. Apostel, and A. Morf (1957) “Linguistique Statistique Macroscopique,” in J. Piaget (ed.), Logique, Langage et Théorie de L’information, Paris: Bibliothèque scientifique internationale, Études d’épistémologie génétique
Marshall, I. (1983) “Choice of Grammatical Word-Class Without Global Syntactic Analysis: Tagging Words in the LOB Corpus,” Computers and the Humanities 17:139–150
Marshall, I. (1987) “Tag Selection Using Probabilistic Methods,” in Garside et al. (eds.)
Meier, H. (1964) Deutsche Sprachstatistik, Vol. 1, Hildesheim
Oostdijk, N. (1988) “A Corpus Linguistic Approach to Linguistic Variation,” in G. Dixon (ed.), Literary and Linguistic Computing, Vol. 3.1
Oostdijk, N., and P. de Haan (1994) Corpus-based Research into Language, Amsterdam-Atlanta: Editions Rodopi
Piotrovskij, R.G., K.B. Bektaev, and A.A. Piotrovskaja (1985) Mathematische Linguistik, translated from Russian, Bochum: Brockmeyer
Sharman, R. (1990) Hidden Markov Model Methods for Word Tagging, Report 214, IBM UK Scientific Centre, Winchester
Zipf, G.K. (1935) The Psycho-biology of Language, Oxford: Houghton, Mifflin
Zipf, G.K. (1949) Human Behavior and the Principle of Least Effort, Oxford: Addison-Wesley
Exercises
Section 15.1
- 1.
What is the definition of a grammar system?
- 2.
What is the function of the formal algorithm in a grammar system?
- 3.
Why does a grammar system require more than its formal algorithm?
- 4.
Why must a grammar system be integrated into a theory of language?
- 5.
Explain the methodological reason why a grammar system must have an efficient implementation on the computer.
- 6.
Why is a modular separation of grammar system, implementation, and application necessary? Why do they have to be closely correlated?
- 7.
What differences exist between the various implementations of LA Morph, and what do they have in common?
Section 15.2
- 1.
Explain the linguistic motivation of a distinctive categorization, using examples.
- 2.
What are multicats and why do they necessitate an extension of the basic algorithm of LAG?
- 3.
Compare list-based and attribute-based matching in LA Morph.
- 4.
What motivates the development of subtheoretical variants?
- 5.
Why is the transition to a new subtheoretical variant labor-intensive?
- 6.
What is the difference between a subtheoretical variant and a derived formalism?
Section 15.3
- 1.
For which purpose did Kaeding investigate the frequency distribution of German?
- 2.
What is a representative, balanced corpus?
- 3.
List Kučera and Francis’ desiderata for constructing corpora.
- 4.
Explain the distinction between the types and the tokens of a corpus.
Section 15.4
- 1.
Describe the correlation of type and token frequency in the BNC.
- 2.
What is the percentage of hapax legomena in the BNC?
- 3.
In what sense are high frequency word forms of low significance and low frequency word forms of high significance?
- 4.
What is Zipf’s law?
Section 15.5
- 1.
What motivates the choice of a tagset for statistical corpus analysis?
- 2.
Why is it necessary for the statistical analysis of a corpus to tag a core corpus by hand?
- 3.
What is the error rate of the statistical BNC tagger CLAWS4? Does it refer to types or tokens? Is it high or low?
- 4.
Why does statistical tagging substantially increase the number of types in a corpus? Are these additional types real or spurious?
- 5.
Is statistical tagging a smart or a solid solution?
- 6.
What is the role of the preprocessor in the outcome of the statistical analysis of a corpus? Explain your answer using concrete examples.
© 2014 Springer-Verlag Berlin Heidelberg
Hausser, R. (2014). Corpus Analysis. In: Foundations of Computational Linguistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41431-2_15
Print ISBN: 978-3-642-41430-5
Online ISBN: 978-3-642-41431-2