Abstract

This chapter first analyzes the general relation between linguistic analysis and computational method. As a familiar example, automatic word form recognition is used. The example exhibits a number of properties which are methodologically characteristic of all components of grammar. We then show methods for investigating the frequency distribution of words in natural language.


Notes

  1. These considerations should be seen in light of the fact that it is tempting – and therefore common – to quickly write a little hack for a given task. A hack is an ad hoc piece of programming code which works but has no clear algebraic or methodological basis. Hacks are the negative counterpart of a smart solution (Sect. 2.3).

    In computational linguistics, as in other fields, little hacks quickly grow into giant hacks. They may do the job in some narrowly defined application, but their value for linguistic or computational theory is limited at best. Furthermore, they are extremely costly in the long run because even their own authors encounter ever increasing problems in debugging and maintenance, not to mention upscaling within a given language or application to new languages.

  2. To be ensured by the theory of language.

  3. To be ensured by the principle of type transparency, Sect. 9.3.

  4. Especially valuable for improving the generality of the system was the handling of the Korean writing system Hangul. The various options for coding Hangul are described in Lee (1994).

  5. The use of Lisp in the 1988 version allowed for a quick implementation. It employed the same engine for word form recognition and syntax, and was tested on sizable fragments of English and German morphology and syntax. It had the disadvantage, however, that the rules of the respective grammars were defined as Lisp functions, for which reason the system lacked an abstract, declarative formulation.

    The 1990 C version was the first to provide a declarative specification of the allo- and the combi rules based on regular expressions (RegExp) in a tabular ASCII format, interpreted automatically and compiled in C.

    The 1992 reimplementation aimed at improvements in the pattern matching and the trie structure. It did not get wider use, however, because it did not allow for the necessary interaction between the levels of surface and category, and had omitted an interface to syntax.

    The 1994 reimplementation repaired the deficiencies of the 1992 version and was tested on sizable amounts of data in German, French, and Korean. These experiences resulted in many formal and computational innovations.

    The 1995 implementation of the Malaga system took a new approach to the handling of pattern matching, using attribute-value structures. Malaga provides a uniform framework for simultaneous morphological, syntactic, and semantic parsing and generation.

  6. This is commented on by Jespersen (1921), pp. 341–346.

  7. The size of Kaeding’s (1897/1898) corpus was thus more than ten times larger and the number of types twice as large as that of the computer-based 1973 Limas-Korpus (see below).

  8. Named after Brown University in Rhode Island, where Francis was teaching at the time.

  9. The Lancaster-Oslo/Bergen corpus was compiled under the direction of Leech and Johansson. Cf. Hofland and Johansson (1980).

  10. See Hess et al. (1983).

  11. Cf. Bergenholtz (1989), Biber (1993), Oostdijk and de Haan (1994).

  12. The difficulties of constructing balanced and representative corpora are avoided in ‘opportunistic corpora,’ i.e., text collections which contain whatever is easily available. The idea is that the users themselves construct their own corpora by choosing from the opportunistic corpus specific amounts of specific kinds of texts needed for the purpose at hand. This requires, however, that the texts in the opportunistic corpus be preprocessed into a uniform format and classified with respect to their domain, origin, and various other parameters important to different users.

    For example, a user must be able to automatically select a certain year of a newspaper, to exclude certain sections such as private ads, or to select specific sections such as the sports page, the editorial, etc. Providing this kind of functionality in a fast-growing opportunistic corpus of many different kinds of texts is labor-intensive – and therefore largely left undone.

    With most of the corpus work left to the users, the resulting ‘private corpora’ are usually quite small. Also, if the users are content with whatever the opportunistic corpus makes available, the lexically more interesting domains like medicine, physics, law, etc., will be left unanalyzed.
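    The kind of user-driven selection described above amounts to metadata filtering over a preprocessed, classified text collection. The sketch below is a hypothetical illustration: the field names (`year`, `section`) and the `select` function are invented for this example and do not correspond to any actual corpus tool.

    ```python
    # Hypothetical metadata records for texts in an opportunistic corpus;
    # the fields and values are illustrative assumptions only.
    corpus = [
        {"year": 1994, "section": "sports",    "text": "..."},
        {"year": 1994, "section": "ads",       "text": "..."},
        {"year": 1995, "section": "editorial", "text": "..."},
        {"year": 1994, "section": "editorial", "text": "..."},
    ]

    def select(corpus, year=None, include=None, exclude=None):
        """Select texts by year and section, mimicking the kind of
        user-driven filtering described in the note."""
        result = []
        for doc in corpus:
            if year is not None and doc["year"] != year:
                continue
            if include is not None and doc["section"] not in include:
                continue
            if exclude is not None and doc["section"] in exclude:
                continue
            result.append(doc)
        return result

    # E.g., everything from 1994 except the private ads:
    picked = select(corpus, year=1994, exclude={"ads"})
    ```

    The point of the sketch is that such selection is trivial once the classification exists; the labor-intensive part is producing and maintaining the uniform metadata.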

  13. The following type numbers refer to the surfaces of the word forms in the BNC. The numbers published by the authors of the BNC, in contrast, refer to tagged word forms (Sect. 15.5). According to the latter method of counting, the BNC contains 921 073 types.

    The strictly surface-based ranking underlying table 15.4.2 was determined by Marco Zierl at the CL lab of Friedrich Alexander University Erlangen Nürnberg (CLUE).
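    A minimal sketch of strictly surface-based counting, under the simplifying assumptions of whitespace tokenization and lowercasing (the actual BNC counts were of course produced with far more careful tokenization):

    ```python
    from collections import Counter

    def surface_statistics(text):
        """Token, surface type, and hapax legomenon counts for a text.
        A 'surface type' here is a distinct word form string, with no
        grammatical tags attached."""
        tokens = text.lower().split()   # simplifying assumption
        freq = Counter(tokens)
        hapax = [w for w, n in freq.items() if n == 1]
        return len(tokens), len(freq), len(hapax)

    n_tokens, n_types, n_hapax = surface_statistics(
        "the cat saw the dog and the dog saw a bird"
    )
    # 11 tokens, 7 surface types, 4 hapax legomena (cat, and, a, bird)
    ```

    Counting tagged word forms instead would treat, e.g., noun and verb uses of the same surface as distinct types, which is why the two methods yield different totals.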

  14. In corpus linguistics, a language should be continuously documented. This may be done by building a reference corpus for a certain year. Then, for each following year, a monitor corpus is built, showing the changes in the language. The reference corpus and the monitor corpora must be based on the same structure. They should consist of a central part for mainstream, average language, as in daily newspapers and other genres, complemented by domain-specific parts for medicine, law, physics, etc., which may be compiled from associated journals.

  15. From Greek, ‘said once.’

  16. Also known as the Estoup-Zipf-Mandelbrot Law. The earliest observation of the phenomenon was made by the Frenchman Estoup (1916), who – like Kaeding – worked on improving stenography. After doing a statistical analysis of human speech at Bell Labs, Condon (1928) also noted a constant relation between rank and frequency. Mandelbrot, famous for his work on fractals, worked on mathematical refinements of Zipf’s Law (Mandelbrot et al. 1957). Cf. Piotrovskij et al. (1985).

  17. Criticized, among others, by Joos (1936) on empirical grounds. See also the reply in Zipf (1949).

  18. For simplicity, this example is based on the BNC data. Strictly speaking, however, Zipf’s law applies only to individual texts and not to corpora.

  19. The plot of log(frequency) (y axis) vs. log(rank) (x axis) approximates a straight line of slope −1.
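    This can be checked numerically: for ideal Zipfian frequencies f(r) = C/r, a least-squares fit of log(frequency) against log(rank) yields a slope of exactly −1. The sketch below uses synthetic frequencies, not BNC data; on real text the fitted slope only approximates −1.

    ```python
    import math

    def loglog_slope(frequencies):
        """Least-squares slope of log(frequency) against log(rank),
        with rank 1 assigned to the most frequent item."""
        freqs = sorted(frequencies, reverse=True)
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        return cov / var

    # Ideal Zipfian frequencies f(r) = 1000 / r for ranks 1..500
    zipfian = [1000.0 / rank for rank in range(1, 501)]
    slope = loglog_slope(zipfian)  # -1 up to floating-point error
    ```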

  20. The consequences of the choice of tagset for the results of the corpus analysis are discussed in Greenbaum and Yibin (1994), p. 34.

  21. The use of HMMs for the grammatical tagging of corpora is described in, e.g., Leech et al. (1983), Marshall (1983), DeRose (1988), Sharman (1990), Brown et al. (1991, 1992). See also Church and Mercer (1993).
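    To illustrate the basic idea behind such HMM taggers – without reproducing any of the cited systems – the sketch below runs Viterbi decoding over a toy model. The tiny tagset and all probability values are invented for illustration; real taggers estimate them from a hand-tagged core corpus.

    ```python
    def viterbi(words, tags, start_p, trans_p, emit_p):
        """Most probable tag sequence for `words` under a first-order
        HMM. Probabilities are plain dicts; unseen words receive
        emission probability 0.0."""
        # trellis[t][tag] = (best path probability, backpointer tag)
        trellis = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
                    for tag in tags}]
        for t in range(1, len(words)):
            column = {}
            for tag in tags:
                # Best predecessor for this tag at position t
                prev, p = max(((q, trellis[t - 1][q][0] * trans_p[q][tag])
                               for q in tags), key=lambda x: x[1])
                column[tag] = (p * emit_p[tag].get(words[t], 0.0), prev)
            trellis.append(column)
        # Backtrace from the best final state
        best = max(tags, key=lambda tag: trellis[-1][tag][0])
        path = [best]
        for t in range(len(words) - 1, 0, -1):
            best = trellis[t][best][1]
            path.append(best)
        return list(reversed(path))

    # Toy model: all numbers below are invented for illustration.
    tags = ["DET", "NOUN", "VERB"]
    start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
    trans_p = {
        "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
        "NOUN": {"DET": 0.10, "NOUN": 0.10, "VERB": 0.80},
        "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
    }
    emit_p = {
        "DET":  {"the": 1.0},
        "NOUN": {"dog": 0.7, "barks": 0.3},
        "VERB": {"dog": 0.2, "barks": 0.8},
    }

    tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
    ```

    Although "barks" could be a noun, the transition probabilities favor NOUN→VERB after "the dog", which is precisely the kind of contextual disambiguation the cited taggers perform statistically.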

  22. Unfortunately, neither Leech (1995) nor Burnard (ed.) (1995) specifies what exactly constitutes an error in tagging the BNC. A new project to improve the tagger, called ‘The British National Corpus Tag Enhancement Project,’ was started in June 1995; its results were originally scheduled to be made available in September 1996. Instead, it was apparently superseded by the massive manual postediting effort published as the BNC XML edition (Burnard (ed.) 2007).

  23. It is for this reason that a surface may be classified in several different ways, depending on its various environments in the corpus.

References

  • Bergenholtz, H. (1989) “Korpusproblematik in der Computerlinguistik. Konstruktionsprinzipien und Repräsentativität,” in H. Steger (ed.)

  • Beutel, B. (1997) Malaga 4.0, CLUE-Manual, CL Lab, Friedrich Alexander Universität Erlangen Nürnberg

  • Biber, D. (1993) “Representativeness in Corpus Design,” Literary and Linguistic Computing 8.4

  • Bröker, N. (1991) Optimierung von LA-Morph im Rahmen einer Anwendung auf das Deutsche, CLUE-betreute Diplomarbeit der Informatik, Friedrich Alexander Universität Erlangen Nürnberg

  • Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer (1991) “Word Sense Disambiguation Using Statistical Methods,” Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991, 264–270

  • Brown, P., V. Della Pietra, et al. (1992) “Class-based n-gram Models of Natural Language,” Paper read at the Pisa Conference on European Corpus Resources

  • Burnard, L. (ed.) (1995) Users Reference Guide British National Corpus Version 1.0, Oxford: Oxford University Computing Services

  • Burnard, L. (ed.) (2007) Reference Guide for the British National Corpus (XML Edition), Oxford: Oxford University Computing Services

  • Church, K., and R.L. Mercer (1993) “Introduction to the Special Issue on Computational Linguistics Using Large Corpora,” Computational Linguistics 19.1:1–24

  • Condon, E.U. (1928) “Statistics of Vocabulary,” Science 68.1733:300

  • DeRose, S. (1988) “Grammatical Category Disambiguation by Statistical Optimization,” Computational Linguistics 14.1:31–39

  • Estoup, J.B. (1916) Gammes Sténographiques, 4th edn. Paris: Institut Sténographique

  • Francis, W.N. (1980) “A Tagged Corpus: Problems and Prospects,” in S. Greenbaum, G. Leech, and J. Svartvik (eds.), 192–209

  • Garside, R., G. Leech, and G. Sampson (1987) The Computational Analysis of English, London: Longman

  • Greenbaum, S., and N. Yibin (1994) “Tagging the British ICE Corpus: English Word Classes,” in N. Oostdijk and P. de Haan (eds.)

  • Hess, K., J. Brustkern, and W. Lenders (1983) Maschinenlesbare deutsche Wörterbücher, Tübingen: Max Niemeyer Verlag

  • Hofland, K., and S. Johansson (1980) Word Frequencies in British and American English, London: Longman

  • Jespersen, O. (1921) Language. Its Nature, Development, and Origin, reprinted in the Norton Library 1964, New York: W.W. Norton and Company

  • Joos, M. (1936) “Review of G.K. Zipf: The Psycho-biology of Language,” Language 12:3

  • Kaeding, W. (1897/1898) Häufigkeitswörterbuch der deutschen Sprache, Steglitz

  • Kučera, H., and W.N. Francis (1967) Computational Analysis of Present-Day English, Providence: Brown University Press

  • Lee, K.-Y. (1994) “Hangul, the Korean Writing System, and Its Computational Treatment,” LDV-Forum 11.2:26–43

  • Leech, G., R. Garside, and E. Atwell (1983) “The Automatic Grammatical Tagging of the LOB Corpus,” ICAME Journal 7:13–33

  • Leech, G. (1995) “A Brief User’s Guide to the Grammatical Tagging of the British National Corpus,” http://www.natcorp.ox.ac.uk/docs/gramtag.html, accessed September 7, 2013

  • Lorenz, O., and G. Schüller (1994) “Präsentation von LA-MORPH,” LDV-Forum 11.1:39–51, reprinted in R. Hausser (ed.), 1996

  • Mandelbrot, B., L. Apostel, and A. Morf (1957) “Linguistique Statistique Macroscopique,” in J. Piaget (ed.), Logique, Langage et Théorie de L’information, Paris: Bibliothèque scientifique internationale, Études d’épistémologie génétique

  • Marshall, I. (1983) “Choice of Grammatical Word-Class Without Global Syntactic Analysis: Tagging Words in the LOB Corpus,” Computers and the Humanities 17:139–150

  • Marshall, I. (1987) “Tag Selection Using Probabilistic Methods,” in Garside et al. (eds.)

  • Meier, H. (1964) Deutsche Sprachstatistik, Vol. 1, Hildesheim

  • Oostdijk, N. (1988) “A Corpus Linguistic Approach to Linguistic Variation,” in G. Dixon (ed.), Literary and Linguistic Computing, Vol. 3.1

  • Oostdijk, N., and P. de Haan (1994) Corpus-based Research into Language, Amsterdam-Atlanta: Editions Rodopi

  • Piotrovskij, R.G., K.B. Bektaev, and A.A. Piotrovskaja (1985) Mathematische Linguistik, translated from Russian, Bochum: Brockmeyer

  • Sharman, R. (1990) Hidden Markov Model Methods for Word Tagging, Report 214, IBM UK Scientific Centre, Winchester

  • Zipf, G.K. (1935) The Psycho-biology of Language, Oxford: Houghton, Mifflin

  • Zipf, G.K. (1949) Human Behavior and the Principle of Least Effort, Oxford: Addison-Wesley


Exercises

Section 15.1

  1. What is the definition of a grammar system?

  2. What is the function of the formal algorithm in a grammar system?

  3. Why does a grammar system require more than its formal algorithm?

  4. Why must a grammar system be integrated into a theory of language?

  5. Explain the methodological reason why a grammar system must have an efficient implementation on the computer.

  6. Why is a modular separation of grammar system, implementation, and application necessary? Why do they have to be closely correlated?

  7. What differences exist between the various implementations of LA Morph, and what do they have in common?

Section 15.2

  1. Explain the linguistic motivation of a distinctive categorization, using examples.

  2. What are multicats and why do they necessitate an extension of the basic algorithm of LAG?

  3. Compare list-based and attribute-based matching in LA Morph.

  4. What motivates the development of subtheoretical variants?

  5. Why is the transition to a new subtheoretical variant labor-intensive?

  6. What is the difference between a subtheoretical variant and a derived formalism?

Section 15.3

  1. For which purpose did Kaeding investigate the frequency distribution of German?

  2. What is a representative, balanced corpus?

  3. List Kučera and Francis’ desiderata for constructing corpora.

  4. Explain the distinction between the types and the tokens of a corpus.

Section 15.4

  1. Describe the correlation of type and token frequency in the BNC.

  2. What is the percentage of hapax legomena in the BNC?

  3. In what sense are high frequency word forms of low significance and low frequency word forms of high significance?

  4. What is Zipf’s law?

Section 15.5

  1. What motivates the choice of a tagset for statistical corpus analysis?

  2. Why is it necessary for the statistical analysis of a corpus to tag a core corpus by hand?

  3. What is the error rate of the statistical BNC tagger CLAWS4? Does it refer to types or tokens? Is it high or low?

  4. Why does statistical tagging substantially increase the number of types in a corpus? Are these additional types real or spurious?

  5. Is statistical tagging a smart or a solid solution?

  6. What is the role of the preprocessor in the outcome of the statistical analysis of a corpus? Explain your answer using concrete examples.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hausser, R. (2014). Corpus Analysis. In: Foundations of Computational Linguistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41431-2_15

  • DOI: https://doi.org/10.1007/978-3-642-41431-2_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41430-5

  • Online ISBN: 978-3-642-41431-2

  • eBook Packages: Computer Science (R0)
