Identifying Content and Function Words in Non-Annotated Corpora

Steiner, Petra

doi:10.1007/978-3-322-81289-6_23

Part of the book series: Sprachwissenschaft (SPRAWI)

Abstract

Every corpus of natural language texts shows certain tendencies that stem from general properties of language, such as the principle of least effort. One of these phenomena is a typical distribution of frequency classes: a relatively small number of word types covers the bulk of the text, while a very large part of the vocabulary occurs only once. These latter types are called singletons or hapax legomena.
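The distribution described in the abstract can be inspected on any plain-text sample. The following is only a minimal sketch and not the chapter's own method: the function name `frequency_profile`, the `top_k` parameter, and the naive tokenization (lowercased runs of letters) are assumptions made for illustration. It counts tokens and types, lists the hapax legomena, and reports what share of the running text the `top_k` most frequent types cover.

```python
import re
from collections import Counter


def frequency_profile(text, top_k=10):
    """Summarize the word-frequency distribution of a text.

    Hypothetical helper for illustration only; the tokenization is a
    simplifying assumption and ignores hyphens, apostrophes, and digits.
    """
    # Tokens are lowercased runs of Unicode letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    # Hapax legomena: types that occur exactly once.
    hapaxes = sorted(word for word, freq in counts.items() if freq == 1)

    # Number of tokens accounted for by the top_k most frequent types.
    covered = sum(freq for _, freq in counts.most_common(top_k))

    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": hapaxes,
        "hapax_share_of_types": len(hapaxes) / n_types if n_types else 0.0,
        "top_k_coverage": covered / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    profile = frequency_profile(sample, top_k=2)
    for key, value in profile.items():
        print(key, value)
```

On a realistic corpus one would expect `top_k_coverage` to be high even for a small `top_k`, and `hapax_share_of_types` to be large, which is the pattern the abstract refers to.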

Author information

Authors and Affiliations

Berkeley, USA
Petra Steiner

Editor information

Editors and Affiliations

Münster, Deutschland
Lea Cyrus, Hendrik Feddes, Frank Schumacher & Petra Steiner

About this chapter

Cite this chapter

Steiner, P. (2003). Identifying Content and Function Words in Non-Annotated Corpora. In: Cyrus, L., Feddes, H., Schumacher, F., Steiner, P. (eds) Sprache zwischen Theorie und Technologie / Language between Theory and Technology. Sprachwissenschaft. Deutscher Universitätsverlag, Wiesbaden. https://doi.org/10.1007/978-3-322-81289-6_23

DOI: https://doi.org/10.1007/978-3-322-81289-6_23
Publisher Name: Deutscher Universitätsverlag, Wiesbaden
Print ISBN: 978-3-8244-4513-4
Online ISBN: 978-3-322-81289-6
eBook Packages: Springer Book Archive
