Part of the book series: Sprachwissenschaft (SPRAWI)

Abstract

In every corpus of natural language texts, some tendencies arise from general properties of language, such as the principle of least effort. One of these phenomena is a typical distribution of frequency classes: a relatively small number of word types covers the bulk of the text, while a large part of the vocabulary occurs only once. The latter types are called singletons or hapax legomena.
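
The frequency-class distribution described above can be illustrated with a minimal Python sketch. It is not the chapter's method: the function name `frequency_profile`, the regex tokenizer, and the `top_n` parameter are assumptions made for illustration only. The sketch counts word types, collects the hapax legomena, and reports what share of all tokens the most frequent types cover.

```python
import re
from collections import Counter


def frequency_profile(text, top_n=3):
    """Summarise the word-frequency distribution of a text.

    Hypothetical helper for illustration; the tokenizer is a simple
    lowercase word-character regex, not a linguistically informed one.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)

    # Hapax legomena (singletons): types that occur exactly once.
    hapaxes = sorted(word for word, freq in counts.items() if freq == 1)

    # Share of all tokens covered by the top_n most frequent types.
    top = counts.most_common(top_n)
    coverage = sum(freq for _, freq in top) / len(tokens) if tokens else 0.0

    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapaxes": hapaxes,
        "top": top,
        "coverage": coverage,
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    profile = frequency_profile(sample, top_n=3)
    print(profile["tokens"], "tokens,", profile["types"], "types")
    print("hapax legomena:", profile["hapaxes"])
    print("coverage of top 3 types:", round(profile["coverage"], 3))
```

Even in this toy sentence, three of the eight types account for more than half of the tokens, while five types occur only once. On a real corpus the contrast is typically much sharper.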

Copyright information

© 2003 Deutscher Universitäts-Verlag GmbH, Wiesbaden

About this chapter

Cite this chapter

Steiner, P. (2003). Identifying Content and Function Words in Non-Annotated Corpora. In: Cyrus, L., Feddes, H., Schumacher, F., Steiner, P. (eds) Sprache zwischen Theorie und Technologie / Language between Theory and Technology. Sprachwissenschaft. Deutscher Universitätsverlag, Wiesbaden. https://doi.org/10.1007/978-3-322-81289-6_23

  • DOI: https://doi.org/10.1007/978-3-322-81289-6_23

  • Publisher Name: Deutscher Universitätsverlag, Wiesbaden

  • Print ISBN: 978-3-8244-4513-4

  • Online ISBN: 978-3-322-81289-6

  • eBook Packages: Springer Book Archive
