Part of the book series: Linguistica Computazionale (LICO, volume 9)

Abstract

The present paper addresses a number of issues related to achieving ‘representativeness’ in linguistic corpus design, including: discussion of what it means to ‘represent’ a language, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyzes the distributions of several particular features, and it discusses the implications of these distributions for corpus design.
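The contrast between proportional and stratified sampling can be made concrete with a small sketch. It is only an illustration under stated assumptions: the register names, the population figures, and the two helper functions are hypothetical and do not come from the paper. Proportional allocation mirrors each register's share of the target population, while a stratified design with equal quotas guarantees that small registers are still well represented.

```python
def proportional_allocation(stratum_sizes, total_texts):
    """Allocate texts to each stratum in proportion to its population share."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_texts * size / population)
        for name, size in stratum_sizes.items()
    }


def equal_allocation(stratum_sizes, total_texts):
    """Allocate the same number of texts to every stratum (equal quotas)."""
    per_stratum = total_texts // len(stratum_sizes)
    return {name: per_stratum for name in stratum_sizes}


# Hypothetical population sizes (number of texts produced per register).
strata = {"conversation": 9000, "newspaper": 700, "academic prose": 300}

print(proportional_allocation(strata, 500))
print(equal_allocation(strata, 500))
```

With these made-up figures, the proportional design draws almost all of its texts from conversation, whereas the equal-quota design samples every register at the same depth. Which is preferable depends on whether the corpus is meant to mirror overall usage or to cover the range of variation.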

The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analyzed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.
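One way a pilot corpus can inform sampling decisions is by estimating how many texts are needed to measure a feature's mean frequency within a chosen margin of error. The sketch below uses the standard survey-sampling formula n = (z · s / E)², where s is the standard deviation observed in the pilot texts, E is the tolerable error, and z is the critical value for the desired confidence level. The function name, the default z of 1.96, and the pilot counts are assumptions for illustration, not values taken from the paper.

```python
import math
import statistics


def required_texts(pilot_counts, tolerable_error, z=1.96):
    """Estimate the number of texts needed to pin down a feature's mean rate.

    pilot_counts: per-text feature rates observed in a pilot corpus.
    tolerable_error: acceptable half-width of the confidence interval.
    z: critical value (1.96 corresponds to roughly 95% confidence).
    """
    s = statistics.stdev(pilot_counts)  # sample standard deviation
    return math.ceil((z * s / tolerable_error) ** 2)


# Hypothetical per-1,000-word rates of some feature in eight pilot texts.
pilot = [12.0, 18.5, 9.0, 22.0, 15.5, 11.0, 19.0, 14.0]

print(required_texts(pilot, tolerable_error=1.0))
print(required_texts(pilot, tolerable_error=2.0))
```

Features that vary widely across texts yield a larger s and therefore demand more texts, which is why the estimate would be revisited as each collection cycle supplies better variance figures.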

I would like to thank Edward Finegan for his many helpful comments on an earlier draft of this paper. A modified version of this paper was distributed for the Pisa Workshop on Textual Corpora, held at the University of Pisa (January 1992), and discussions with several of the workshop participants were also helpful in revising the paper.


Editor information

Antonio Zampolli, Nicoletta Calzolari, Martha Palmer

Copyright information

© 1994 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Biber, D. (1994). Representativeness in Corpus Design. In: Zampolli, A., Calzolari, N., Palmer, M. (eds) Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale, vol 9. Springer, Dordrecht. https://doi.org/10.1007/978-0-585-35958-8_20

  • DOI: https://doi.org/10.1007/978-0-585-35958-8_20

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-0-7923-2998-5

  • Online ISBN: 978-0-585-35958-8

  • eBook Packages: Springer Book Archive
