Bootstrapping Techniques

A Practical Handbook of Corpus Linguistics

Abstract

Bootstrapping is a statistical technique that relies on randomly sampling with replacement from a set of observed values. Bootstrapping makes it possible to measure the accuracy and reliability of sample estimates and is often recommended for small samples and samples with unknown or non-normal distributions. In corpus linguistics, bootstrapping has also been proposed as a method for quantifying the degree of homogeneity in a corpus sample, for validation of statistical results, and as a methodological step in random decision forests, an advanced classification method. However, to date bootstrapping techniques have seldom been used with corpus data. We argue in this chapter that bootstrapping is underused in corpus linguistics, and that quantitative corpus linguists would do well to add this tool to their repertoire. This chapter includes an introduction to the fundamentals—both conceptual and practical—of bootstrapping methods. We address several applications of bootstrapping, including the measurement of sample estimate accuracy, the validation of statistical models, the estimation of corpus homogeneity, and random forests. We include an overview of two representative studies that have successfully used bootstrapping techniques with corpus data. Finally, we demonstrate how to perform bootstrapping on corpus data using R, and how to visualize and interpret the results.
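The resampling idea described in the abstract can be sketched in a few lines. The chapter itself works in R; the example below uses Python purely for illustration, and the function name `bootstrap_ci`, its parameters, and the frequency values are hypothetical, not taken from the chapter. It computes a simple percentile confidence interval for a mean, which is only one of several ways to summarize a bootstrap distribution.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, level=0.95, seed=1):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic each time, and read the interval off the sorted replicates."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    replicates = sorted(
        stat(rng.choices(values, k=n))  # one resample of the same size, with replacement
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = replicates[round(tail * n_resamples)]
    upper = replicates[round((1 - tail) * n_resamples) - 1]
    return stat(values), lower, upper

# Hypothetical normalized frequencies (per 1,000 words) of a feature in ten texts
rates = [3.1, 4.7, 2.2, 5.9, 3.8, 4.1, 6.4, 2.9, 3.5, 4.4]
estimate, lower, upper = bootstrap_ci(rates)
print(f"mean = {estimate:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Passing a different `stat` (for example `statistics.median`) reuses the same procedure for another estimate, which is part of what makes the technique attractive for small or non-normal samples.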

Notes

  1. There are different ways of writing code in R to perform bootstrapping. Another good approach would be to create a loop that stores results from bootstrapped samples, rather than using a function.

  2. Because bootstrapping is based on random re-samples, the output will be slightly different each time we repeat the procedure.

  3. The bias-corrected and accelerated (BCa) method is preferred because it corrects for bias and skewness in the bootstrap distribution (see Davison and Hinkley 1997).

  4. This can be compared with the original 95% CI of [−0.555, −0.289].

  5. Compare with the original 95% CI of [0.508, 0.711].
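Notes 1 to 3 can be illustrated together. The following is a minimal sketch in Python rather than the chapter's R code: a plain loop stores the bootstrapped means, an optional seed controls whether repeated runs agree, and a BCa interval is computed from the standard textbook formulas (bias correction from the share of replicates below the estimate, acceleration from jackknife estimates). The function names and the count data are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def bootstrap_loop(values, n_resamples=2000, seed=None):
    """Store bootstrapped means in a plain loop (the alternative in note 1)."""
    rng = random.Random(seed)  # seed=None gives different draws on every run (note 2)
    results = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))  # sampling with replacement
        results.append(statistics.mean(resample))
    return results

def bca_interval(values, replicates, level=0.95):
    """BCa interval for the mean (note 3); assumes `values` is a list and
    that some, but not all, replicates fall below the sample estimate."""
    norm = NormalDist()
    estimate = statistics.mean(values)
    # Bias correction: where the estimate sits within the bootstrap distribution
    below = sum(r < estimate for r in replicates) / len(replicates)
    z0 = norm.inv_cdf(below)
    # Acceleration: skewness of the leave-one-out (jackknife) estimates
    jack = [statistics.mean(values[:i] + values[i + 1:]) for i in range(len(values))]
    jack_mean = statistics.mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den
    ordered = sorted(replicates)
    bounds = []
    for p in ((1 - level) / 2, (1 + level) / 2):
        z = norm.inv_cdf(p)
        adjusted = norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        index = min(len(ordered) - 1, int(adjusted * len(ordered)))
        bounds.append(ordered[index])
    return bounds[0], bounds[1]

# Hypothetical right-skewed counts of a rare feature in ten texts
counts = [0, 1, 1, 2, 2, 3, 4, 6, 9, 17]
replicates = bootstrap_loop(counts, seed=42)
lower, upper = bca_interval(counts, replicates)
print(f"mean = {statistics.mean(counts):.2f}, BCa 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Calling `bootstrap_loop(counts)` without a seed yields slightly different replicates on each run, as note 2 describes; fixing the seed makes a reported interval reproducible.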

References

  • Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

  • Berez, A. L., & Gries, S. T. (2009). In defense of corpus-based methods: A behavioral profile analysis of polysemous get in English. In S. Moran, D. S. Tanner, & M. Scanlon (Eds.), Proceedings of the 24th Northwest linguistics conference. University of Washington working papers in linguistics (Vol. 27, pp. 157–166). Seattle, WA: Department of Linguistics.

  • Bernaisch, T., Gries, S. T., & Mukherjee, J. (2014). The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide, 35(1), 7–31.

  • Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

  • Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.

  • Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non) utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464.

  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

  • Brezina, V., & Gablasova, D. (2013). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1–22.

  • Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3, 189–216.

  • Chernick, M. R. (1999). Bootstrap methods: A practitioner’s guide (Wiley series in probability and statistics). Hoboken, NJ: Wiley.

  • Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

  • Deshors, S. C., & Gries, S. T. (2016). Profiling verb complementation constructions across New Englishes. International Journal of Corpus Linguistics, 21(2), 192–218.

  • Efron, B. (1979). Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21(4), 460–480.

  • Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185.

  • Efron, B. (1992). Bootstrap methods: Another look at the jackknife. In Breakthroughs in statistics (pp. 569–593). New York: Springer.

  • Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1), 36–48.

  • Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54–75.

  • Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap: Monographs on statistics and applied probability (Vol. 57). New York/London: Chapman and Hall/CRC.

  • Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.

  • Egbert, J. (2015). Publication type and discipline variation in published academic writing: Investigating statistical interaction in corpus data. International Journal of Corpus Linguistics, 20(1), 1–29.

  • Egbert, J., & LaFlair, G. T. (2018). Statistics for categorical and distribution-free data. In A. Phakiti, P. I. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave handbook of applied linguistics research methodology. New York: Palgrave.

  • Egbert, J., Biber, D., & Gray, B. (forthcoming). Towards representativeness in corpus design. Cambridge: Cambridge University Press.

  • Gardner, D., & Davies, M. (2013). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.

  • Gries, S. T. (2006). Exploring variability within and between corpora: Some methodological considerations. Corpora, 1(2), 109–151.

  • Gries, S. T. (2010). Behavioral profiles: A fine-grained and quantitative approach. The Mental Lexicon, 5(3), 323–346.

  • Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd rev. ed.). Berlin: De Gruyter Mouton.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning (2nd ed.). New York: Springer.

  • Heller, B., Szmrecsanyi, B., & Grafmiller, J. (2017). Stability and fluidity in syntactic variation world-wide: The genitive alternation across varieties of English. Journal of English Linguistics, 45(1), 3–27.

  • Hinneburg, A., Mannila, H., Kaislaniemi, S., Nevalainen, T., & Raumolin-Brunberg, H. (2007). How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change. Literary and Linguistic Computing, 22(2), 137–150.

  • Ho, T. K. (2002). A data complexity analysis of comparative advantages of decision forest constructors. Pattern Analysis and Applications, 5, 102–112.

  • LaFlair, G. T., Egbert, J., & Plonsky, L. (2015). A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. In Advancing quantitative methods in second language research (Vol. 46). New York: Routledge.

  • Lijffijt, J., Papapetrou, P., Puolamäki, K., & Mannila, H. (2011). Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases, 341–357.

  • Lijffijt, J., Säily, T., & Nevalainen, T. (2012). CEECing the baseline: Lexical stability and significant change in a historical corpus. In Studies in variation, contacts and change in English (Vol. 10). Research unit for variation, contacts and change in English (VARIENG).

  • Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.

  • Mannila, H., Nevalainen, T., & Raumolin-Brunberg, H. (2013). Quantifying variation and estimating the effects of sample size on the frequencies of linguistic variables. In M. Krug & J. Schlüter (Eds.), Research methods in language variation and change (pp. 337–360). Cambridge: Cambridge University Press.

  • Nation, I. S. P. (2016). Making and using word lists for language learning and testing. Philadelphia, PA: John Benjamins Publishing.

  • Plonsky, L., Egbert, J., & LaFlair, G. (2015). Bootstrapping in applied linguistics: Assessing its potential using shared data. Applied Linguistics, 36(5), 591–610.

  • Säily, T. (2014). Sociolinguistic variation in English derivational productivity: Studies and methods in diachronic corpus linguistics. Mémoires de la Société Néophilologique de Helsinki.

  • Szmrecsanyi, B., Biber, D., Egbert, J., & Franco, K. (2016a). Towards more accountability: Modeling ternary genitive variation in Late Modern English. Language Variation and Change, 28(1), 1–29.

  • Szmrecsanyi, B., Grafmiller, J., Heller, B., & Röthlisberger, M. (2016b). Around the world in three alternations. English World-Wide, 37(2), 109–137.

  • Wagner, S. E., Hesson, A., Bybel, K., & Little, H. (2015). Quantifying the referential function of general extenders in North American English. Language in Society, 44(5), 705–731.

  • Wolk, C., Bresnan, J., Rosenbach, A., & Szmrecsanyi, B. (2013). Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica, 30(3), 382–419.

Author information

Corresponding author

Correspondence to Jesse Egbert.

Further Reading

  • LaFlair, G.T., Egbert, J., and Plonsky, L. 2015. A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. Advancing Quantitative Methods in Second Language Research, 46, New York: Routledge.

This chapter motivates the use of bootstrapping for analyzing data in second language research. It covers practical considerations to account for when using bootstrapping and provides step-by-step demonstrations of how to analyze data to calculate bootstrap means, standard deviations, correlation coefficients, t-statistics, and F-statistics.

  • Egbert, J., and LaFlair, G.T. 2018. Statistics for categorical and distribution-free data. In Handbook of Applied Linguistics Research Methodology, eds. De Costa, P., Phakiti, A., Starfield, S., and Plonsky, L. London: Palgrave Macmillan.

This chapter focuses generally on statistical techniques for analyzing non-traditional and distribution-free data, including a section on the use of the non-parametric bootstrap to analyze such data and the advantages of using bootstrapping over alternatives such as permutation tests.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Egbert, J., Plonsky, L. (2020). Bootstrapping Techniques. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_24
