Abstract
Bootstrapping is a statistical technique that relies on randomly sampling with replacement from a set of observed values. Bootstrapping makes it possible to measure the accuracy and reliability of sample estimates and is often recommended for small samples and samples with unknown or non-normal distributions. In corpus linguistics, bootstrapping has also been proposed as a method for quantifying the degree of homogeneity in a corpus sample, for validation of statistical results, and as a methodological step in random decision forests, an advanced classification method. However, to date bootstrapping techniques have seldom been used with corpus data. We argue in this chapter that bootstrapping is underused in corpus linguistics, and that quantitative corpus linguists would do well to add this tool to their repertoire. This chapter includes an introduction to the fundamentals—both conceptual and practical—of bootstrapping methods. We address several applications of bootstrapping, including the measurement of sample estimate accuracy, the validation of statistical models, the estimation of corpus homogeneity, and random forests. We include an overview of two representative studies that have successfully used bootstrapping techniques with corpus data. Finally, we demonstrate how to perform bootstrapping on corpus data using R, and how to visualize and interpret the results.
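The core idea described above, repeated sampling with replacement from the observed values to gauge the accuracy of an estimate, can be sketched briefly. The chapter itself works in R; the following is a Python illustration only, and the data (`rates`) and the function names are hypothetical rather than taken from the chapter.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=2000, seed=42):
    """Return the mean of each bootstrap resample of `values`."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated exactly
    n = len(values)
    # Each resample draws n observations WITH replacement from the sample.
    return [statistics.fmean(rng.choices(values, k=n))
            for _ in range(n_resamples)]


def percentile_interval(estimates, level=0.95):
    """Simple percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1 - level) / 2
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1 - tail) * last)]


# Hypothetical data: rate of some feature per 1,000 words in ten texts.
rates = [12.1, 8.4, 15.3, 9.9, 30.2, 11.0, 7.6, 13.8, 10.5, 9.2]

boot = bootstrap_means(rates)
standard_error = statistics.stdev(boot)  # spread of the bootstrap means
low, high = percentile_interval(boot)

print(f"sample mean: {statistics.fmean(rates):.2f}")
print(f"bootstrap SE: {standard_error:.2f}")
print(f"95% percentile interval: [{low:.2f}, {high:.2f}]")
```

The standard deviation of the resampled means serves as an estimate of the standard error, and the percentile interval is the simplest of several ways to derive a confidence interval from the bootstrap distribution.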
Notes
1. There are different ways of writing code in R to perform bootstrapping. Another good approach is to write a loop that stores the results from the bootstrapped samples, rather than using a function.
2. Because bootstrapping is based on random resamples, the output will differ slightly each time the procedure is repeated.
3. The bias-corrected and accelerated (BCa) method is preferred because it corrects for bias and skewness in the bootstrap distribution (see Davison and Hinkley 1997).
4. This can be compared with the original 95% CI of [−0.555, −0.289].
5. Compare with the original 95% CI of [0.508, 0.711].
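Notes 1 to 3 can be illustrated together: a loop that stores the result of each resample, a seed that controls run-to-run variation, and a BCa interval. The sketch below is in Python rather than the chapter's R and follows one common textbook formulation of BCa (bias correction from the share of bootstrap estimates below the sample estimate, acceleration from jackknife estimates). The data and all names are hypothetical, and real analyses would normally rely on an established library rather than hand-written code like this.

```python
import random
import statistics
from statistics import NormalDist


def bca_interval(values, stat=statistics.fmean, n_resamples=2000,
                 level=0.95, seed=1):
    """Return (estimate, lower, upper) using a BCa bootstrap interval."""
    rng = random.Random(seed)  # same seed -> identical resamples and output
    n = len(values)
    estimate = stat(values)

    # Loop over resamples and store each bootstrapped statistic (Note 1).
    boot = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=n)  # sampling with replacement
        boot.append(stat(resample))
    boot.sort()

    # Bias correction: how far the bootstrap distribution sits from the
    # sample estimate. The proportion is clamped to keep inv_cdf defined.
    below = sum(b < estimate for b in boot) / n_resamples
    below = min(max(below, 1 / n_resamples), 1 - 1 / n_resamples)
    z0 = NormalDist().inv_cdf(below)

    # Acceleration: skewness of the leave-one-out (jackknife) estimates.
    jack = [stat(values[:i] + values[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    diffs = [jack_mean - j for j in jack]
    denom = 6 * sum(d ** 2 for d in diffs) ** 1.5
    accel = sum(d ** 3 for d in diffs) / denom if denom else 0.0

    def adjusted_quantile(alpha):
        # Shift the nominal percentile by the bias and acceleration terms.
        z = NormalDist().inv_cdf(alpha)
        shifted = NormalDist().cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        return boot[round(shifted * (n_resamples - 1))]

    tail = (1 - level) / 2
    return estimate, adjusted_quantile(tail), adjusted_quantile(1 - tail)


# Hypothetical data: rate of some feature per 1,000 words in ten texts.
rates = [12.1, 8.4, 15.3, 9.9, 30.2, 11.0, 7.6, 13.8, 10.5, 9.2]

estimate, lower, upper = bca_interval(rates)
print(f"mean = {estimate:.2f}, 95% BCa CI = [{lower:.2f}, {upper:.2f}]")
```

Changing `seed` changes the resamples and therefore shifts the interval slightly (Note 2), while keeping it fixed makes the result reproducible.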
References
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Berez, A. L., & Gries, S. T. (2009). In defense of corpus-based methods: A behavioral profile analysis of polysemous get in English. In S. Moran, D. S. Tanner, & M. Scanlon (Eds.), Proceedings of the 24th Northwest linguistics conference. University of Washington working papers in linguistics (Vol. 27, pp. 157–166). Seattle, WA: Department of Linguistics.
Bernaisch, T., Gries, S. T., & Mukherjee, J. (2014). The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide, 35(1), 7–31.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non) utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brezina, V., & Gablasova, D. (2013). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1–22.
Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3, 189–216.
Chernick, M. R. (1999). Bootstrap methods: A practitioner’s guide (Wiley series in probability and statistics). Hoboken, NJ: Wiley.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Deshors, S. C., & Gries, S. T. (2016). Profiling verb complementation constructions across New Englishes. International Journal of Corpus Linguistics, 21(2), 192–218.
Efron, B. (1979). Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21(4), 460–480.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185.
Efron, B. (1992). Bootstrap methods: Another look at the jackknife. In Breakthroughs in statistics (pp. 569–593). New York: Springer.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1), 36–48.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54–75.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap: Monographs on statistics and applied probability (Vol. 57). New York/London: Chapman and Hall/CRC.
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.
Egbert, J. (2015). Publication type and discipline variation in published academic writing: Investigating statistical interaction in corpus data. International Journal of Corpus Linguistics, 20(1), 1–29.
Egbert, J., & LaFlair, G. T. (2018). Statistics for categorical and distribution-free data. In A. Phakiti, P. I. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave handbook of applied linguistics research methodology. New York: Palgrave.
Egbert, J., Biber, D., & Gray, B. (forthcoming). Towards representativeness in corpus design. Cambridge: Cambridge University Press.
Gardner, D., & Davies, M. (2013). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.
Gries, S. T. (2006). Exploring variability within and between corpora: Some methodological considerations. Corpora, 1(2), 109–151.
Gries, S. T. (2010). Behavioral profiles: A fine-grained and quantitative approach. The Mental Lexicon, 5(3), 323–346.
Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd rev. ed.). Berlin: De Gruyter Mouton.
Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning (2nd ed.). New York: Springer.
Heller, B., Szmrecsanyi, B., & Grafmiller, J. (2017). Stability and fluidity in syntactic variation world-wide: The genitive alternation across varieties of English. Journal of English Linguistics, 45(1), 3–27.
Hinneburg, A., Mannila, H., Kaislaniemi, S., Nevalainen, T., & Raumolin-Brunberg, H. (2007). How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change. Literary and Linguistic Computing, 22(2), 137–150.
Ho, T. K. (2002). A data complexity analysis of comparative advantages of decision forest constructors. Pattern Analysis and Applications, 5, 102–112.
LaFlair, G. T., Egbert, J., & Plonsky, L. (2015). A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. In Advancing quantitative methods in second language research (Vol. 46). New York: Routledge.
Lijffijt, J., Papapetrou, P., Puolamäki, K., & Mannila, H. (2011). Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases, 341–357.
Lijffijt, J., Säily, T., & Nevalainen, T. (2012). CEECing the baseline: Lexical stability and significant change in a historical corpus. In Studies in variation, contacts and change in English (Vol. 10). Research unit for variation, contacts and change in English (VARIENG).
Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.
Mannila, H., Nevalainen, T., & Raumolin-Brunberg, H. (2013). Quantifying variation and estimating the effects of sample size on the frequencies of linguistic variables. In M. Krug & J. Schlüter (Eds.), Research methods in language variation and change (pp. 337–360). Cambridge: Cambridge University Press.
Nation, I. S. P. (2016). Making and using word lists for language learning and testing. Philadelphia, PA: John Benjamins Publishing.
Plonsky, L., Egbert, J., & LaFlair, G. (2015). Bootstrapping in applied linguistics: Assessing its potential using shared data. Applied Linguistics, 36(5), 591–610.
Säily, T. (2014). Sociolinguistic variation in English derivational productivity: Studies and methods in diachronic corpus linguistics. Mémoires de la Société Néophilologique de Helsinki.
Szmrecsanyi, B., Biber, D., Egbert, J., & Franco, K. (2016a). Towards more accountability: Modeling ternary genitive variation in Late Modern English. Language Variation and Change, 28(1), 1–29.
Szmrecsanyi, B., Grafmiller, J., Heller, B., & Röthlisberger, M. (2016b). Around the world in three alternations. English World-Wide, 37(2), 109–137.
Wagner, S. E., Hesson, A., Bybel, K., & Little, H. (2015). Quantifying the referential function of general extenders in North American English. Language in Society, 44(5), 705–731.
Wolk, C., Bresnan, J., Rosenbach, A., & Szmrecsanyi, B. (2013). Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica, 30(3), 382–419.
Further Reading
- LaFlair, G. T., Egbert, J., and Plonsky, L. 2015. A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. Advancing Quantitative Methods in Second Language Research, 46. New York: Routledge.
This chapter motivates the use of bootstrapping for analyzing data in second language research. It covers practical considerations to account for when using bootstrapping and provides step-by-step demonstrations of how to analyze data to calculate bootstrap means, standard deviations, correlation coefficients, t-statistics, and F-statistics.
- Egbert, J., and LaFlair, G. T. 2018. Statistics for categorical and distribution-free data. In Handbook of Applied Linguistics Research Methodology, eds. De Costa, P., Phakiti, A., Starfield, S., and Plonsky, L. London: Palgrave Macmillan.
This chapter focuses generally on statistical techniques for analyzing non-traditional and distribution-free data, including a section on the use of the non-parametric bootstrap to analyze such data and the advantages of using bootstrapping over alternatives such as permutation tests.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Egbert, J., Plonsky, L. (2020). Bootstrapping Techniques. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_24
DOI: https://doi.org/10.1007/978-3-030-46216-1_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)