Bootstrapping Techniques

Egbert, Jesse; Plonsky, Luke

doi:10.1007/978-3-030-46216-1_24

Abstract

Bootstrapping is a statistical technique that relies on randomly sampling with replacement from a set of observed values. Bootstrapping makes it possible to measure the accuracy and reliability of sample estimates and is often recommended for small samples and samples with unknown or non-normal distributions. In corpus linguistics, bootstrapping has also been proposed as a method for quantifying the degree of homogeneity in a corpus sample, for validation of statistical results, and as a methodological step in random decision forests, an advanced classification method. However, to date bootstrapping techniques have seldom been used with corpus data. We argue in this chapter that bootstrapping is underused in corpus linguistics, and that quantitative corpus linguists would do well to add this tool to their repertoire. This chapter includes an introduction to the fundamentals—both conceptual and practical—of bootstrapping methods. We address several applications of bootstrapping, including the measurement of sample estimate accuracy, the validation of statistical models, the estimation of corpus homogeneity, and random forests. We include an overview of two representative studies that have successfully used bootstrapping techniques with corpus data. Finally, we demonstrate how to perform bootstrapping on corpus data using R, and how to visualize and interpret the results.
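The core procedure described in the abstract (resample with replacement, recompute the estimate, and use the spread of the recomputed values to gauge accuracy) can be sketched as follows. This is a minimal illustration in Python, not the chapter's R code; the data values, the function names `bootstrap_replicates` and `percentile_interval`, and the choice of 2,000 resamples are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_replicates(values, statistic, n_resamples=2000, seed=None):
    """Compute `statistic` on `n_resamples` bootstrap resamples of `values`."""
    rng = random.Random(seed)
    n = len(values)
    replicates = []
    for _ in range(n_resamples):
        # Sample n observations with replacement from the observed values.
        resample = rng.choices(values, k=n)
        replicates.append(statistic(resample))
    return replicates


def percentile_interval(replicates, level=0.95):
    """Percentile interval: trim (1 - level) / 2 of the replicates from each tail."""
    ordered = sorted(replicates)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lower, upper


# Hypothetical data: rate of some feature per 1,000 words in 12 texts.
rates = [4.1, 7.3, 2.8, 5.5, 9.0, 3.2, 6.7, 4.9, 12.4, 5.1, 3.8, 6.0]

replicates = bootstrap_replicates(rates, statistics.mean, seed=2020)
sample_mean = statistics.mean(rates)
bootstrap_se = statistics.stdev(replicates)  # spread of the resampled means
ci_lower, ci_upper = percentile_interval(replicates)

print(f"sample mean  = {sample_mean:.2f}")
print(f"bootstrap SE = {bootstrap_se:.2f}")
print(f"95% interval = [{ci_lower:.2f}, {ci_upper:.2f}]")
```

Any function of a sample can be passed as `statistic` (for example `statistics.median`), which is one reason the technique suits small samples and samples whose distribution is unknown.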

Notes

1. There are different ways of writing R code to perform bootstrapping. Another good approach is to write a loop that stores the results from each bootstrapped sample, rather than using a function.
2. Because bootstrapping is based on random re-samples, the output will be slightly different each time the procedure is repeated.
3. The bias-corrected and accelerated (BCa) method is preferred because it corrects for bias and skewness in the bootstrap distribution (see Davison and Hinkley 1997).
4. This can be compared with the original 95% CI of [−0.555, −0.289].
5. Compare with the original 95% CI of [0.508, 0.711].
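Two of the points in these notes lend themselves to a short illustration: run-to-run variation (and how fixing the random seed removes it), and the BCa adjustment. The sketch below is written in Python rather than the chapter's R, handles only the sample mean, and uses hypothetical data; the function names are invented for the example, and the BCa formulas follow the standard textbook definition (bias correction from the share of replicates below the estimate, acceleration from jackknife estimates) rather than anything taken from the chapter's own code.

```python
import random
from statistics import NormalDist, mean


def resample_means(values, n_resamples, seed):
    """Means of bootstrap resamples, drawn with a seeded generator."""
    rng = random.Random(seed)
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def quantile(ordered, prob):
    """Value at cumulative probability `prob` in an already sorted list."""
    index = int(prob * len(ordered))
    return ordered[min(max(index, 0), len(ordered) - 1)]


def bca_interval(values, replicates, level=0.95):
    """BCa interval for the mean, given its bootstrap replicates."""
    std_normal = NormalDist()
    estimate = mean(values)

    # Bias correction: how far the estimate sits from the bootstrap median.
    below = sum(1 for r in replicates if r < estimate)
    z0 = std_normal.inv_cdf(below / len(replicates))

    # Acceleration: skewness of the leave-one-out (jackknife) means.
    total, n = sum(values), len(values)
    jackknife = [(total - v) / (n - 1) for v in values]
    center = mean(jackknife)
    diffs = [center - j for j in jackknife]
    accel = sum(d ** 3 for d in diffs) / (6.0 * sum(d ** 2 for d in diffs) ** 1.5)

    # Shift the nominal tail probabilities, then read off the quantiles.
    ordered = sorted(replicates)
    bounds = []
    for tail in ((1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0):
        z = std_normal.inv_cdf(tail)
        adjusted = std_normal.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))
        bounds.append(quantile(ordered, adjusted))
    return bounds[0], bounds[1]


# Hypothetical right-skewed data.
data = [1.2, 0.8, 3.5, 0.4, 2.1, 0.9, 7.6, 1.5, 0.7, 2.8, 1.1, 4.9]

run_a = resample_means(data, 2000, seed=1)
run_b = resample_means(data, 2000, seed=1)  # same seed: identical replicates
run_c = resample_means(data, 2000, seed=2)  # new seed: slightly different

bca_lower, bca_upper = bca_interval(data, run_a)
print("same seed reproduces run:", run_a == run_b)
print("different seed differs:  ", run_a != run_c)
print(f"BCa 95% interval = [{bca_lower:.2f}, {bca_upper:.2f}]")
```

In R, the analogous step for reproducibility is calling `set.seed()` before resampling. For real analyses an established implementation of BCa is a safer choice than a hand-written one like this.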

References

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Berez, A. L., & Gries, S. T. (2009). In defense of corpus-based methods: A behavioral profile analysis of polysemous get in English. In S. Moran, D. S. Tanner, & M. Scanlon (Eds.), Proceedings of the 24th Northwest linguistics conference. University of Washington working papers in linguistics (Vol. 27, pp. 157–166). Seattle, WA: Department of Linguistics.
Bernaisch, T., Gries, S. T., & Mukherjee, J. (2014). The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide, 35(1), 7–31.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non) utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brezina, V., & Gablasova, D. (2013). Is there a core general vocabulary? Introducing the new general service list. Applied Linguistics, 36(1), 1–22.
Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3, 189–216.
Chernick, M. R. (1999). Bootstrap methods: A practitioner’s guide (Wiley series in probability and statistics). Hoboken, NJ: Wiley.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Deshors, S. C., & Gries, S. T. (2016). Profiling verb complementation constructions across New Englishes. International Journal of Corpus Linguistics, 21(2), 192–218.
Efron, B. (1979). Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21(4), 460–480.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185.
Efron, B. (1992). Bootstrap methods: Another look at the jackknife. In Breakthroughs in statistics (pp. 569–593). New York: Springer.
Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1), 36–48.
Efron, B., & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54–75.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap: Monographs on statistics and applied probability (Vol. 57). New York/London: Chapman and Hall/CRC.
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.
Egbert, J. (2015). Publication type and discipline variation in published academic writing: Investigating statistical interaction in corpus data. International Journal of Corpus Linguistics, 20(1), 1–29.
Egbert, J., & LaFlair, G. T. (2018). Statistics for categorical and distribution-free data. In A. Phakiti, P. I. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave handbook of applied linguistics research methodology. New York: Palgrave.
Egbert, J., Biber, D., & Gray, B. (forthcoming). Towards representativeness in corpus design. Cambridge: Cambridge University Press.
Gardner, D., & Davies, M. (2013). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.
Gries, S. T. (2006). Exploring variability within and between corpora: Some methodological considerations. Corpora, 1(2), 109–151.
Gries, S. T. (2010). Behavioral profiles: A fine-grained and quantitative approach. The Mental Lexicon, 5(3), 323–346.
Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd rev. ed.). Berlin: De Gruyter Mouton.
Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning (2nd ed.). New York: Springer.
Heller, B., Szmrecsanyi, B., & Grafmiller, J. (2017). Stability and fluidity in syntactic variation world-wide: The genitive alternation across varieties of English. Journal of English Linguistics, 45(1), 3–27.
Hinneburg, A., Mannila, H., Kaislaniemi, S., Nevalainen, T., & Raumolin-Brunberg, H. (2007). How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change. Literary and Linguistic Computing, 22(2), 137–150.
Ho, T. K. (2002). A data complexity analysis of comparative advantages of decision forest constructors. Pattern Analysis and Applications, 5, 102–112.
LaFlair, G. T., Egbert, J., & Plonsky, L. (2015). A practical guide to bootstrapping descriptive statistics, correlations, t tests, and ANOVAs. In Advancing quantitative methods in second language research (Vol. 46). New York: Routledge.
Lijffijt, J., Papapetrou, P., Puolamäki, K., & Mannila, H. (2011). Analyzing word frequencies in large text corpora using inter-arrival times and bootstrapping. Machine Learning and Knowledge Discovery in Databases, 341–357.
Lijffijt, J., Säily, T., & Nevalainen, T. (2012). CEECing the baseline: Lexical stability and significant change in a historical corpus. In Studies in variation, contacts and change in English (Vol. 10). Research unit for variation, contacts and change in English (VARIENG).
Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.
Mannila, H., Nevalainen, T., & Raumolin-Brunberg, H. (2013). Quantifying variation and estimating the effects of sample size on the frequencies of linguistic variables. In M. Krug & J. Schlüter (Eds.), Research methods in language variation and change (pp. 337–360). Cambridge: Cambridge University Press.
Nation, I. S. P. (2016). Making and using word lists for language learning and testing. Philadelphia, PA: John Benjamins Publishing.
Plonsky, L., Egbert, J., & LaFlair, G. (2015). Bootstrapping in applied linguistics: Assessing its potential using shared data. Applied Linguistics, 36(5), 591–610.
Säily, T. (2014). Sociolinguistic variation in English derivational productivity: Studies and methods in diachronic corpus linguistics. Mémoires de la Société Néophilologique de Helsinki.
Szmrecsanyi, B., Biber, D., Egbert, J., & Franco, K. (2016a). Towards more accountability: Modeling ternary genitive variation in Late Modern English. Language Variation and Change, 28(1), 1–29.
Szmrecsanyi, B., Grafmiller, J., Heller, B., & Röthlisberger, M. (2016b). Around the world in three alternations. English World-Wide, 37(2), 109–137.
Wagner, S. E., Hesson, A., Bybel, K., & Little, H. (2015). Quantifying the referential function of general extenders in North American English. Language in Society, 44(5), 705–731.
Wolk, C., Bresnan, J., Rosenbach, A., & Szmrecsanyi, B. (2013). Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica, 30(3), 382–419.

Author information

Authors and Affiliations

Northern Arizona University, Flagstaff, AZ, USA
Jesse Egbert & Luke Plonsky

Corresponding author

Correspondence to Jesse Egbert.

Editor information

Editors and Affiliations

FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium
Magali Paquot
Department of Linguistics, University of California, Santa Barbara, CA, USA
Stefan Th. Gries

1 Electronic Supplementary Materials

24_bootstrapping_AWE (ZIP 3 kb)

About this chapter

Cite this chapter

Egbert, J., Plonsky, L. (2020). Bootstrapping Techniques. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_24


DOI: https://doi.org/10.1007/978-3-030-46216-1_24
Published: 05 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)
