Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity in Web Corpora

  • Marina Santini
  • Wiktor Strandqvist
  • Mikael Nyström
  • Marjan Alirezai
  • Arne Jönsson
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)


Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora, and we claim that it is possible to assess their degree of domain-specificity, or domainhood. We present a case study in which we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
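As an illustration of the kind of measure discussed, the following minimal sketch computes a burstiness score per word. It assumes one common formulation, in which burstiness is the total number of occurrences of a word divided by the number of documents that contain it; the abstract does not state the exact formulation used in the paper, so this definition, the function name `burstiness` and the toy documents are assumptions made for illustration only.

```python
from collections import Counter


def burstiness(docs):
    """Return {word: collection frequency / document frequency}.

    Assumed definition (not taken from the abstract): the average number
    of occurrences of a word in the documents that contain it.
    """
    cf = Counter()  # total occurrences of each word across all documents
    df = Counter()  # number of documents in which each word occurs
    for tokens in docs:
        counts = Counter(tokens)
        cf.update(counts)         # add per-document counts
        df.update(counts.keys())  # add 1 per distinct word in this document
    return {word: cf[word] / df[word] for word in cf}


# Hypothetical toy corpus: three tokenized documents.
docs = [
    "the patient takes insulin insulin insulin daily".split(),
    "the nurse checks the patient".split(),
    "the weather is mild".split(),
]
scores = burstiness(docs)
```

Under this assumed definition, a word that clusters in few documents (here `insulin`) receives a higher score than a word spread evenly over the collection (here `the`), which is the intuition behind using burstiness to single out domain-specific words.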



This research was supported by E-care@home, a “SIDUS - Strong Distributed Research Environment” project funded by the Swedish Knowledge Foundation. Project website:



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Marina Santini (1), corresponding author
  • Wiktor Strandqvist (1, 2)
  • Mikael Nyström (1, 2)
  • Marjan Alirezai (3)
  • Arne Jönsson (1, 2)

  1. RISE SICS, Linköping, Sweden
  2. Linköping University, Linköping, Sweden
  3. Örebro University, Örebro, Sweden
