On the Characteristics of Language Tags on the Web

  • Conference paper
Passive and Active Measurement (PAM 2018)

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 10771)

Abstract

The Internet is a global phenomenon. To support broad use of Internet applications such as the World Wide Web, character encodings have been developed for many scripts of the world’s languages and there are standard mechanisms for indicating that content is in a particular language and/or tailored to a particular region. In this paper we study the empirical characteristics of language tags used in HTTP transactions and in web pages to indicate the language of the content and possibly the script, region, and other information. To support our analysis, we develop a new algorithm to infer the value of a missing language tag for elements used to link to alternative language content. We analyze the top-level page for websites in the Alexa Top 1 Million, from six geographic perspectives. We find that one third of all pages do not include any language tags, that half of the remaining sites are tagged with English (en), and that about 10 K sites have malformed tags. We observe that 80 K sites are multilingual, and that there are hundreds of sites that offer content in the tens of languages. Besides malformed tags, we find numerous instances of correctly formed but likely erroneous language tags by using a Naïve Bayes-based language detection library and comparing its output with a given page’s language tag(s). Lastly, we comment on differences in language tags observed for the same site but from different geographic vantage points or by using different client language preferences via the HTTP Accept-Language header.
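As a rough illustration of the kind of page-level data the abstract describes, the sketch below collects the `lang` attribute of the `<html>` element and the `hreflang` values of `<link rel="alternate">` elements, then applies a deliberately simplified shape check to each tag. It is not the paper's tooling or its inference algorithm: the class name `LangTagCollector`, the helper `looks_well_formed`, the sample markup, and the regular expression are all assumptions made for this example, and the pattern covers only a small part of the BCP 47 grammar without consulting the IANA subtag registry.

```python
import re
from html.parser import HTMLParser

# Simplified shape check (an assumption for this sketch, not the BCP 47 grammar):
# either a 2-3 letter primary subtag followed by optional 1-8 character
# alphanumeric subtags, or a private-use tag starting with "x-".
# Grandfathered tags and registry validation are intentionally ignored.
SIMPLE_TAG = re.compile(
    r"(?:[A-Za-z]{2,3}(?:-[A-Za-z0-9]{1,8})*|[xX](?:-[A-Za-z0-9]{1,8})+)"
)


class LangTagCollector(HTMLParser):
    """Collect the page-level lang attribute and hreflang link targets."""

    def __init__(self):
        super().__init__()
        self.page_lang = None   # value of <html lang="...">, if present
        self.alternates = []    # (hreflang, href) pairs from <link rel="alternate">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.page_lang = attrs["lang"].strip()
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "alternate" in rel and attrs.get("hreflang"):
                self.alternates.append(
                    (attrs["hreflang"].strip(), attrs.get("href"))
                )


def looks_well_formed(tag):
    """Return True if the tag matches the simplified pattern above."""
    return bool(tag) and SIMPLE_TAG.fullmatch(tag) is not None


# Hypothetical markup; example.org URLs are placeholders.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en-US">
<head>
  <link rel="alternate" hreflang="de" href="https://example.org/de/">
  <link rel="alternate" hreflang="en_GB" href="https://example.org/uk/">
</head>
<body><p>Hello</p></body>
</html>"""

collector = LangTagCollector()
collector.feed(SAMPLE_HTML)
collector.close()

print("page lang:", collector.page_lang, looks_well_formed(collector.page_lang))
for hreflang, href in collector.alternates:
    print(hreflang, href, looks_well_formed(hreflang))
```

The underscore in `en_GB` is one plausible kind of malformed tag that such a check would flag; a fuller treatment would parse tags against BCP 47 and the subtag registry rather than rely on a single pattern.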

Notes

  1. The structure and valid values of language tags are specified in IETF BCP 47 [9] and the IANA language subtag registry [1], respectively, as discussed below.

  2. https://searchenginewatch.com/sew/howto/238631/localization-for-international-search-engine-optimization.

  3. https://github.com/jsommers/weblingo.

  4. https://searchenginewatch.com/sew/how-to/2232347/a-simple-guide-to-using-rel-alternate-hreflang-x.

  5. http://docs.python-requests.org/en/master/.

  6. http://s3.amazonaws.com/alexa-static/top-1m.csv.zip.

  7. http://download.geonames.org/export/dump/countryInfo.txt.

  8. https://www.crummy.com/software/BeautifulSoup/.

  9. http://lxml.de.

  10. https://github.com/jsommers/langtags.

  11. https://github.com/LuminosoInsight/langcodes.

  12. https://github.com/aboSamoor/pycld2.

  13. http://www.unesco.org/languages-atlas/.

References

  1. IANA Language Subtag Registry. https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

  2. World Wide Web Consortium. Internationalization techniques: Authoring HTML and CSS, January 2016. https://www.w3.org/International/techniques/authoring-html

  3. Abbate, J.: Inventing the Internet. MIT Press, Cambridge (2000)

  4. Fielding, R., Reschke, J.: RFC 7230: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing, June 2014. https://tools.ietf.org/html/rfc7230

  5. Fielding, R., Reschke, J.: RFC 7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content, June 2014. https://tools.ietf.org/html/rfc7231

  6. Grefenstette, G., Nioche, J.: Estimation of English and non-English language use on the WWW. In: Content-Based Multimedia Information Access, vol. 1, pp. 237–246 (2000)

  7. Ishida, R.: Language tags in HTML and XML. https://www.w3.org/International/articles/language-tags/

  8. Ishida, R.: Declaring language in HTML (2014). https://www.w3.org/International/questions/qa-html-language-declarations

  9. Phillips, A., Davis, M.: Tags for Identifying Language, September 2009. https://www.rfc-editor.org/rfc/bcp/bcp47.txt

  10. Ishida, R.: Choosing a Language Tag (2016). https://www.w3.org/International/questions/qa-choosing-language-tags

  11. Thomas, C., Kline, J., Barford, P.: IntegraTag: a framework for high-fidelity web client measurement. In: 2016 28th International Teletraffic Congress (ITC 28), vol. 1, pp. 278–285 (2016)

  12. Xu, F.: Multilingual WWW. Knowledge-based information retrieval and filtering from the web 746, 165 (2003)

Acknowledgments

We thank Alex Nie ’20 and Ryan Rios ’20, who contributed to earlier stages of this work as part of their summer research. We also thank Ram Durairajan for insightful comments on this work, as well as the anonymous reviewers. Lastly, we thank the Colgate Research Council, which provided partial support for this research.

Author information

Corresponding author

Correspondence to Joel Sommers.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Sommers, J. (2018). On the Characteristics of Language Tags on the Web. In: Beverly, R., Smaragdakis, G., Feldmann, A. (eds) Passive and Active Measurement. PAM 2018. Lecture Notes in Computer Science, vol 10771. Springer, Cham. https://doi.org/10.1007/978-3-319-76481-8_2

  • DOI: https://doi.org/10.1007/978-3-319-76481-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76480-1

  • Online ISBN: 978-3-319-76481-8

  • eBook Packages: Computer Science, Computer Science (R0)
