Charset Encoding Detection of HTML Documents

A Practical Experience

  • Conference paper
Information Retrieval Technology (AIRS 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Abstract

Charset encoding detection is a primary task in many web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for detecting the charset encoding of HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when markup is removed from the main content. Therefore, HTML markup and other structural data, such as scripts and styles, are separated from the rendered text of the HTML document using a decoding-encoding trick that preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domains of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.
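
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: ISO-8859-1 is used for the decoding-encoding round trip (it maps every byte to exactly one code point, so no byte is lost), a regular expression stands in for a real HTML parser, and the routing rule "trust the first detector only when it reports UTF-8, otherwise defer to the second" is one possible reading of "estimated domain of expertise". The names `strip_markup_bytes`, `detect_charset`, and `looks_like_utf8` are hypothetical, and the detectors are passed in as plain callables rather than bound to Mozilla CharDet or IBM ICU.

```python
import re
from typing import Callable, Optional

# A detector takes raw bytes and returns a charset name, or None if unsure.
Detector = Callable[[bytes], Optional[str]]

# Assumption: a regex is good enough for a sketch. It matches whole
# <script>/<style> blocks, then comments, then any remaining tag.
_MARKUP = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>|<!--.*?-->|<[^>]*>")


def strip_markup_bytes(raw: bytes) -> bytes:
    """Phase 1 (sketch): remove markup without knowing the real charset.

    ISO-8859-1 maps each byte 0x00-0xFF to one code point, so decoding and
    re-encoding with it returns the original bytes of whatever text remains.
    This assumes an ASCII-compatible encoding, where '<' and '>' are single
    bytes; encodings such as UTF-16 would need separate handling.
    """
    as_text = raw.decode("iso-8859-1")
    visible = _MARKUP.sub(" ", as_text)
    return visible.encode("iso-8859-1")


def looks_like_utf8(data: bytes) -> Optional[str]:
    """Stand-in for the first detector: a strict UTF-8 validity check."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "UTF-8"


def detect_charset(raw: bytes, utf8_expert: Detector, fallback: Detector) -> Optional[str]:
    """Phase 2 (sketch): combine two detectors by assumed area of expertise.

    Assumed routing rule: accept the first detector only when it answers
    UTF-8; for every other case, use the second detector's answer.
    """
    text_bytes = strip_markup_bytes(raw)
    guess = utf8_expert(text_bytes)
    if guess is not None and guess.upper() == "UTF-8":
        return "UTF-8"
    return fallback(text_bytes)


if __name__ == "__main__":
    page = "<html><body><p>caf\u00e9 au lait</p></body></html>".encode("cp1252")
    # The lambda is a placeholder for a real legacy-encoding detector.
    print(detect_charset(page, looks_like_utf8, lambda data: "windows-1252"))
```

Passing the detectors as arguments keeps the sketch independent of any particular library; bindings for real detectors could be wrapped in the same `Detector` signature.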

Notes

  1. https://github.com/shabanali-faghani/IUST-HTMLCharDet

Acknowledgements

The authors would like to thank Hamed Kordestanchi, who proposed the language-wise evaluation, and Mojtaba Akbarzade for his guidance during this work.

Author information

Corresponding author

Correspondence to Shabanali Faghani.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Faghani, S., Hadian, A., Minaei-Bidgoli, B. (2015). Charset Encoding Detection of HTML Documents. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_17

  • DOI: https://doi.org/10.1007/978-3-319-28940-3_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28939-7

  • Online ISBN: 978-3-319-28940-3

  • eBook Packages: Computer Science (R0)
