Charset Encoding Detection of HTML Documents

A Practical Experience
  • Shabanali Faghani
  • Ali Hadian
  • Behrouz Minaei-Bidgoli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9460)

Abstract

Charset encoding detection is a primary task in various web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for detecting the charset encoding of HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when markup is removed from the main content. Therefore, HTML markup and other structural data, such as scripts and styles, are separated from the rendered text of the HTML document using a decoding-encoding trick that preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domains of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.
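
The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: ISO-8859-1 is used as the lossless intermediate charset for the decoding-encoding trick, the regular expressions are a simplified stand-in for a real HTML parser, and the names `strip_markup`, `detect_charset`, and `primary_domain` are hypothetical. The two detectors are passed in as plain callables, so no particular CharDet or ICU binding is assumed, and the rule for combining them is only one possible reading of "estimated domain of expertise".

```python
import re
from typing import Callable, Optional

# A detector maps raw bytes to a charset name, or None if it cannot decide.
Detector = Callable[[bytes], Optional[str]]

# Hypothetical patterns: whole script/style elements and comments are
# dropped first, then any remaining tag.
_NON_CONTENT = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)
_TAG = re.compile(r"<[^>]*>")


def strip_markup(raw: bytes) -> bytes:
    """Remove markup without knowing the document's real charset.

    Assumption: ISO-8859-1 maps each byte 0x00-0xFF to exactly one code
    point, so decoding and re-encoding leaves the content bytes untouched.
    This only works when the markup itself is ASCII-compatible.
    """
    text = raw.decode("iso-8859-1")
    text = _NON_CONTENT.sub(" ", text)
    text = _TAG.sub(" ", text)
    return text.encode("iso-8859-1")


def detect_charset(
    raw: bytes,
    primary: Detector,
    fallback: Detector,
    primary_domain: frozenset = frozenset({"UTF-8"}),
) -> Optional[str]:
    """Combine two detectors on the markup-free bytes.

    Assumed rule: trust the primary detector only when its answer lies in
    the set of charsets it is considered reliable for; otherwise prefer
    the fallback detector, keeping the primary guess as a last resort.
    """
    content = strip_markup(raw)
    guess = primary(content)
    if guess is not None and guess.upper() in primary_domain:
        return guess
    return fallback(content) or guess
```

In use, `primary` and `fallback` would wrap the two detection tools, and `primary_domain` would be chosen from an evaluation of which charsets each tool identifies reliably.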

Keywords

Charset encoding; HTML markups; Multilingual environments

Notes

Acknowledgements

The authors would like to thank Hamed Kordestanchi, who proposed the language-wise evaluation, and Mojtaba Akbarzade for his guidance during this work.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Shabanali Faghani (1)
  • Ali Hadian (1)
  • Behrouz Minaei-Bidgoli (1)
  1. Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran