Charset Encoding Detection of HTML Documents

Faghani, Shabanali; Hadian, Ali; Minaei-Bidgoli, Behrouz

doi:10.1007/978-3-319-28940-3_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Included in the following conference series:

AIRS

Abstract

Charset encoding detection is a primary task in various web-based systems, such as web browsers, email clients, and search engines. In this paper, we present a new hybrid technique for charset encoding detection of HTML documents. Our approach consists of two phases: “Markup Elimination” and “Ensemble Classification”. The Markup Elimination phase is based on the hypothesis that charset encoding detection is more accurate when markup is removed from the main content. Therefore, HTML markup and other structural data, such as scripts and styles, are separated from the rendered text of the HTML document using a decoding-encoding trick that preserves the integrity of the byte sequence. In the Ensemble Classification phase, we leverage two well-known charset encoding detection tools, namely Mozilla CharDet and IBM ICU, and combine their outputs based on their estimated domains of expertise. Results show that the proposed technique significantly improves the accuracy of charset encoding detection over both Mozilla CharDet and IBM ICU.
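The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which intermediate encoding the decoding-encoding trick uses, so ISO-8859-1 is assumed here because it maps every byte to exactly one character and back. The regular expression, the function names, the `Detector` type, and the rule "trust the first detector only for UTF-8" are all hypothetical stand-ins for the paper's markup removal and its "estimated domain of expertise"; the two detectors are passed in as plain callables rather than bound to Mozilla CharDet or IBM ICU.

```python
import re
from typing import Callable, FrozenSet, Optional

# Hypothetical pattern: script/style elements with their bodies, comments,
# and any remaining tags. It runs on text decoded as ISO-8859-1.
_STRUCTURAL = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->|<[^>]*>",
    re.IGNORECASE | re.DOTALL,
)

# A detector takes raw bytes and returns a charset name, or None if unsure.
Detector = Callable[[bytes], Optional[str]]


def strip_markup(raw: bytes) -> bytes:
    """Phase 1 (sketch): drop markup, keep the bytes of the visible text."""
    # Assumption: ISO-8859-1 as the intermediate encoding. Each byte becomes
    # one character, so decoding and re-encoding never alters a byte.
    text = raw.decode("iso-8859-1")
    visible = _STRUCTURAL.sub(" ", text)
    return visible.encode("iso-8859-1")


def detect_charset(
    raw: bytes,
    detector_a: Detector,
    detector_b: Detector,
    a_expertise: FrozenSet[str] = frozenset({"UTF-8"}),
) -> Optional[str]:
    """Phase 2 (sketch): combine two detectors by assumed domain of expertise."""
    guess_a = detector_a(raw)
    # Assumed rule: accept detector A only for charsets it is trusted on.
    if guess_a is not None and guess_a.upper() in a_expertise:
        return guess_a
    # Otherwise ask detector B, giving it the markup-free bytes.
    return detector_b(strip_markup(raw)) or guess_a


if __name__ == "__main__":
    page = "<html><style>p{color:red}</style><p>سلام</p></html>".encode("cp1256")
    print(strip_markup(page))
    print(detect_charset(page, lambda b: None, lambda b: "windows-1256"))
```

In practice the two callables would wrap real detectors; the sketch keeps them abstract so that it does not assume any particular library API.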

Notes

1. https://github.com/shabanali-faghani/IUST-HTMLCharDet

Acknowledgements

The authors would like to thank Hamed Kordestanchi, who proposed the language-wise evaluation, and Mojtaba Akbarzade for his guidance during this work.

Author information

Authors and Affiliations

Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Shabanali Faghani, Ali Hadian & Behrouz Minaei-Bidgoli

Corresponding author

Correspondence to Shabanali Faghani.

Editor information

Editors and Affiliations

Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
Guido Zuccon
Brisbane, Queensland, Australia
Shlomo Geva
University of Tsukuba, Ibaraki, Japan
Hideo Joho
RMIT University, Melbourne, Australia
Falk Scholer
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Aixin Sun
Tianjin University, China
Peng Zhang

About this paper

Cite this paper

Faghani, S., Hadian, A., Minaei-Bidgoli, B. (2015). Charset Encoding Detection of HTML Documents. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_17

DOI: https://doi.org/10.1007/978-3-319-28940-3_17
Published: 22 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28939-7
Online ISBN: 978-3-319-28940-3
eBook Packages: Computer Science, Computer Science (R0)
