Abstract
With the growing use of web and mobile communication, the language, form and content of present-day web documents have become increasingly complex, incorporating different kinds of media files and multilingual text. The mnemonic approach of mapping consonants to digits, and inserting suitable vowels so that words can be formed, acts as a multilingual generator and, if done properly, can generate a universal language. For instance, in Telugu all words end with vowels, and in Tamil each basic character is spelt the same way irrespective of how it is joined. Extracting content and information so that it reaches the web surfer easily has become important. On this basis, a generic method is developed starting from the pixel level so that it can be applied effectively. The pixel matrix is converted into selected attributes, and a modified form of pattern matching is then used to assess the content. Because the method is generic, results are reported for character-level and word-level performance on data in four different languages.
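The pixel-to-attribute step followed by pattern matching can be illustrated with a minimal sketch. The abstract does not say which attributes or which matching rule the authors use, so everything below is an assumption made for illustration: the attributes are ink densities over a 2 x 2 grid of zones, the matching is plain nearest-template search by Euclidean distance (not the authors' modified form), and the names `zone_densities`, `classify` and the toy glyphs are hypothetical.

```python
from math import dist

def zone_densities(pixels, rows=2, cols=2):
    """Reduce a binary pixel matrix (1 = ink) to ink densities over a rows x cols grid.

    Assumed attribute set for illustration; the paper's attributes are not specified.
    """
    h, w = len(pixels), len(pixels[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * h // rows, (r + 1) * h // rows
            c0, c1 = c * w // cols, (c + 1) * w // cols
            cell = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            features.append(sum(cell) / len(cell))
    return features

def classify(pixels, templates):
    """Return the label of the template whose attribute vector is nearest."""
    feats = zone_densities(pixels)
    return min(templates, key=lambda label: dist(feats, templates[label]))

# Hypothetical 4 x 4 toy glyphs standing in for character images.
VERTICAL_BAR = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
HORIZONTAL_BAR = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

templates = {
    "vertical": zone_densities(VERTICAL_BAR),
    "horizontal": zone_densities(HORIZONTAL_BAR),
}

# A vertical bar with one extra ink pixel, to show matching under noise.
noisy = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
print(classify(noisy, templates))
```

Because the sketch works only on pixel attributes and never on character codes, the same routine applies unchanged to glyphs from any script, which is the sense in which such a method can be called generic.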
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Prakash, K.B., Dorai Rangaswamy, M. (2019). Content Extraction Studies for Multilingual Unstructured Web Documents. In: Chattopadhyay, S., Roy, T., Sengupta, S., Berger-Vachon, C. (eds) Modelling and Simulation in Science, Technology and Engineering Mathematics. MS-17 2017. Advances in Intelligent Systems and Computing, vol 749. Springer, Cham. https://doi.org/10.1007/978-3-319-74808-5_58
DOI: https://doi.org/10.1007/978-3-319-74808-5_58
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74807-8
Online ISBN: 978-3-319-74808-5
eBook Packages: Engineering, Engineering (R0)