Content Extraction Studies for Multilingual Unstructured Web Documents

Prakash, Kolla Bhanu; Dorai Rangaswamy, M. A.

doi:10.1007/978-3-319-74808-5_58

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 749)

Included in the following conference series:

International Conference on Modelling and Simulation

Abstract

With the increased use of web and mobile communication, the language, form and content of present-day web documents have become increasingly complex, incorporating different kinds of media files and multilingual text. The mnemonic approach of mapping consonants to digits, and adding vowels where suitable so that words can be formed, acts as a multilingual generator and, if done properly, can generate a universal language. For example, in Telugu all words end with vowels, and in Tamil each base character is spelt the same way irrespective of how it is joined to other characters. Extracting content and information so that it reaches the web surfer easily has become important. On this basis, a generic method is developed starting from the basic pixel level so that it can be applied effectively. The pixel matrix is converted into selected attributes, and a modified form of pattern matching is then used to assess the content. Because the development is generic, results are given for character-level and word-level performance, with four different languages used in the data.
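The abstract does not say which attributes are selected or how the pattern matching is modified, so the following is only a minimal sketch of the general idea, under stated assumptions. It reduces a binary pixel matrix to a small attribute vector and picks the closest stored template. The attribute choice (ink densities), the squared Euclidean distance, the function names `extract_attributes` and `match`, and the toy 3x3 templates are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A glyph is a binary pixel matrix: 1 = ink, 0 = background.
Glyph = List[List[int]]


def extract_attributes(glyph: Glyph) -> Tuple[float, ...]:
    """Reduce a pixel matrix to an attribute vector.

    Hypothetical attribute choice: overall ink density, followed by the
    ink density of each row and then of each column.
    """
    rows = len(glyph)
    cols = len(glyph[0])
    total = sum(sum(row) for row in glyph) / (rows * cols)
    row_density = [sum(row) / cols for row in glyph]
    col_density = [sum(glyph[i][j] for i in range(rows)) / rows
                   for j in range(cols)]
    return (total, *row_density, *col_density)


def match(glyph: Glyph, templates: Dict[str, Glyph]) -> str:
    """Return the label of the template with the closest attribute vector.

    Plain nearest-template matching with squared Euclidean distance,
    standing in for the unspecified "modified" pattern matching.
    """
    target = extract_attributes(glyph)

    def distance(label: str) -> float:
        reference = extract_attributes(templates[label])
        return sum((a - b) ** 2 for a, b in zip(target, reference))

    return min(templates, key=distance)


# Toy templates, language-neutral shapes chosen only for illustration.
TEMPLATES: Dict[str, Glyph] = {
    "vertical_bar": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal_bar": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
    "corner": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# A vertical bar with one stray pixel in the bottom-right corner.
noisy = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
print(match(noisy, TEMPLATES))
```

Because the matching works on attributes derived from pixels rather than on character codes, a scheme of this kind is script-agnostic in principle: the same functions could be applied to glyph templates from any language, which is consistent with the generic, pixel-level framing of the abstract.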

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur, India
Kolla Bhanu Prakash
St.Peters University, Avadi, Chennai, India
M. A. Dorai Rangaswamy

Corresponding author

Correspondence to Kolla Bhanu Prakash.

Editor information

Editors and Affiliations

Department of Electrical Engineering, Ghani Khan Choudhury Institute of Engineering and Technology, Malda, West Bengal, India
Surajit Chattopadhyay
Department of Electrical Engineering, MCKV Institute of Engineering, Howrah, West Bengal, India
Tamal Roy
University of Calcutta, Kolkata, West Bengal, India
Samarjit Sengupta
University of Lyon, Lyon, France
Christian Berger-Vachon

About this paper

Cite this paper

Prakash, K.B., Dorai Rangaswamy, M. (2019). Content Extraction Studies for Multilingual Unstructured Web Documents. In: Chattopadhyay, S., Roy, T., Sengupta, S., Berger-Vachon, C. (eds) Modelling and Simulation in Science, Technology and Engineering Mathematics. MS-17 2017. Advances in Intelligent Systems and Computing, vol 749. Springer, Cham. https://doi.org/10.1007/978-3-319-74808-5_58

DOI: https://doi.org/10.1007/978-3-319-74808-5_58
Published: 25 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74807-8
Online ISBN: 978-3-319-74808-5
eBook Packages: Engineering, Engineering (R0)
