Content Extraction Studies for Multilingual Unstructured Web Documents

  • Conference paper
Modelling and Simulation in Science, Technology and Engineering Mathematics (MS-17 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 749)

Abstract

With the increased use of web and mobile communication, the language, form, and content of present-day web documents have become increasingly complex, incorporating different kinds of media files and multilingual text. The mnemonic approach of mapping consonants to digits, and adding vowels suitably so that words can be formed, acts as a multilingual generator and, if done properly, can generate a universal language. Thus, in Telugu all words end with vowels, and in Tamil each base character is spelt the same, irrespective of its joints. Extracting content and information so that it reaches the web surfer easily has become important. On this basis, a generic method is developed starting from the basic pixel level so that it can be applied effectively. The pixel matrix is converted into selected attributes, and a modified form of pattern matching is then used to assess the content. Because the development is generic, results are given for character-level and word-level performance, with four different languages used in the data.
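
The abstract does not say which attributes are selected or how the pattern matching is modified, so the following is only a minimal sketch of the general idea of turning a pixel matrix into an attribute vector and matching it against labelled templates. The row and column ink densities, the squared-distance measure, and the names `attributes`, `classify`, and `TEMPLATES` are all illustrative assumptions, not the authors' method.

```python
from typing import Dict, List, Sequence

# A glyph is a binary pixel matrix: 1 = ink, 0 = background.
Glyph = Sequence[Sequence[int]]


def attributes(glyph: Glyph) -> List[float]:
    """Assumed attribute vector: ink fraction of each row, then of each column."""
    rows, cols = len(glyph), len(glyph[0])
    row_density = [sum(row) / cols for row in glyph]
    col_density = [sum(glyph[r][c] for r in range(rows)) / rows for c in range(cols)]
    return row_density + col_density


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(glyph: Glyph, templates: Dict[str, Glyph]) -> str:
    """Return the label of the template whose attributes are closest to the glyph's."""
    target = attributes(glyph)
    return min(templates, key=lambda label: distance(target, attributes(templates[label])))


# Hypothetical 3x3 templates, used only for illustration.
TEMPLATES: Dict[str, Glyph] = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# An "L" with one pixel missing from its bottom stroke.
noisy_l: Glyph = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
best_label = classify(noisy_l, TEMPLATES)
```

Because matching operates on attributes rather than on characters of a particular script, the same routine could in principle be reused across languages by swapping the template set, which is the sense in which such an approach is generic.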

Corresponding author

Correspondence to Kolla Bhanu Prakash.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Prakash, K.B., Dorai Rangaswamy, M. (2019). Content Extraction Studies for Multilingual Unstructured Web Documents. In: Chattopadhyay, S., Roy, T., Sengupta, S., Berger-Vachon, C. (eds) Modelling and Simulation in Science, Technology and Engineering Mathematics. MS-17 2017. Advances in Intelligent Systems and Computing, vol 749. Springer, Cham. https://doi.org/10.1007/978-3-319-74808-5_58

  • DOI: https://doi.org/10.1007/978-3-319-74808-5_58

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-74807-8

  • Online ISBN: 978-3-319-74808-5

  • eBook Packages: Engineering, Engineering (R0)
