Data Engineered Content Extraction Studies for Indian Web Pages

Kolla, Bhanu Prakash; Raman, Arun Raja

doi:10.1007/978-981-10-8055-5_45

Conference paper
First Online: 04 July 2018

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 711)

Abstract

Recent innovations in the Internet and cellular communications have opened many interesting areas of social and research activity, and one of the basic driving forces behind them is the Web page, which carries data in different forms. The data can be mobile or Internet based, online or offline, and typically range in size from kilobytes to terabytes. In the Indian context, they can be computer-generated, printed, or archived data in different languages and dialects. The present study applies engineering principles to such data so that a smart set can be used to generate content in a short time, making further development easier. After a brief overview of the complexities of Indian Web pages and current approaches in data mining, a basic pixel-based approach is developed, along with data reduction and abstraction, for use with classification approaches to content extraction. For data reduction, an engineering approach based on organizing and adapting the data into suitable inputs for classification is highlighted, and a case study is presented for analysis.

Author information

Authors and Affiliations

Department of Computer Science Engineering, Koneru Lakshmaiah Education Foundation, Green Fields, Vaddeswaram, Guntur, 522502, Andhra Pradesh, India
Bhanu Prakash Kolla
Department of Structural Engineering, IIT Madras, Chennai, 600036, Tamil Nadu, India
Arun Raja Raman

Corresponding author

Correspondence to Bhanu Prakash Kolla.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering & Information Technology, Veer Surendra Sai University of Technology, Sambalpur, Odisha, India
Himansu Sekhar Behera
Department of Computer Science and Engineering, Sri Sivani College of Engineering (SSCE), Srikakulam, Andhra Pradesh, India
Janmenjoy Nayak
Department of Computer Application, Veer Surendra Sai University of Technology, Sambalpur, Odisha, India
Bighnaraj Naik
Machine Intelligence Research (MIR) Lab, Auburn, WA, USA
Ajith Abraham

About this paper

Cite this paper

Kolla, B.P., Raman, A.R. (2019). Data Engineered Content Extraction Studies for Indian Web Pages. In: Behera, H., Nayak, J., Naik, B., Abraham, A. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 711. Springer, Singapore. https://doi.org/10.1007/978-981-10-8055-5_45

DOI: https://doi.org/10.1007/978-981-10-8055-5_45
Published: 04 July 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8054-8
Online ISBN: 978-981-10-8055-5
eBook Packages: Intelligent Technologies and Robotics (R0)
