Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx)

AL-Ghuribi, Sumaia Mohammed; Alshomrani, Saleh

doi:10.1007/s13369-014-1530-8

Research Article - Computer Engineering and Computer Science
Published: 03 January 2015

Volume 40, pages 501–518 (2015)

Arabian Journal for Science and Engineering

Sumaia Mohammed AL-Ghuribi¹ &
Saleh Alshomrani¹

Abstract

Extracting useful Web content is a major step in data mining. Web content extraction serves as a preprocessing stage for many systems, such as crawlers and indexers, and the extracted content is also needed by end users, especially blind and visually impaired users. The process aims to extract useful and meaningful data from Webpages, where it is surrounded by clutter such as advertisements and navigation menus. Many extraction algorithms are designed for English and are less efficient and less accurate on Arabic text. This paper presents BiLEx, a bi-language mining algorithm for extracting Web content. It extracts useful content from Arabic and English Webpages with approximately the same efficiency and accuracy. To test the performance and efficiency of the proposed algorithm, an experiment was run on 600 Webpages chosen randomly from 30 different Websites. The results show that BiLEx achieves high precision, recall, and F1-measure for both Arabic and English Webpages.
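
The abstract reports precision, recall, and F1-measure but does not say how they are computed. The sketch below shows one common token-level formulation for content extraction. It is an assumption for illustration, not the authors' evaluation procedure. The function name `extraction_scores` and the whitespace tokenization are hypothetical choices. Whitespace splitting applies equally to Arabic and English text.

```python
from collections import Counter


def extraction_scores(extracted: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace-separated tokens.

    Assumed formulation: tokens are compared as multisets, so a repeated
    word only counts as often as it appears in both texts.
    """
    extracted_tokens = Counter(extracted.split())
    reference_tokens = Counter(reference.split())

    # Multiset intersection: tokens shared by the extractor output and
    # the hand-labelled reference content.
    overlap = sum((extracted_tokens & reference_tokens).values())
    extracted_total = sum(extracted_tokens.values())
    reference_total = sum(reference_tokens.values())

    precision = overlap / extracted_total if extracted_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The extractor kept two menu words ("Home", "News") plus the full article text.
precision, recall, f1 = extraction_scores(
    "Home News the main article text",
    "the main article text",
)
```

In this example the output contains all four reference tokens plus two clutter tokens, so recall is perfect while precision is penalized for the leftover navigation words.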

Author information

Authors and Affiliations

Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Sumaia Mohammed AL-Ghuribi & Saleh Alshomrani

Corresponding author

Correspondence to Sumaia Mohammed AL-Ghuribi.

About this article

Cite this article

AL-Ghuribi, S.M., Alshomrani, S. Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx). Arab J Sci Eng 40, 501–518 (2015). https://doi.org/10.1007/s13369-014-1530-8

Received: 03 March 2014
Accepted: 31 October 2014
Published: 03 January 2015
Issue Date: February 2015
DOI: https://doi.org/10.1007/s13369-014-1530-8
