Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx)

  • Research Article - Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

Extracting useful Web content is a major step in data mining. Web content extraction serves as a preprocessing step for many systems, such as crawlers and indexers. The extracted content is also needed by end users, especially blind and visually impaired users. Extraction aims to isolate useful and meaningful data from Webpages that are cluttered with elements such as advertisements and navigation menus. Many extraction algorithms are designed for the English language and are less efficient and less accurate on Arabic. In this paper, a bi-language mining algorithm for extracting Web content, called BiLEx, is presented. It extracts useful Web content from Arabic and English Webpages with approximately the same level of efficiency and accuracy. An experiment on 600 Webpages, chosen randomly from 30 different Websites, tests the performance and efficiency of the proposed algorithm. The results show that the BiLEx algorithm achieves high precision, recall, and F1-measure for both Arabic and English Webpages.
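
The abstract does not say how precision, recall, and F1-measure are computed for an extracted page. One common convention in content-extraction evaluation, assumed here purely for illustration, is to compare the extracted text with a hand-labeled gold-standard text as bags of whitespace-separated words. The function name `word_level_prf1` and the sample strings are hypothetical and are not taken from the paper. Whitespace tokenization is used because it applies to both English and Arabic text.

```python
from collections import Counter
from typing import Tuple


def word_level_prf1(extracted: str, gold: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from word-level bag overlap.

    Assumption: both texts are tokenized on whitespace and compared as
    multisets, so a repeated word counts once per occurrence.
    """
    extracted_words = Counter(extracted.split())
    gold_words = Counter(gold.split())

    # Multiset intersection keeps the minimum count of each shared word.
    overlap = sum((extracted_words & gold_words).values())

    n_extracted = sum(extracted_words.values())
    n_gold = sum(gold_words.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical English example: the extractor kept two clutter words.
p_en, r_en, f_en = word_level_prf1(
    "breaking news story text subscribe now",
    "breaking news story text continues",
)

# Hypothetical Arabic example: the extractor kept one extra word.
p_ar, r_ar, f_ar = word_level_prf1("خبر عاجل اليوم", "خبر عاجل")
```

Under this convention, precision falls when clutter such as menu or advertisement text survives extraction, and recall falls when part of the main content is discarded.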

Author information

Correspondence to Sumaia Mohammed AL-Ghuribi.

Cite this article

AL-Ghuribi, S.M., Alshomrani, S. Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx). Arab J Sci Eng 40, 501–518 (2015). https://doi.org/10.1007/s13369-014-1530-8
