Abstract
Extracting useful Web content is a major step in data mining. Web content extraction serves as a preprocessing stage for many systems, such as crawlers and indexers, and the extracted content is also needed by end users, especially blind and visually impaired users. The process aims to extract useful and meaningful data from Webpages in which that data is surrounded by clutter such as advertisements and navigation menus. Many extraction algorithms are designed for English and perform less efficiently and less accurately on Arabic. In this paper, a bi-languages mining algorithm for extracting Web content, called BiLEx, is presented. It extracts useful Web content from Arabic and English Webpages with approximately the same level of efficiency and accuracy. To test the performance and efficiency of the proposed algorithm, an experiment is conducted on 600 Webpages chosen randomly from 30 different Websites. The results show that BiLEx achieves high precision, recall, and F1-measure for both Arabic and English Webpages.
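The abstract reports precision, recall, and F1-measure but does not say how they are computed. As a rough illustration only, the sketch below scores extracted text against a hand-labelled reference at the token level. The function name `extraction_scores`, the whitespace tokenization, and the multiset-overlap definition are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def extraction_scores(extracted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 of extracted text vs. a reference.

    Hypothetical helper for illustration; the paper's own evaluation
    procedure may differ.
    """
    # Whitespace tokenization applies equally to Arabic and English text.
    extracted_tokens = Counter(extracted.split())
    reference_tokens = Counter(reference.split())

    # Multiset intersection: a token counts as shared only as many times
    # as it occurs in both texts.
    overlap = sum((extracted_tokens & reference_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_reference = sum(reference_tokens.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # The extractor kept some clutter ("subscribe now") and missed part
    # of the real content ("from the capital").
    p, r, f = extraction_scores(
        "breaking news today subscribe now",
        "breaking news today from the capital",
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this definition, leftover clutter such as menu or advertisement text lowers precision, while missing main content lowers recall; F1 is their harmonic mean.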
Cite this article
AL-Ghuribi, S.M., Alshomrani, S. Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx). Arab J Sci Eng 40, 501–518 (2015). https://doi.org/10.1007/s13369-014-1530-8
DOI: https://doi.org/10.1007/s13369-014-1530-8