A semantic based Web page classification strategy using multi-layered domain ontology

Abstract

The World Wide Web is growing continuously, and the volume of Web content is expected to increase substantially over the next few years. Hence, there is a strong need for algorithms that can classify Web pages accurately. Automatic Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure. Recently, several techniques have arisen from combinations of artificial intelligence and statistical approaches. However, finding an optimal classification technique for Web pages is not a simple matter. This paper introduces a novel strategy for vertical Web page classification, called Classification using Multi-layered Domain Ontology (CMDO). It employs several Web mining techniques and depends mainly on a proposed multi-layered domain ontology. To improve classification accuracy, CMDO employs a distiller that rejects pages related to other domains. CMDO also employs a novel classification technique, called Graph Based Classification (GBC). The proposed GBC has features that other techniques lack, such as outlier rejection and pruning. Experimental results show that CMDO outperforms recent techniques, achieving better precision, recall, and classification accuracy.
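The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of precision, recall, and accuracy; the function name `binary_metrics`, the class labels, and the six sample pages are hypothetical and are not taken from the paper's experiments.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, accuracy) for one target class.

    Standard definitions, not the paper's evaluation code:
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      accuracy  = correctly labeled pages / all pages
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical ground-truth and predicted domains for six Web pages.
y_true = ["sports", "sports", "health", "sports", "health", "other"]
y_pred = ["sports", "health", "health", "sports", "sports", "other"]

precision, recall, accuracy = binary_metrics(y_true, y_pred, "sports")
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

In this toy example, a page labeled "other" stands in for a page from an unrelated domain; how such pages are actually handled by the distiller is described in the full paper, not here.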

Author information

Corresponding author

Correspondence to Arwa E. Abulwafa.

About this article

Cite this article

Saleh, A.I., Al Rahmawy, M.F. & Abulwafa, A.E. A semantic based Web page classification strategy using multi-layered domain ontology. World Wide Web 20, 939–993 (2017). https://doi.org/10.1007/s11280-016-0415-z

  • DOI: https://doi.org/10.1007/s11280-016-0415-z

Keywords

  • Classification
  • Ontology
  • SVM
  • Naïve Bayes
  • KNN
  • Association rules