
A semantic based Web page classification strategy using multi-layered domain ontology


Abstract

The World Wide Web is growing continuously, and the volume of Web content is expected to increase substantially over the next few years. Hence, there is a strong need for algorithms that can classify Web pages accurately. Automatic Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure. Recently, several techniques have arisen from combinations of artificial intelligence and statistical approaches. However, finding an optimal classification technique for Web pages is not a simple matter. This paper introduces a novel strategy for vertical Web page classification called Classification using Multi-layered Domain Ontology (CMDO). It employs several Web mining techniques and relies mainly on a proposed multi-layered domain ontology. To improve classification accuracy, CMDO employs a distiller to reject pages related to other domains. CMDO also employs a novel classification technique called Graph Based Classification (GBC). The proposed GBC has features that other techniques lack, such as outlier rejection and pruning. Experimental results show that CMDO outperforms recent techniques, achieving better precision, recall, and classification accuracy.
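The abstract does not specify how the distiller works, so the following is only a minimal sketch of the general idea of rejecting off-domain pages with a layered term ontology. The ontology contents, the per-layer weights, the scoring formula, the threshold, and the names `ONTOLOGY`, `LAYER_WEIGHTS`, `domain_score`, and `distill` are all assumptions made for illustration; none of them is taken from the paper.

```python
import re

# Hypothetical two-layer ontology for a "sports" domain (layer -> terms).
# The terms and the weights below are illustrative assumptions only.
ONTOLOGY = {
    1: {"football", "tennis", "basketball"},
    2: {"goal", "referee", "racket", "league", "score"},
}
LAYER_WEIGHTS = {1: 2.0, 2: 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_score(text):
    """Weighted count of ontology terms, normalized by page length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = 0.0
    for tok in tokens:
        for layer, terms in ONTOLOGY.items():
            if tok in terms:
                total += LAYER_WEIGHTS[layer]
    return total / len(tokens)


def distill(pages, threshold=0.1):
    """Keep only pages whose domain score reaches the (assumed) threshold."""
    return [p for p in pages if domain_score(p) >= threshold]


if __name__ == "__main__":
    pages = [
        "The referee awarded a goal in the football league",
        "Stock prices fell sharply on Monday",
    ]
    for page in distill(pages):
        print(page)
```

In this sketch the second page contains no ontology terms, so it scores zero and is filtered out before any classifier would see it. A real distiller would presumably also use HTML structure and semantic relations between concepts rather than exact term matches.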



Author information

Correspondence to Arwa E. Abulwafa.


About this article


Cite this article

Saleh, A.I., Al Rahmawy, M.F. & Abulwafa, A.E. A semantic based Web page classification strategy using multi-layered domain ontology. World Wide Web 20, 939–993 (2017). https://doi.org/10.1007/s11280-016-0415-z


Keywords

  • Classification
  • Ontology
  • SVM
  • Naïve Bayes
  • KNN
  • Association rules