Automatic classification of academic web page types

Scientometrics

Abstract

Counts of hyperlinks between websites can be unreliable for webometrics studies, so researchers have sought alternative counting methods or have tried to identify the reasons why links are created. Manually classifying individual links is infeasible for large webometrics studies, so a more efficient way of identifying the reasons for link creation is needed to fully harness the potential of hyperlinks for webometrics research. This paper describes a machine learning method that automatically classifies hyperlink source and target page types in university websites. The method achieved 78 % accuracy when classifying web page types and up to 74 % accuracy when predicting link target page types from link source page characteristics.

Acknowledgments

This paper is an extension of a paper previously presented at the International Society for Scientometrics and Informetrics (ISSI) conference.

Author information

Corresponding author

Correspondence to Patrick Kenekayoro.

Cite this article

Kenekayoro, P., Buckley, K. & Thelwall, M. Automatic classification of academic web page types. Scientometrics 101, 1015–1026 (2014). https://doi.org/10.1007/s11192-014-1292-9

  • DOI: https://doi.org/10.1007/s11192-014-1292-9
