Abstract
Counts of hyperlinks between websites can be unreliable for webometrics studies, so researchers have sought alternative counting methods or tried to identify the reasons why links are created. Manual classification of individual links is infeasible for large webometrics studies, so a more efficient approach to identifying the reasons for link creation is needed to fully harness the potential of hyperlinks for webometrics research. This paper describes a machine learning method to automatically classify hyperlink source and target page types in university websites. The method achieved 78% accuracy in classifying web page types and up to 74% accuracy in predicting link target page types from link source page characteristics.
Acknowledgments
This paper is an extension of a paper previously presented at the International Society for Scientometrics and Informetrics (ISSI) conference.
About this article
Cite this article
Kenekayoro, P., Buckley, K. & Thelwall, M. Automatic classification of academic web page types. Scientometrics 101, 1015–1026 (2014). https://doi.org/10.1007/s11192-014-1292-9