Abstract
The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection comprised 1000 engineering Web pages to which Engineering Information classes had been manually assigned. The significance indicators were derived using several methods: (total and partial) precision and recall, semantic distance, and multiple regression. The results showed that the best performance requires including all the elements in the classification process. The exact way of combining the significance indicators turned out not to be overly important: measured by F1, the best combination of significance indicators performed no more than 3% better than the baseline.
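As a rough illustration of the idea, the sketch below scores a candidate class as a weighted sum of term matches per page element and computes F1 for one page. It is a minimal sketch under stated assumptions, not the study's implementation: the weight values, the element keys, and the function names `class_score` and `f1` are all hypothetical, and the abstract does not specify the actual scoring formula.

```python
from typing import Dict, Set

# Hypothetical significance indicators per page element.
# These values are illustrative only and are NOT taken from the study.
WEIGHTS: Dict[str, float] = {
    "metadata": 2.0,
    "title": 3.0,
    "headings": 2.0,
    "text": 1.0,
}


def class_score(matches: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of the class-term matches found in each page element."""
    return sum(weights.get(element, 0.0) * count
               for element, count in matches.items())


def f1(assigned: Set[str], correct: Set[str]) -> float:
    """F1 for one page: harmonic mean of precision and recall."""
    hits = len(assigned & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(assigned)
    recall = hits / len(correct)
    return 2 * precision * recall / (precision + recall)


# One match in the title and four in the main text for some class.
score = class_score({"title": 1, "text": 4}, WEIGHTS)

# Two classes assigned automatically, one of which is correct.
page_f1 = f1({"class_a", "class_b"}, {"class_a"})
```

Setting all weights to the same value gives an unweighted count of matches, which is one plausible reading of a baseline against which weighted combinations could be compared.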
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Golub, K., Ardö, A. (2005). Importance of HTML Structural Elements and Metadata in Automated Subject Classification. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_33
DOI: https://doi.org/10.1007/11551362_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28767-4
Online ISBN: 978-3-540-31931-3
eBook Packages: Computer Science, Computer Science (R0)