Importance of HTML Structural Elements and Metadata in Automated Subject Classification

Golub, Koraljka; Ardö, Anders

doi:10.1007/11551362_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3652)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.
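
As a rough illustration of the idea described in the abstract, the sketch below combines per-element match scores using significance indicators (weights) and evaluates assigned classes with the F1 measure. It is not the authors' implementation: the weight values, the function names `weighted_class_score` and `f1_measure`, and the simple weighted-sum combination are all assumptions made for this example.

```python
from typing import Dict, Set

# Hypothetical significance indicators for the four page elements named in the
# abstract. The values are illustrative only and are not taken from the paper.
WEIGHTS: Dict[str, float] = {
    "metadata": 2.0,
    "title": 3.0,
    "headings": 2.0,
    "text": 1.0,
}


def weighted_class_score(element_scores: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Sum the per-element match scores for one class, scaled by element weight.

    Elements without a weight contribute nothing to the score.
    """
    return sum(weights.get(element, 0.0) * score
               for element, score in element_scores.items())


def f1_measure(assigned: Set[str], correct: Set[str]) -> float:
    """Harmonic mean of precision and recall for one page's assigned classes."""
    hits = len(assigned & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(assigned)
    recall = hits / len(correct)
    return 2 * precision * recall / (precision + recall)


# Usage: a class matched once in the title and four times in the main text.
score = weighted_class_score({"title": 1.0, "text": 4.0}, WEIGHTS)

# Usage: one of two automatically assigned classes agrees with the manual ones.
f1 = f1_measure({"class_a", "class_b"}, {"class_a", "class_c"})
```

Under this kind of scheme, changing the weights changes which classes clear an assignment threshold, and F1 against the manually assigned classes gives a single figure for comparing weight combinations with a baseline.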

Author information

Authors and Affiliations

Knowledge Discovery and Digital Library Research Group (KnowLib), Digital Information Systems, Department of Information Technology, Lund University, P.O. Box 118, 22 100, Lund, Sweden
Koraljka Golub & Anders Ardö

Editor information

Editors and Affiliations

Vienna University of Technology, Vienna, Austria
Andreas Rauber
Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (MUSIC/TUC) Chania, 73100, Crete, Greece
Stavros Christodoulakis
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa

About this paper

Cite this paper

Golub, K., Ardö, A. (2005). Importance of HTML Structural Elements and Metadata in Automated Subject Classification. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_33

DOI: https://doi.org/10.1007/11551362_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28767-4
Online ISBN: 978-3-540-31931-3
eBook Packages: Computer Science, Computer Science (R0)
