Importance of HTML Structural Elements and Metadata in Automated Subject Classification

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3652)

Abstract

The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection comprised 1000 engineering Web pages to which Engineering Information classes had been manually assigned. The significance indicators were derived using several methods: (total and partial) precision and recall, semantic distance, and multiple regression. The results showed that all the elements have to be included in the classification process for best results. The exact way of combining the significance indicators turned out not to be overly important: measured by F1, the best combination of significance indicators performed no more than 3% better than the baseline.
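As a rough illustration of the idea in the abstract, the sketch below weights term matches by the page element in which they occur and evaluates the assigned classes with the F1 measure. It is not the authors' algorithm: the weights in `WEIGHTS`, the function names `class_scores` and `f1`, the whitespace tokenization, and the sample terms are all assumptions made for this example, and the actual significance indicators in the paper were derived from precision and recall, semantic distance, and multiple regression.

```python
# Hypothetical significance indicators per page element (illustrative values only,
# not the values derived in the paper).
WEIGHTS = {"metadata": 2.0, "title": 4.0, "headings": 2.0, "text": 1.0}


def class_scores(page, class_terms, weights=WEIGHTS):
    """Score each class as the weighted sum of its term matches per element.

    page:        dict mapping element name -> element text
    class_terms: dict mapping class name -> list of single-word terms
    """
    scores = {}
    for cls, terms in class_terms.items():
        total = 0.0
        for element, text in page.items():
            words = text.lower().split()  # naive tokenization (assumption)
            matches = sum(words.count(term) for term in terms)
            total += weights.get(element, 0.0) * matches
        scores[cls] = total
    return scores


def f1(assigned, correct):
    """F1 (harmonic mean of precision and recall) for two sets of classes."""
    true_positives = len(assigned & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    page = {
        "metadata": "civil engineering bridge",
        "title": "Bridge design",
        "headings": "Steel bridge loads",
        "text": "the bridge uses steel cables and a concrete deck",
    }
    class_terms = {"bridges": ["bridge"], "materials": ["steel", "concrete"]}
    print(class_scores(page, class_terms))
    print(f1({"bridges", "materials"}, {"bridges"}))
```

Setting every weight to the same value gives an unweighted baseline; comparing F1 between that baseline and a weighted run is one simple way to see how much the choice of weights matters.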

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Golub, K., Ardö, A. (2005). Importance of HTML Structural Elements and Metadata in Automated Subject Classification. In: Rauber, A., Christodoulakis, S., Tjoa, A.M. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2005. Lecture Notes in Computer Science, vol 3652. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551362_33

  • DOI: https://doi.org/10.1007/11551362_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28767-4

  • Online ISBN: 978-3-540-31931-3

  • eBook Packages: Computer Science, Computer Science (R0)
