Web metadata extraction and semantic indexing for learning objects extraction


Secondary-school teachers constantly need to find relevant digital resources to support specific didactic goals. Unfortunately, generic search engines do not let them identify learning objects among semi-structured candidate educational resources, much less retrieve them by teaching goal. This article describes a multi-strategy approach to semantically guided extraction, indexing, and search of educational metadata; it combines machine learning, concept analysis, and corpus-based natural language processing techniques. The overall model was validated by comparing the extracted metadata against standard search methods and heuristic-based techniques in terms of Classification Accuracy and Metadata Quality (as evaluated by actual teachers). The results are promising and show that semantically guided metadata extraction can effectively enhance access to, and use of, educational digital material.

Author information

Corresponding author

Correspondence to John Atkinson.

Additional information

This research was supported by FONDECYT (Chile) under grant number 1130035 and by project grant Basal FB0821 CCTVal (Chile).

About this article

Cite this article

Atkinson, J., Gonzalez, A., Munoz, M. et al. Web metadata extraction and semantic indexing for learning objects extraction. Appl Intell 41, 649–664 (2014).
