Abstract
Secondary-school teachers constantly need to find relevant digital resources to support specific didactic goals. Unfortunately, generic search engines do not allow them to identify learning objects among semi-structured candidate educational resources, much less retrieve them by teaching goal. This article describes a multi-strategy approach for semantically guided extraction, indexing, and search of educational metadata; it combines machine learning, concept analysis, and corpus-based natural language processing techniques. The overall model was validated by comparing its extracted metadata against standard search methods and heuristic-based techniques in terms of classification accuracy and metadata quality (as evaluated by actual teachers). The results are promising and indicate that semantically guided metadata extraction can effectively enhance access to and use of educational digital material.
Additional information
This research was supported by FONDECYT (Chile) under grant number 1130035 and by project grant Basal FB0821 CCTVal (Chile).
Cite this article
Atkinson, J., Gonzalez, A., Munoz, M. et al. Web metadata extraction and semantic indexing for learning objects extraction. Appl Intell 41, 649–664 (2014). https://doi.org/10.1007/s10489-014-0557-6
DOI: https://doi.org/10.1007/s10489-014-0557-6