Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms

  • Arash Joorabchi
  • Abdulhussain E. Mahdi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7603)


Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, scientific documents that are manually annotated with keyphrases are in the minority. This paper describes a machine learning-based automatic keyphrase annotation method for scientific documents, which utilizes Wikipedia as a thesaurus for candidate selection from documents’ content and deploys genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Reported experimental results show that the performance of our method, evaluated in terms of inter-consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised methods.
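The abstract does not say how inter-consistency is computed. As a minimal sketch, assuming Rolling's inter-indexer consistency measure (twice the number of shared keyphrases divided by the total number assigned by both annotators), the agreement between two keyphrase sets could be computed as below. The function name, the case-insensitive matching, and the sample keyphrases are illustrative assumptions, not details taken from the paper.

```python
def rolling_consistency(keyphrases_a, keyphrases_b):
    """Rolling's inter-indexer consistency: 2C / (A + B).

    A and B are the numbers of distinct keyphrases assigned by each
    annotator, and C is the number they have in common. Matching here is
    case-insensitive on whitespace-trimmed strings (an assumption).
    """
    a = {k.strip().lower() for k in keyphrases_a}
    b = {k.strip().lower() for k in keyphrases_b}
    if not a and not b:
        # The measure is undefined when neither annotator assigned
        # anything; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical example: a human annotator versus an automatic one.
human = ["Genetic algorithms", "Wikipedia", "text mining"]
machine = ["wikipedia", "keyphrase indexing", "text mining", "metadata"]
score = rolling_consistency(human, machine)  # 2 * 2 / (3 + 4)
```

The score ranges from 0 (no shared keyphrases) to 1 (identical sets). Averaging it over all annotator pairs for a document gives one plausible way to compare an automatic method against human-to-human agreement.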


text mining, scientific digital libraries, subject metadata, keyphrase annotation, keyphrase indexing, Wikipedia, genetic algorithms





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Arash Joorabchi (1)
  • Abdulhussain E. Mahdi (1)
  1. Department of Electronic and Computer Engineering, University of Limerick, Ireland
