
Discovering taxonomies in Wikipedia by means of grammatical evolution


This work applies grammatical evolution to identify taxonomic hierarchies of concepts from Wikipedia. Each Wikipedia article covers a topic and is connected to related topics by hyperlinks. Hierarchical taxonomies, and their generalization to ontologies, are a valuable resource for many applications because they enable semantic search and reasoning. The automatic identification of taxonomies composed of concepts associated with linked Wikipedia pages has therefore attracted much attention. We have developed a system that arranges a set of Wikipedia concepts into a taxonomy. The technique is based on the relationships among a set of features extracted from the contents of the Wikipedia pages. A grammatical evolution algorithm discovers the best way of combining these features into an explicit function. Each candidate function is evaluated by applying a genetic algorithm to approximate the optimal taxonomy that the function can provide for a number of training cases. Fitness is computed as the average, over the set of training cases, of the precision obtained by comparing the taxonomy provided by the evaluated function with the reference taxonomy. Experimental results show that the proposal yields functions that find high-quality taxonomies.
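The two mechanisms named in the abstract can be illustrated with a minimal sketch. It assumes the standard grammatical evolution mapping (each codon selects a production by taking it modulo the number of choices) and measures precision over parent-child edges. The grammar, the feature names link_overlap and text_similarity, the star_taxonomy stand-in for the inner genetic algorithm, and the toy training cases are all hypothetical; none of them is taken from the paper.

```python
# Hypothetical grammar: the non-terminals, operators, and feature names are
# illustrative assumptions, not the grammar used in the paper.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["-"], ["*"]],
    "<var>": [["link_overlap"], ["text_similarity"]],
}


def decode(codons, start="<expr>", max_wraps=2):
    """Map an integer genotype to an expression string.

    The leftmost non-terminal is expanded with the production whose index
    is codon % number_of_choices. The codon list is reused (wrapped) up to
    max_wraps times; None is returned if the mapping does not terminate.
    """
    symbols = [start]
    used = 0
    limit = len(codons) * (max_wraps + 1)
    while any(s in GRAMMAR for s in symbols):
        if used >= limit:
            return None  # invalid individual
        pos = next(k for k, s in enumerate(symbols) if s in GRAMMAR)
        choices = GRAMMAR[symbols[pos]]
        chosen = choices[codons[used % len(codons)] % len(choices)]
        symbols[pos:pos + 1] = chosen
        used += 1
    return " ".join(symbols)


def precision(predicted_edges, reference_edges):
    """Fraction of predicted (parent, child) edges found in the reference."""
    predicted = set(predicted_edges)
    if not predicted:
        return 0.0
    return len(predicted & set(reference_edges)) / len(predicted)


def fitness(build_taxonomy, training_cases):
    """Average precision over (concepts, reference_edges) training cases."""
    scores = [precision(build_taxonomy(concepts), reference)
              for concepts, reference in training_cases]
    return sum(scores) / len(scores)


def star_taxonomy(concepts):
    """Toy stand-in for the inner genetic algorithm (an assumption):
    the first concept becomes the parent of every other concept."""
    root, *rest = concepts
    return {(root, child) for child in rest}


# Toy training cases, invented purely for illustration.
TRAINING_CASES = [
    (["Science", "Physics", "Chemistry"],
     {("Science", "Physics"), ("Science", "Chemistry")}),
    (["Animal", "Dog", "Puppy"],
     {("Animal", "Dog"), ("Dog", "Puppy")}),
]

if __name__ == "__main__":
    print(decode([4, 7, 2, 10, 3, 5]))  # link_overlap - text_similarity
    print(fitness(star_taxonomy, TRAINING_CASES))  # 0.75
```

In the system described above, the decoded function would drive a genetic algorithm that searches for the best taxonomy for each training case; the sketch replaces that search with a fixed rule so that the fitness computation stays visible.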









This work has been partially supported by the Spanish Ministry of Science and Innovation within the projects EXTRECM (TIN2013-46616-C2-2-R) and PROSA-MED (TIN2016-77820-C3-2-R), as well as by the Universidad Nacional de Educación a Distancia (UNED) through the FPI-UNED 2013 Grant. The authors would like to thank the referees for their valuable comments, which led to improvements in the paper.

Author information


Corresponding author

Correspondence to Lourdes Araujo.

Ethics declarations

Conflict of interest

Lourdes Araujo declares that she has no conflict of interest. Juan Martinez-Romo declares that he has no conflict of interest. Andres Duque declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.


About this article


Cite this article

Araujo, L., Martinez-Romo, J. & Duque, A. Discovering taxonomies in Wikipedia by means of grammatical evolution. Soft Comput 22, 2907–2919 (2018).



  • Grammatical evolution
  • Genetic algorithm
  • Wikipedia taxonomies
  • Information extraction