Building Chatbot Thesaurus

  • Boris Galitsky


We implement a scalable mechanism for building a thesaurus of entities that is intended to improve the relevance of a chatbot. The construction process starts from seed entities and mines the available source domains for new entities associated with them. New entities are formed by applying machine learning of syntactic parse trees (their generalization) to the search results for existing entities, which yields the commonalities between these results. The commonality expressions then become parameters of the existing entities and are turned into new entities at the next learning iteration. To match natural language expressions between the source and target domains, we use syntactic generalization, an operation that finds a set of maximal common sub-trees of the parse trees of these expressions.
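The iterative loop described above can be illustrated with a minimal sketch. It is not the chapter's implementation: real syntactic generalization aligns parse trees, whereas this sketch substitutes maximal shared word sequences between search snippets. The names `common_phrases`, `extend_thesaurus`, `STOP_WORDS` and the in-memory `TOY_INDEX` standing in for a search engine are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

# Assumption: a tiny stop list; a real system would rely on POS tags instead.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "and"}

def common_phrases(s1: str, s2: str) -> Set[Tuple[str, ...]]:
    """Maximal contiguous word sequences shared by two snippets.

    Crude stand-in for syntactic generalization: it aligns token
    sequences, not parse trees.
    """
    a, b = s1.lower().split(), s2.lower().split()
    found: Set[Tuple[str, ...]] = set()
    prev = [0] * (len(b) + 1)  # lengths of common runs ending in the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # Keep the run only if it cannot be extended to the right.
                if i == len(a) or j == len(b) or a[i] != b[j]:
                    found.add(tuple(a[i - cur[j]:i]))
        prev = cur
    return found

Path = Tuple[str, ...]

def extend_thesaurus(seeds: List[Path],
                     search: Callable[[str], List[str]],
                     iterations: int = 2) -> Dict[Path, Set[str]]:
    """Grow entity paths: commonalities of search results become new entities."""
    thesaurus: Dict[Path, Set[str]] = {p: set() for p in seeds}
    frontier = list(seeds)
    for _ in range(iterations):
        next_frontier: List[Path] = []
        for path in frontier:
            path_words = {w for entity in path for w in entity.split()}
            snippets = search(" ".join(path))
            for s1, s2 in combinations(snippets, 2):
                for phrase in common_phrases(s1, s2):
                    # Drop stop words and words already on the path.
                    words = [w for w in phrase
                             if w not in STOP_WORDS and w not in path_words]
                    if words:
                        thesaurus[path].add(" ".join(words))
            for child in thesaurus[path]:
                new_path = path + (child,)
                if new_path not in thesaurus:
                    thesaurus[new_path] = set()
                    next_frontier.append(new_path)
        frontier = next_frontier
    return thesaurus

# Hypothetical search results keyed by query (stands in for web search).
TOY_INDEX = {
    "tax deduct": [
        "how to deduct tax on business expenses this year",
        "can i deduct tax on business expenses for travel",
    ],
    "tax deduct business expenses": [
        "deduct business expenses for a home office",
        "business expenses for a home office are deductible",
    ],
}

thesaurus = extend_thesaurus([("tax", "deduct")],
                             lambda q: TOY_INDEX.get(q, []))
for path, children in thesaurus.items():
    print(" > ".join(path), "->", sorted(children))
```

In this toy run each iteration queries the current entity path, keeps what the returned snippets have in common, and appends that commonality to the path as a new entity for the next iteration. Replacing `common_phrases` with a parse-tree generalization would bring the sketch closer to the approach summarized above.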

The thesaurus and syntactic generalization are applied to relevance improvement in search and to text similarity assessment. We evaluate the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned thesaurus in the former and a noticeable contribution of a hybrid system in the latter. We also perform an industrial evaluation of thesaurus-based and syntactic generalization-based text relevance assessment and conclude that the proposed algorithm for automated thesaurus learning is suitable for integration into chatbots. The proposed algorithm is implemented as a component of the Apache OpenNLP project.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Boris Galitsky (1)
  1. Oracle (United States), San Jose, USA
