Globally Optimal Parsimoniously Lifting a Fuzzy Query Set Over a Taxonomy Tree

  • Dmitry Frolov
  • Boris Mirkin
  • Susana Nascimento
  • Trevor Fenner
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 991)


This paper presents a relatively rare case of an optimization problem in data analysis that admits a globally optimal solution by a recursive algorithm. We are concerned with finding a most specific generalization of a fuzzy set of topics assigned to the leaves of a domain taxonomy represented by a rooted tree. The idea is to “lift” the set to its “head subject” in the higher ranks of the taxonomy tree. The head subject is supposed to cover the query set “tightly”, possibly at the cost of some errors: “gaps”, “offshoots”, or both. Our method globally minimizes a penalty function that combines the numbers of head subjects, gaps, and offshoots, each with its own weight. We apply the method to a collection of 17645 research papers on Data Science published in 17 Springer journals over the past 20 years. We extract a taxonomy of Data Science (TDS) from the 2012 Computing Classification System of the international Association for Computing Machinery. We find fuzzy clusters of leaf topics over the text collection, optimally lift them to head subjects in TDS, and comment on the tendencies of current research that follow from the lifting results.
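The recursive idea can be illustrated with a minimal Python sketch. It is a simplified stand-in rather than the authors' exact algorithm, and it rests on assumptions that the abstract does not state: each head subject costs 1, each maximal gap (a covered node with no query leaf beneath it) costs `lam`, each offshoot (a query leaf left uncovered) costs `gamma` times its membership value, and only interior nodes may become head subjects. The names `lift`, `lam`, `gamma`, and the dictionary-based tree are hypothetical.

```python
from typing import Dict, List, Tuple

# A plan is (penalty, head subjects, gaps, offshoots).
Plan = Tuple[float, List[str], List[str], List[str]]


def _merge(plans: List[Plan]) -> Plan:
    """Add up penalties and concatenate the node lists of child plans."""
    return (
        sum(p[0] for p in plans),
        [h for p in plans for h in p[1]],
        [g for p in plans for g in p[2]],
        [o for p in plans for o in p[3]],
    )


def lift(tree: Dict[str, List[str]], root: str, u: Dict[str, float],
         lam: float = 0.4, gamma: float = 0.9) -> Plan:
    """Minimize heads + lam * gaps + gamma * offshoot memberships (assumed penalty).

    `tree` maps each node to its children (leaves map to an empty list);
    `u` maps query leaves to positive membership values.
    """

    def visit(t: str) -> Tuple[bool, Plan, Plan]:
        # Returns (subtree has no query leaf,
        #          best plan if t is covered by a head subject above it,
        #          best plan if no head subject lies above t).
        kids = tree.get(t, [])
        if not kids:  # leaf
            if u.get(t, 0.0) > 0.0:
                return False, (0.0, [], [], []), (gamma * u[t], [], [], [t])
            return True, (lam, [], [t], []), (0.0, [], [], [])
        results = [visit(c) for c in kids]
        if all(empty for empty, _, _ in results):
            # No query leaf below t: one maximal gap if covered, free otherwise.
            return True, (lam, [], [t], []), (0.0, [], [], [])
        covered = _merge([c for _, c, _ in results])
        free = _merge([f for _, _, f in results])
        # Option: make t a head subject and pay for the gaps it brings in.
        as_head: Plan = (1.0 + covered[0], [t], covered[2], [])
        return False, covered, as_head if as_head[0] < free[0] else free

    return visit(root)[2]


if __name__ == "__main__":
    taxonomy = {
        "root": ["A", "B"],
        "A": ["a1", "a2", "a3"],
        "B": ["b1", "b2", "b3", "b4"],
        "a1": [], "a2": [], "a3": [],
        "b1": [], "b2": [], "b3": [], "b4": [],
    }
    query = {"a1": 1.0, "a2": 1.0, "b1": 1.0}
    print(lift(taxonomy, "root", query))
```

With these illustrative weights, the sketch lifts `a1` and `a2` to the head subject `A` at the price of the gap `a3`, and leaves `b1` as an offshoot because covering it through `B` or `root` would bring in too many gaps. Each node is visited once and returns its best plan for both situations (covered from above or not), which is why the bottom-up recursion reaches a global minimum of this assumed penalty.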


Keywords: Hierarchical taxonomy; Parsimony; Generalization; Additive fuzzy cluster; Spectral clustering; Annotated suffix tree



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Dmitry Frolov (1)
  • Boris Mirkin (1, 2)
  • Susana Nascimento (3)
  • Trevor Fenner (2)

  1. Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics, Moscow, Russian Federation
  2. Department of Computer Science and Information Systems, Birkbeck University of London, London, UK
  3. Department of Computer Science and NOVA LINCS, Universidade Nova de Lisboa, Caparica, Portugal
