Learning Domain Labels Using Conceptual Fingerprints: An In-Use Case Study in the Neurology Domain

  • Zubair Afzal
  • George Tsatsaronis
  • Marius Doornenbal
  • Pascal Coupet
  • Michelle Gregory
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10024)


Modelling a science domain in order to categorize research work thematically and to enable better browsing and search can be a daunting task, especially when no specialized taxonomy or ontology exists for the domain. Elsevier, the largest academic publisher, often faces this challenge, both in supporting its journal submission system and in supplying ScienceDirect and Scopus, two of the company's flagship platforms, with sufficient metadata, such as conceptual labels that characterize the research works and can improve the user experience of browsing and searching the literature. In this paper we describe an Elsevier in-use case study of learning appropriate domain labels from a collection of 6,357 full-text articles in the neurology domain, exploring different document representations and clustering mechanisms. Besides baseline approaches to document representation (e.g., bag-of-words) and their variations (e.g., n-grams), we employ a novel in-house methodology that produces conceptual fingerprints of the research articles, starting from a general domain taxonomy such as the Medical Subject Headings (MeSH). We present a thorough empirical evaluation, using a variety of clustering mechanisms and several validity indices to evaluate the resulting clusters. Our results summarize the best practices in modelling this specific domain, and we report on the advantages and disadvantages of the clustering mechanisms and document representations that were examined, with the aim of learning appropriate conceptual labels for this domain.
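To make the baseline side of this setup concrete, the following is a minimal sketch of a bag-of-words representation extended with n-grams, together with one internal validity index for scoring a clustering. Everything in it is an assumption for illustration: the toy documents, the choice of word n-grams, cosine distance, and the silhouette coefficient are not taken from the paper, and the in-house conceptual fingerprinting method is not reproduced here.

```python
import math
from collections import Counter


def ngram_bag(text, n_max=2):
    """Bag of word n-grams (n = 1..n_max) as a term -> count mapping."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def silhouette(vectors, labels):
    """Mean silhouette coefficient of a clustering (higher is better)."""
    scores = []
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Collect distances from document i to all others, grouped by cluster.
        by_cluster = {}
        for j, (other, other_label) in enumerate(zip(vectors, labels)):
            if i != j:
                by_cluster.setdefault(other_label, []).append(
                    cosine_distance(vec, other))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton or only one cluster: 0 by convention
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy documents, not drawn from the paper's corpus.
docs = [
    "seizure onset in epilepsy patients",
    "epilepsy seizure frequency after treatment",
    "stroke recovery and motor function",
    "motor function after ischemic stroke",
]
vectors = [ngram_bag(d) for d in docs]
good_labels = [0, 0, 1, 1]  # groups documents by theme
bad_labels = [0, 1, 0, 1]   # mixes the two themes

print("thematic partition:", silhouette(vectors, good_labels))
print("mixed partition:   ", silhouette(vectors, bad_labels))
```

On these toy documents the thematic partition receives a positive score and the mixed partition a negative one, which is the kind of signal a validity index gives when comparing representations or clustering mechanisms without gold-standard labels.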


Keywords: Document labeling · Document clustering · Conceptual fingerprints · Domain taxonomy · Neurology domain · Clustering evaluation · Best practices



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Zubair Afzal (1)
  • George Tsatsaronis (1)
  • Marius Doornenbal (1)
  • Pascal Coupet (1)
  • Michelle Gregory (1)

  1. Content and Innovation Group, Operations Division, Elsevier B.V., Amsterdam, The Netherlands
