Benchmarking of a Novel POS Tagging Based Semantic Similarity Approach for Job Description Similarity Computation

  • Joydeep Mondal
  • Sarthak Ahuja
  • Kushal Mukherjee
  • Sudhanshu Shekhar Singh
  • Gyana Parija
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10843)


Most hiring analytics solutions map incoming job descriptions to a standard job framework, which requires computing a document similarity score between two job descriptions. Finding the semantic similarity between a pair of documents is a problem that has not yet been solved satisfactorily across all domains and contexts. Most document similarity methods require a large corpus of data for training the underlying models. In this paper we compare three methods of document similarity for job descriptions: topic modeling (LDA), doc2vec, and a novel part-of-speech tagging based document similarity (POSDC) calculation method. LDA and doc2vec require a large corpus of data to train, while POSDC exploits a domain-specific property of descriptive documents (such as job descriptions) that enables us to compare two documents in isolation. The POSDC method is based on an action-object-attribute representation of documents that allows meaningful comparisons. We use Stanford CoreNLP and NLTK WordNet to perform a multilevel semantic match between the actions and their corresponding objects. We use sklearn for topic modeling and gensim for doc2vec. We compare the results of these three methods on the job taxonomy of the IBM Kenexa Talent Frameworks.
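
To make the idea of comparing two documents in isolation more concrete, the following is a minimal sketch of an action-object comparison. It is not the POSDC implementation from the paper. It assumes each document has already been reduced to (action, object) pairs, and it uses a hypothetical toy synonym table in place of a WordNet-based word similarity. The names `TOY_SYNONYMS`, `toy_word_similarity`, `pair_similarity`, and `document_similarity`, the score of 0.8 for synonyms, and the rule that objects are compared only when the actions match are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (action, object), e.g. ("develop", "software")

# Hypothetical toy synonym groups. An assumption standing in for a
# lexical resource such as WordNet; not taken from the paper.
TOY_SYNONYMS = [
    {"develop", "build", "create"},
    {"manage", "lead", "supervise"},
    {"software", "application", "program"},
    {"team", "group", "staff"},
]


def toy_word_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical words, 0.8 for toy synonyms, else 0.0."""
    if a == b:
        return 1.0
    for group in TOY_SYNONYMS:
        if a in group and b in group:
            return 0.8
    return 0.0


def pair_similarity(
    p: Pair, q: Pair, word_sim: Callable[[str, str], float] = toy_word_similarity
) -> float:
    """Two-level match (an assumed rule): compare actions first,
    and score the objects only if the actions are related."""
    action_score = word_sim(p[0], q[0])
    if action_score == 0.0:
        return 0.0
    return action_score * word_sim(p[1], q[1])


def document_similarity(
    doc_a: List[Pair],
    doc_b: List[Pair],
    word_sim: Callable[[str, str], float] = toy_word_similarity,
) -> float:
    """Symmetric average of each pair's best match in the other document."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src: List[Pair], dst: List[Pair]) -> float:
        best = [max(pair_similarity(p, q, word_sim) for q in dst) for p in src]
        return sum(best) / len(src)

    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


if __name__ == "__main__":
    jd_a = [("develop", "software"), ("manage", "team")]
    jd_b = [("build", "application"), ("lead", "group")]
    print(document_similarity(jd_a, jd_b))
```

Because the score depends only on the two sets of pairs and a word-level similarity function, no training corpus is involved. Any other word similarity, such as one backed by WordNet, could be passed in through the `word_sim` parameter.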



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM Research Lab, New Delhi, India