Replicating Relevance-Ranked Synonym Discovery in a New Language and Domain

  • Andrew Yates
  • Michael Unterkalmsteiner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)

Abstract

Domain-specific synonyms occur in many specialized search tasks, such as when searching medical documents, legal documents, and software engineering artifacts. We replicate prior work on ranking domain-specific synonyms in the consumer health domain by applying the approach to a new language and domain: identifying Swedish language synonyms in the building construction domain. We chose this setting because identifying synonyms in this domain is helpful for downstream systems, where different users may query for documents (e.g., engineering requirements) using different terminology. We consider two new features inspired by the change in language and methodological advances since the prior work’s publication. An evaluation using data from the building construction domain supports the finding from the prior work that synonym discovery is best approached as a learning to rank task in which a human editor views ranked synonym candidates in order to construct a domain-specific thesaurus. We additionally find that FastText embeddings alone provide a strong baseline, though they do not perform as well as the strongest learning to rank method. Finally, we analyze the performance of individual features and the differences in the domains.
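
As a rough illustration of the embedding baseline mentioned above, the sketch below ranks candidate terms for a query term by the cosine similarity of their word vectors. The three-dimensional vectors, the example Swedish terms, and the function names `cosine` and `rank_candidates` are hypothetical placeholders, not data or code from the paper; a real setup would presumably take its vectors from a trained FastText model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates, vectors):
    """Return (candidate, score) pairs sorted by descending similarity to the query term."""
    q = vectors[query]
    scored = [(c, cosine(q, vectors[c])) for c in candidates if c != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: made-up toy vectors for illustration only, not learned embeddings.
toy_vectors = {
    "mur": [0.9, 0.1, 0.0],
    "tegelmur": [0.8, 0.2, 0.1],
    "tak": [0.1, 0.9, 0.2],
    "golv": [0.0, 0.3, 0.9],
}

ranking = rank_candidates("mur", ["golv", "tak", "tegelmur"], toy_vectors)
for term, score in ranking:
    print(f"{term}\t{score:.3f}")
```

In the workflow the abstract describes, a ranked list like this would be shown to a human editor who decides which candidates enter the thesaurus. A learning to rank model would treat the similarity score as one feature among several rather than as the final ranking.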

Keywords

Synonym discovery, Thesaurus construction, Domain-specific search, Replication, Generalization

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. Software Engineering Research Laboratory, Blekinge Institute of Technology, Karlskrona, Sweden
