Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding

  • Balázs Pintér
  • Gyula Vörös
  • Zoltán Szabó
  • András Lőrincz
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 318)


We extend the scope of Wikification to novel words by relaxing two premises of Wikification: (i) we wikify without using the surface form of the word, and (ii) we wikify to a mixture of Wikipedia senses instead of a single sense. We identify two types of “novel” words: words where the connection between their surface form and their meaning is broken (e.g., a misspelled word), and words where there is no meaning to connect to because the meaning itself is also novel. We propose a method capable of wikifying both types of novel words while also dealing with the inherently large-scale disambiguation problem. We show that the method can disambiguate between up to 1,000 Wikipedia senses, and that it can explain words with novel meaning as a mixture of other, possibly related senses. This mixture representation compares favorably to the widely used bag-of-words representation.


Interpreting novel words, Wikification, Link disambiguation, Natural language processing, Structured sparse coding



The research has been supported by the ‘European Robotic Surgery’ EC FP7 grant (no.: 288233). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of other members of the consortium or the European Commission. The research was carried out as part of the EITKIC_12-1-2012-0001 project, which is supported by the Hungarian Government, managed by the National Development Agency, financed by the Research and Technology Innovation Fund and was performed in cooperation with the EIT ICT Labs Budapest Associate Partner Group.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Balázs Pintér (1)
  • Gyula Vörös (1)
  • Zoltán Szabó (1, 2)
  • András Lőrincz (1)
  1. Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
  2. Gatsby Computational Neuroscience Unit, University College London, London, UK
