Constructing Document Vectors Using Kernel Density Estimates

  • Michael Mayo
  • Sean Goltz
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10571)


Document vector embeddings are fixed-length numeric representations of text documents that can be used for machine learning and text mining. In this paper we describe a new technique for generating document vectors. Our idea builds on the recently popular notion of neural word vector embeddings and combines it with kernel density estimation. We show that our algorithm produces robust document vectors, and we report an experiment on several challenging text classification datasets that demonstrates its effectiveness.
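The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way word vectors and kernel density estimation might be combined, not necessarily the authors' method. It assumes that each embedding dimension is treated separately: a one-dimensional Gaussian kernel density estimate is fitted to the values that a document's words take in that dimension, the density is sampled at a fixed grid of points, and the samples are concatenated. The function names, the grid, the bandwidth, and the random toy embeddings are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized differences, shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

def kde_document_vector(word_vectors, grid, bandwidth=0.5):
    """Concatenate per-dimension density samples into one fixed-length vector.

    Assumption (not taken from the paper): one KDE per embedding dimension,
    sampled at the same grid for every document.
    """
    word_vectors = np.asarray(word_vectors, dtype=float)
    parts = [
        gaussian_kde_1d(word_vectors[:, d], grid, bandwidth)
        for d in range(word_vectors.shape[1])
    ]
    return np.concatenate(parts)

# Hypothetical toy embeddings: random 4-dimensional vectors, not trained ones.
rng = np.random.default_rng(0)
toy_embeddings = {w: rng.normal(size=4) for w in ["tax", "penalty", "court", "fine"]}

doc_a = ["tax", "penalty", "fine"]
doc_b = ["court", "fine", "penalty", "tax", "tax"]
grid = np.linspace(-3.0, 3.0, 5)  # 5 sample points per dimension

vec_a = kde_document_vector([toy_embeddings[w] for w in doc_a], grid)
vec_b = kde_document_vector([toy_embeddings[w] for w in doc_b], grid)
print(vec_a.shape, vec_b.shape)
```

Under these assumptions the output length depends only on the embedding dimensionality and the number of grid points, so documents with different word counts map to vectors of the same length, which is what a downstream classifier needs.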



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, University of Waikato, Hamilton, New Zealand
