Sentence Clustering Using Continuous Vector Space Representation

  • Mara Chinea-Rios
  • Germán Sanchis-Trilles
  • Francisco Casacuberta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9117)


In this paper, we present a clustering approach that combines a continuous vector space representation of sentences with the \(k\)-means algorithm. The main motivation of this proposal is to split a large heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit to obtain a continuous vector representation of each word. We provide empirical evidence that our technique can lead to better clusters in terms of intra-cluster perplexity and \(F_1\) score.
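The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not say how word vectors are combined into a sentence vector, so averaging is used here purely as an assumption. The two-dimensional `TOY_VECTORS` are hypothetical values standing in for vectors that would come from word2vec, and the helper names `sentence_vector` and `kmeans` are illustrative, not taken from the paper.

```python
import random

# Hypothetical 2-d "embeddings"; in practice these would come from word2vec.
TOY_VECTORS = {
    "stock": [0.9, 0.1], "market": [0.8, 0.2], "bank": [0.85, 0.15],
    "goal": [0.1, 0.9], "match": [0.2, 0.8], "team": [0.15, 0.85],
}

def sentence_vector(sentence, vectors):
    """Average the vectors of known words (an assumed composition rule)."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        raise ValueError("sentence contains no known words")
    return [sum(column) / len(known) for column in zip(*known)]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

sentences = [
    "stock market", "bank stock", "market bank",
    "goal match", "team goal", "match team",
]
points = [sentence_vector(s, TOY_VECTORS) for s in sentences]
labels = kmeans(points, k=2)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```

With these toy vectors the first three sentences end up in one cluster and the last three in the other. Which cluster receives which numeric label depends on the random initialisation.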


Keywords: Clustering, \(k\)-means, continuous vector spaces



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mara Chinea-Rios (1, corresponding author)
  • Germán Sanchis-Trilles (1)
  • Francisco Casacuberta (1)
  1. Pattern Recognition and Human Language Technologies Center, Universitat Politècnica de València, Valencia, Spain
