Locality-Sensitive Term Weighting for Short Text Clustering

  • Chu-Tao Zheng
  • Sheng Qian
  • Wen-Ming Cao
  • Hau-San Wong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10634)


To alleviate sparseness in short text clustering, much prior research has used external information such as Wikipedia to enrich the feature representation, which requires extra work and resources and may introduce inconsistency. Sparseness leads to weak connections between short texts, so similarity information is difficult to measure. We introduce a special term-specific document set, the potential locality set, to capture weak similarity. Specifically, for any two short documents within the same potential locality, the Jaccard similarity between them is greater than 0. In other words, the adjacency graph based on these weak connections is a complete graph. Further, we propose a locality-sensitive term weighting scheme based on the potential locality set. Experimental results show that the proposed approach builds more reliable neighborhoods for short text data. Compared with another state-of-the-art algorithm, the proposed approach obtains better clustering performance, which verifies its effectiveness.
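The property stated in the abstract (any two documents in the same potential locality have Jaccard similarity greater than 0, so their adjacency graph is complete) can be illustrated with a small sketch. The abstract does not give the exact construction, so the code below assumes, purely for illustration, that the locality of a term is the set of documents containing that term; the helper names `jaccard`, `term_document_sets`, and `is_complete_graph` and the toy documents are hypothetical and not taken from the paper. The term weighting scheme itself is not reproduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_document_sets(docs):
    """Map each term to the indices of the documents that contain it.

    Assumption: this stands in for a term-specific document set;
    the paper's actual definition may differ.
    """
    sets = {}
    for idx, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            sets.setdefault(term, set()).add(idx)
    return sets


def is_complete_graph(doc_ids, doc_terms):
    """True if every pair of documents in doc_ids has Jaccard similarity > 0."""
    return all(
        jaccard(doc_terms[i], doc_terms[j]) > 0
        for i, j in combinations(sorted(doc_ids), 2)
    )


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "apple releases new phone",
    "new phone battery review",
    "stock market falls",
    "apple stock rises",
]
doc_terms = [set(d.lower().split()) for d in docs]
locality = term_document_sets(docs)

# Documents sharing a term always overlap in at least that term,
# so every term-specific set induces a complete graph.
all_complete = all(is_complete_graph(ids, doc_terms) for ids in locality.values())
```

Under this assumption, documents 0 and 2 share no term and have Jaccard similarity 0, yet both are linked to document 3 through the sets for "apple" and "stock", which is the kind of weak connection the abstract describes.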


Short text, Clustering, Locality



The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 11300715], and a grant from City University of Hong Kong [Project No. 7004674].



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Chu-Tao Zheng (1)
  • Sheng Qian (1)
  • Wen-Ming Cao (1)
  • Hau-San Wong (1)
  1. Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
