Locality-Sensitive Term Weighting for Short Text Clustering
To alleviate sparseness in short text clustering, much research has investigated external information such as Wikipedia to enrich the feature representation, which requires extra work and resources and may introduce inconsistency. Sparseness leads to weak connections between short texts, so their similarity is difficult to measure. We introduce a special term-specific document set, the potential locality set, to capture weak similarity. Specifically, any two short documents within the same potential locality have a Jaccard similarity greater than 0. In other words, the adjacency graph built on these weak connections is a complete graph. We further propose a locality-sensitive term weighting scheme based on the potential locality set. Experimental results show that the proposed approach builds more reliable neighborhoods for short text data. Compared with another state-of-the-art algorithm, the proposed approach obtains better clustering performance, which verifies its effectiveness.
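The Jaccard property stated in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact construction, so the sketch assumes that the potential locality of a term is simply the set of documents containing that term. The function names `jaccard` and `potential_locality` and the toy corpus are hypothetical, and the term weighting scheme itself is not reproduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def potential_locality(docs, term):
    """Indices of documents containing `term`.

    Assumption: this is one plausible reading of "term-specific
    document set"; the paper's actual definition may differ.
    """
    return [i for i, doc in enumerate(docs) if term in doc]


# Toy corpus of tokenized short texts (hypothetical data).
docs = [
    ["apple", "iphone", "release"],
    ["apple", "pie", "recipe"],
    ["iphone", "battery", "life"],
    ["football", "match", "score"],
]

locality = potential_locality(docs, "apple")

# Every pair inside the locality shares at least the defining term,
# so every pairwise Jaccard similarity is positive and the adjacency
# graph over the locality is complete.
all_connected = all(
    jaccard(docs[i], docs[j]) > 0 for i, j in combinations(locality, 2)
)
```

Under this assumption, documents 0 and 1 fall in the locality of "apple" and are connected, while document 3 shares no term with them and stays outside it.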
Keywords: Short text, Clustering, Locality
The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 11300715], and a grant from City University of Hong Kong [Project No. 7004674].