APWeb 2015: Web Technologies and Applications, pp 54-70
Mining the Discriminative Word Sets for Bag-of-Words Model Based on Distributional Similarity Graph
Abstract
Most previous distributional clustering methods are fundamentally unsupervised, and the discriminative property of words is not well modeled in the clustering procedure. In this paper, we propose a supervised model that incorporates the class-conditional probability into the word-similarity measure and casts word-set extraction as a supervised graph-partition optimization problem. A greedy algorithm is proposed to solve this model; it combines word selection and word grouping in a unified framework. By grouping related words, the method essentially turns the exact match between word bins into a fuzzy match between groups of related-word bins, which to some extent avoids the synonymy problem in the bag-of-words (BoW) model. Experiments on data sets demonstrate that the proposed method is applicable to both text sets and image sets, and that it yields better retrieval precision while reducing the lexicon size.
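The move from exact matching to fuzzy matching between grouped bins can be illustrated with a minimal sketch. It is not the paper's algorithm: the word groups below are written by hand rather than produced by the proposed graph-partition model, and the names `word_to_group`, `group_histogram`, and `histogram_intersection` are hypothetical, as is the choice of histogram intersection as the match score.

```python
from collections import Counter


def group_histogram(tokens, word_to_group):
    """Count tokens per word group; words outside the lexicon are dropped."""
    return Counter(word_to_group[t] for t in tokens if t in word_to_group)


def histogram_intersection(h1, h2):
    """Sum of per-bin minimum counts over the bins both histograms share."""
    return sum(min(h1[k], h2[k]) for k in h1.keys() & h2.keys())


# Hand-picked groups for illustration only (assumption, not from the paper).
word_to_group = {
    "car": "g_vehicle",
    "automobile": "g_vehicle",
    "road": "g_road",
    "street": "g_road",
}

doc_a = ["car", "road", "car"]
doc_b = ["automobile", "street"]

# Exact match: one bin per word, so synonyms never share a bin.
exact_score = histogram_intersection(Counter(doc_a), Counter(doc_b))

# Fuzzy match: one bin per group, so related words fall into the same bin.
grouped_score = histogram_intersection(
    group_histogram(doc_a, word_to_group),
    group_histogram(doc_b, word_to_group),
)

# The grouped lexicon has one entry per group instead of one per word.
lexicon_size_words = len(word_to_group)
lexicon_size_groups = len(set(word_to_group.values()))
```

In this toy case the two documents share no word bins, yet they overlap once bins are merged by group, and the lexicon shrinks from four word bins to two group bins. How the groups are chosen, and which words are kept at all, is what the supervised model in the paper is meant to decide.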