Abstract
Deep learning models have shown clear advantages in feature representation and document retrieval. However, such models typically take only the most frequent words as input when learning latent features, which inevitably discards much of the useful information contained in documents, especially high-dimensional ones. We introduce a novel method based on word-vector clustering that produces low-dimensional semantic vectors of documents, which serve as the input to a deep learning model and improve the feature representation in its output layer. First, word vectors, a compact and distributed representation of words, are obtained by training a neural network language model with word2vec. Then, we present a modified word-vector clustering method based on locality-sensitive hashing and affinity propagation, which adapts and scales better to large, high-dimensional data. Afterwards, each document is represented by the set of cluster centers, which is used as the input to the deep learning model. Experimental results show that the proposed method improves the feature representation ability of the deep learning model and outperforms traditional methods on a document retrieval task.
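The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes word vectors are already available (for example from word2vec) and uses a p-stable, E2LSH-style hash of the form h(v) = floor((a·v + b) / w) to group nearby vectors into buckets. The affinity propagation stage is omitted, and each bucket is treated as one cluster. Representing a document as a normalized histogram over clusters is likewise an assumption made for illustration. All function names and parameter values (`k`, `w`, `seed`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_e2lsh_hash(dim, k, w, seed=0):
    """Build k Gaussian (2-stable) hash functions h(v) = floor((a.v + b) / w)."""
    rng = random.Random(seed)
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(k)
    ]

    def h(vec):
        # The bucket key is the tuple of the k individual hash values.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, vec)) + b) / w)
            for a, b in params
        )

    return h


def bucket_words(word_vectors, h):
    """Group words whose vectors share the same hash key."""
    buckets = defaultdict(list)
    for word, vec in word_vectors.items():
        buckets[h(vec)].append(word)
    return buckets


def index_clusters(buckets):
    """Assign an integer cluster id to every word (one cluster per bucket)."""
    return {
        word: i
        for i, key in enumerate(sorted(buckets))
        for word in buckets[key]
    }


def document_vector(tokens, word_to_cluster, n_clusters):
    """Normalized histogram of cluster ids; unknown tokens are skipped."""
    counts = [0.0] * n_clusters
    for token in tokens:
        cluster = word_to_cluster.get(token)
        if cluster is not None:
            counts[cluster] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy 2-D "word vectors" (made-up values, for illustration only).
toy_vectors = {
    "cat": [1.0, 0.9],
    "dog": [1.1, 1.0],
    "car": [-5.0, -4.8],
    "bus": [-5.1, -5.0],
}
toy_hash = make_e2lsh_hash(dim=2, k=3, w=4.0, seed=1)
toy_buckets = bucket_words(toy_vectors, toy_hash)
toy_index = index_clusters(toy_buckets)
toy_doc = document_vector(["cat", "dog", "car", "unknown"], toy_index, len(toy_buckets))
```

The resulting vector has one dimension per cluster rather than one per vocabulary word, which is the sense in which the document representation becomes low-dimensional. In a fuller version, a clustering step such as affinity propagation would run within or across buckets to choose the cluster centers.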
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Li, H., Hao, W., Zhang, H., Chen, G. (2016). Feature Representation Based on Improved Word-Vector Clustering Using AP and E2LSH. In: Zhang, L., Song, X., Wu, Y. (eds) Theory, Methodology, Tools and Applications for Modeling and Simulation of Complex Systems. AsiaSim SCS AutumnSim 2016 2016. Communications in Computer and Information Science, vol 646. Springer, Singapore. https://doi.org/10.1007/978-981-10-2672-0_15
DOI: https://doi.org/10.1007/978-981-10-2672-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2671-3
Online ISBN: 978-981-10-2672-0
eBook Packages: Computer Science (R0)