Empiric Introduction to Light Stochastic Binarization
We introduce a novel method for transforming texts into short binary vectors which can subsequently be compared by means of Hamming distance. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF representation, then applies Reflective Random Indexing in order to fold both term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded at its 50th percentile, so that every individual bit of the resulting hash cuts the whole input dataset into two subsets of equal cardinality. Without any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results comparable to those obtained by much more complex deep learning techniques.
Keywords: Reflective Random Indexing, unsupervised Locality Sensitive Hashing, Dimensionality Reduction, Hamming Distance, Nearest-Neighbor Search
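The pipeline described in the abstract (TF-IDF weighting, Reflective Random Indexing, a per-dimension median split, Hamming comparison) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `tfidf`, `unit`, `binarize` and `hamming`, the parameters `dim`, `seeds` and `iterations`, the sparse ±1 initialization of term vectors, the unit-length normalization, the rank-based tie-breaking at the median and the toy corpus are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """One {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def binarize(docs, dim=16, seeds=4, iterations=2, seed=0):
    """Hash tokenized documents into `dim`-bit vectors (needs iterations >= 1)."""
    rng = random.Random(seed)
    weights = tfidf(docs)
    vocab = sorted({term for doc in docs for term in doc})

    # Assumed initialization: sparse term vectors with `seeds` random +1/-1 entries.
    term_vecs = {}
    for term in vocab:
        vec = [0.0] * dim
        for i in rng.sample(range(dim), seeds):
            vec[i] = rng.choice((-1.0, 1.0))
        term_vecs[term] = vec

    doc_vecs = []
    for step in range(iterations):
        # Document vectors: TF-IDF-weighted sums of the current term vectors.
        doc_vecs = []
        for w in weights:
            vec = [0.0] * dim
            for term, weight in w.items():
                for i, x in enumerate(term_vecs[term]):
                    vec[i] += weight * x
            doc_vecs.append(unit(vec))
        if step == iterations - 1:
            break
        # Reflection: rebuild term vectors from the documents that contain them.
        new_vecs = {term: [0.0] * dim for term in vocab}
        for w, dvec in zip(weights, doc_vecs):
            for term, weight in w.items():
                for i, x in enumerate(dvec):
                    new_vecs[term][i] += weight * x
        term_vecs = {term: unit(vec) for term, vec in new_vecs.items()}

    # Median split: in every dimension the upper half of the documents gets
    # bit 1 and the rest get bit 0 (ties are broken by document order).
    n = len(docs)
    hashes = [[0] * dim for _ in range(n)]
    for i in range(dim):
        order = sorted(range(n), key=lambda j: doc_vecs[j][i])
        for j in order[n - n // 2:]:
            hashes[j][i] = 1
    return hashes


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the rocket reached orbit after launch".split(),
    "the shuttle launch put a satellite in orbit".split(),
    "nasa plans a new rocket launch".split(),
    "the goalie stopped the puck in overtime".split(),
    "the hockey team won the game in overtime".split(),
    "a late goal won the hockey game".split(),
]
hashes = binarize(docs)
```

Ranking the documents within each dimension, rather than comparing against a numeric median, guarantees that every bit splits an even-sized corpus exactly in half even when several documents share the same value. Once the hashes exist, nearest-neighbor search reduces to calls such as `hamming(hashes[0], hashes[1])`.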