Empiric Introduction to Light Stochastic Binarization

  • Daniel Devatman Hromada
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


We introduce a novel method for transforming texts into short binary vectors which can subsequently be compared by means of Hamming distance measurement. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF representation, then applies Reflective Random Indexing in order to fold both term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded along its 50th percentile, so that every individual bit of the resulting hash cuts the whole input dataset into two subsets of equal cardinality. Without any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results which are comparable to those obtained by much more complex deep learning techniques.
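The pipeline described above (TF-IDF weighting, Reflective Random Indexing, median thresholding, Hamming comparison) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the use of sparse ternary seed vectors for terms, and the parameters `n_bits`, `n_seeds` and `n_iter` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def tfidf(counts):
    """Plain TF-IDF weighting of a documents-by-terms count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))     # guard against unused terms
    return counts * idf

def normalize_rows(m):
    """Scale every row to unit length, leaving all-zero rows untouched."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def lsb_hash(counts, n_bits=8, n_seeds=2, n_iter=2, seed=0):
    """Return one n_bits-long binary hash per document (assumed parameters)."""
    rng = np.random.default_rng(seed)
    w = tfidf(counts)
    n_terms = w.shape[1]

    # Assumption: every term starts with a sparse random vector that holds
    # n_seeds non-zero entries, each of them +1 or -1.
    term_vecs = np.zeros((n_terms, n_bits))
    for t in range(n_terms):
        pos = rng.choice(n_bits, size=n_seeds, replace=False)
        term_vecs[t, pos] = rng.choice([-1.0, 1.0], size=n_seeds)

    # Reflective iterations: documents are rebuilt from terms, then terms
    # are rebuilt from documents, so both spaces fold into n_bits dimensions.
    for _ in range(n_iter):
        doc_vecs = normalize_rows(w @ term_vecs)
        term_vecs = normalize_rows(w.T @ doc_vecs)
    doc_vecs = normalize_rows(w @ term_vecs)

    # Threshold each dimension at its median (50th percentile) over the dataset.
    thresholds = np.median(doc_vecs, axis=0)
    return (doc_vecs > thresholds).astype(np.uint8)

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return int(np.count_nonzero(a != b))
```

Given a documents-by-terms count matrix `counts`, `lsb_hash(counts)` would return one row of bits per document, and `hamming(hashes[i], hashes[j])` would compare two documents. Because the threshold is the per-dimension median, no bit can be set for more than half of the documents, which is the balancing property the abstract describes.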


Reflective Random Indexing; unsupervised Locality Sensitive Hashing; Dimensionality Reduction; Hamming Distance; Nearest-Neighbor Search





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Daniel Devatman Hromada (1, 2)
  1. Faculty of Electrical Engineering and Information Technology, Department of Robotics and Cybernetics, Slovak University of Technology, Bratislava, Slovakia
  2. Laboratoire Cognition Humaine et Artificielle, Université Paris 8, St Denis Cedex 02, France
