Random Indexing Explained with High Probability

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9302)

Abstract

Random indexing (RI) is an incremental method for constructing a vector space model (VSM) with a reduced dimensionality. Previously, the method has been justified using the mathematical framework of Kanerva’s sparse distributed memory. This justification, although intuitively plausible, fails to provide the information that is required to set the parameters of the method. In order to suggest criteria for the method’s parameters, the RI method is revisited and described using the principles of linear algebra and sparse random projections in Euclidean spaces. These simple mathematics are then employed to suggest criteria for setting the method’s parameters and to explain their influence on the estimated distances in the RI-constructed VSMs. The empirical results observed in an evaluation are reported to support the suggested guidelines in the paper.

Keywords

Random indexing Dimensionality reduction Text analytics 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    QasemiZadeh, B.: Random indexing revisited. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds.) NLDB 2015. LNCS, vol. 9103, pp. 437–442. Springer, Heidelberg (2015) CrossRefGoogle Scholar
  2. 2.
    Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)MathSciNetMATHGoogle Scholar
  3. 3.
    Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)CrossRefMATHGoogle Scholar
  4. 4.
    Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is nearest neighbor meaningful? In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1998) CrossRefGoogle Scholar
  5. 5.
    Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)CrossRefGoogle Scholar
  6. 6.
    Brand, M.: Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications 415(1), 20–30 (2006). Special Issue on Large Scale Linear and Nonlinear Eigenvalue ProblemsMathSciNetCrossRefMATHGoogle Scholar
  7. 7.
    Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: Beals, R., Beck, A., Bellow, A., Hajian, A. (eds.) Conference on Modern Analysis and Probability (1982: Yale University). Contemporary Mathematics, vol. 26. American Mathematical Society, pp. 189–206 (1984)Google Scholar
  8. 8.
    Dasgupta, S., Gupta, A.: An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1), 60–65 (2003)MathSciNetCrossRefMATHGoogle Scholar
  9. 9.
    Achlioptas, D.: Database-friendly random projections. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281. ACM, Santa Barbara, May 2001Google Scholar
  10. 10.
    Li, P., Hastie, T.J., Church, K.W.: Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296. ACM, New York (2006)Google Scholar
  11. 11.
    Kanerva, P., Kristoferson, J., Holst, A.: Random indexing of text samples for latent semantic analysis. In: Gleitman, L.R., Josh, A.K. (eds.) Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p. 1036. Erlbaum, Mahwah (2000) Google Scholar
  12. 12.
    Sahlgren, M.: An introduction to random indexing. Technical report, Swedish ICT (SICS) (2005). Retrived from https://www.sics.se/~mange/papers/RI_intro.pdf
  13. 13.
    Lupu, M.: On the usability of random indexing in patent retrieval. In: Hernandez, N., Jäschke, R., Croitoru, M. (eds.) ICCS 2014. LNCS, vol. 8577, pp. 202–216. Springer, Heidelberg (2014) Google Scholar
  14. 14.
    Polajnar, T., Clark, S.: Improving distributional semantic vectors through context selection and normalisation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Gothenburg, pp. 230–238, April 2014Google Scholar
  15. 15.
    Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky Wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation 43(3), 209–226 (2009)CrossRefGoogle Scholar
  16. 16.
    Brinkman, B., Charikar, M.: On the impossibility of dimension reduction in L1. Journal of the ACM 52(5), 766–788 (2005)MathSciNetCrossRefMATHGoogle Scholar
  17. 17.
    Lapesa, G., Evert, S.: Evaluating neighbor rank and distance measures as predictors of semantic priming. In: Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics, Sofia, pp. 66–74, August 2013Google Scholar
  18. 18.
    Indyk, P.: Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM 53(3), 307–323 (2006)MathSciNetCrossRefMATHGoogle Scholar
  19. 19.
    Li, P., Samorodnitsk, G., Hopcroft, J.: Sign Cauchy projections and chi-square kernel. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 2571–2579. Curran Associates, Inc. (2013)Google Scholar
  20. 20.
    Geva, S., De Vries, C.M.: TOPSIG: topology preserving document signatures. In: Berendt, B., de Vries, A., Fan, W., Macdonald, C., Ounis, I., Ruthven, I. (eds.) Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 333–338. ACM, Glasgow (2011)Google Scholar
  21. 21.
    Zadeh, B.Q., Handschuh, S.: Random Manhattan integer indexing: incremental L1 normed vector space construction. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1713–1723. Association for Computational Linguistics, Doha (2014)Google Scholar
  22. 22.
    Zadeh, B.Q., Handschuh, S.: Random Manhattan indexing. In: 25th International Workshop on Database and Expert Systems Applications (DEXA), pp. 203–208. IEEE, Munich (2014)Google Scholar
  23. 23.
    Baroni, M., Lenci, A., Onnis, L.: ISA meets Lara: an incremental word space model for cognitively plausible simulations of semantic learning. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pp. 49–56. Association for Computational Linguistics, Prague, June 2007Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.Digital Libraries and Web Information SystemsUniversity of PassauPassauGermany
  2. 2.National University of IrelandGalwayIreland

Personalised recommendations