Correlations between random projections and the bivariate normal

Data Mining and Knowledge Discovery

Abstract

Random projection is a technique used primarily for dimension reduction: it maps high-dimensional data to a low-dimensional space while preserving, in expectation, pairwise quantities such as the Euclidean distance, inner product, angular distance, and \(l_p\) distance for even values of \(p\). These estimated pairwise distances between observations in the low-dimensional space can be computed rapidly and used for nearest neighbor searches, clustering, or even classification. This paper highlights the common thread between two seemingly disparate topics, random projections and the bivariate normal, and expands upon two computational statistical techniques from the recent random projection literature that further improve the accuracy of the estimated inner product between vectors under random projection by making use of properties of the respective dataset. It also discusses the limitations of these methods.
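To make "preserving the inner product in expectation" concrete, the following is a minimal sketch of a plain Gaussian random projection. It is not the paper's improved estimators. The function names, the dimensions `d` and `k`, the number of trials, and the uniform test vectors are all assumptions chosen for illustration. Each vector is mapped to \(k\) dimensions by a matrix with i.i.d. \(N(0, 1)\) entries scaled by \(1/\sqrt{k}\), and the inner product of the projected vectors is averaged over many independent projections, which should land close to the true inner product.

```python
import random


def random_projection_matrix(d, k, rng):
    """Return a d x k matrix (list of rows) with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]


def project(x, R):
    """Map a length-d vector x to k dimensions: v = R^T x / sqrt(k)."""
    d, k = len(R), len(R[0])
    scale = k ** -0.5
    return [scale * sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def demo(d=50, k=20, trials=200, seed=0):
    """Compare the true inner product with the mean projected estimate.

    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    true_ip = dot(x, y)

    estimates = []
    for _ in range(trials):
        # A fresh projection matrix per trial; x and y share the same matrix.
        R = random_projection_matrix(d, k, rng)
        estimates.append(dot(project(x, R), project(y, R)))

    mean_estimate = sum(estimates) / trials
    return true_ip, mean_estimate


if __name__ == "__main__":
    true_ip, mean_estimate = demo()
    print("true inner product:", true_ip)
    print("mean estimate over projections:", mean_estimate)
```

A single projection gives a noisy estimate whose variance shrinks as `k` grows. Reducing that variance further for a fixed `k` is the kind of accuracy improvement the abstract refers to.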

Acknowledgements

We would like to thank the reviewers for their comments and suggestions for improvement, which have helped to enhance the quality of the paper. We also thank Wong Wei Pin and Sergey Kushnarev for fruitful and productive discussions, and Omar Ortiz for his technical assistance.

Author information

Corresponding author

Correspondence to Keegan Kang.

Additional information

Responsible editor: Fei Wang.

This work is funded by the SUTD Faculty Fellow Grant RGFECA17003 as well as the Singapore Ministry of Education Academic Research Fund Tier 2 Grant MOE2018-T2-2-013.

Cite this article

Kang, K. Correlations between random projections and the bivariate normal. Data Min Knowl Disc 35, 1622–1653 (2021). https://doi.org/10.1007/s10618-021-00764-6
