Abstract
In this work, we present a dimensionality reduction algorithm, also known as a sketching algorithm, for categorical datasets. Our sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors from their sketches alone. The minimum sketch dimension that Cham requires for a theoretically sound estimate depends only on the sparsity of the data points, making the approach well suited to the many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world datasets, including one with over a million dimensions. We show that the Cabin and Cham duo is a significantly fast and accurate approach for tasks such as \(\mathrm{RMSE}\), all-pairs similarity, and clustering when compared to working with the full dataset and other dimensionality reduction techniques.
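The abstract does not spell out the constructions themselves. As a loose illustration of the general sketch-and-estimate workflow it describes — not the authors' Cabin/Cham algorithms, whose details are in the paper — the toy Python sketch below ORs a sparse binary vector's active coordinates into random buckets; the Hamming distance between two such sketches then never exceeds, and for sparse inputs closely tracks, the true Hamming distance. All names and parameters here are illustrative.

```python
import random

def binary_sketch(active_indices, dim, sketch_dim, seed=42):
    """Map each of the `dim` coordinates to one of `sketch_dim` buckets
    with a shared random hash, then OR the active bits into their buckets."""
    rng = random.Random(seed)
    bucket = [rng.randrange(sketch_dim) for _ in range(dim)]
    sketch = [0] * sketch_dim
    for i in active_indices:
        sketch[bucket[i]] = 1
    return sketch

def hamming(a, b):
    """Hamming distance between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

# Two sparse binary vectors over 100,000 dimensions, represented as sets
# of active (non-zero) coordinates, with 20 coordinates in common.
d, k = 100_000, 1024
A = set(range(0, 30))      # 30 active bits
B = set(range(10, 35))     # 25 active bits; overlap with A is {10..29}
exact = len(A ^ B)         # true Hamming distance: 10 + 5 = 15

sA = binary_sketch(A, d, k)
sB = binary_sketch(B, d, k)
# Each differing sketch bucket is witnessed by at least one coordinate in
# the symmetric difference, so the estimate never exceeds `exact`.
estimate = hamming(sA, sB)
print(exact, estimate)
```

With a sketch dimension far larger than the number of active bits, hash collisions are rare, so the sketch distance typically matches the true distance; this mirrors the abstract's point that the required sketch dimension scales with sparsity, not with the ambient dimension.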
Additional information
Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Verma, B.D., Pratap, R. & Bera, D. Efficient binary embedding of categorical data using BinSketch. Data Min Knowl Disc 36, 537–565 (2022). https://doi.org/10.1007/s10618-021-00815-y