Machine Learning, Volume 107, Issue 8–10, pp. 1477–1494

Similarity encoding for learning with dirty categorical variables

  • Patricio Cerda
  • Gaël Varoquaux
  • Balázs Kégl
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


Abstract

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance over known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
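The core idea can be sketched in a few lines: instead of a 0/1 one-hot indicator, each category is represented by its string similarity to a set of known categories. The Jaccard similarity over character 3-gram sets used below is one simple choice among the string similarities the paper discusses, and the job-title strings are invented for illustration; this is a minimal sketch, not the authors' reference implementation.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of a string."""
    s = " " + s.lower() + " "          # pad so short strings still yield grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototypes.
    If sim(a, b) were 1 when a == b and 0 otherwise, this would reduce
    to classic one-hot encoding."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]

# A misspelled category stays close to the entity it refers to:
protos = ["senior analyst", "junior analyst", "accountant"]
row = similarity_encode(["senior anallyst"], protos)[0]
```

Here the misspelled "senior anallyst" gets its highest similarity to "senior analyst", a moderate one to "junior analyst" (shared "analyst" grams), and none to "accountant": the redundancy among dirty categories is exposed to the learner rather than collapsed by deduplication. Using a subset of prototype categories as columns, as in the last function's `prototypes` argument, is also how the dimensionality-reduction variant mentioned above keeps the encoding small.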


Keywords: Dirty data · Categorical variables · Statistical learning · String similarity measures



Acknowledgements

We would like to acknowledge the excellent feedback from the reviewers. This work was funded by the Wendelin and DirtyData (ANR-17-CE23-0018) grants.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Inria, Parietal team, Palaiseau, France
  2. Linear Accelerator Laboratory, CNRS, Orsay, France
