Similarity encoding for learning with dirty categorical variables
For statistical learning, categorical variables in a table are usually treated as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
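To make the idea concrete, here is a minimal sketch of similarity encoding: each category is represented by its string similarity to a set of reference categories. The paper uses an n-gram similarity; this sketch substitutes a simple Jaccard similarity over character 3-gram sets as a stand-in, and the `prototypes` list, padding convention, and function names are illustrative assumptions, not the paper's exact implementation.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s, padded with spaces at the boundaries."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (in [0, 1])."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(value, prototypes, n=3):
    """Encode a category as its similarity to each prototype category.

    With prototypes = all training categories and an exact-match similarity,
    this reduces to one-hot encoding; a string similarity instead lets
    morphologically close dirty categories share feature mass.
    """
    return [ngram_similarity(value, p, n) for p in prototypes]

# Example: misspelled or variant job titles get similar encodings.
prototypes = ["police officer", "accountant", "firefighter"]
clean = similarity_encode("police officer", prototypes)
dirty = similarity_encode("senior police oficer", prototypes)
```

With one-hot encoding, `"senior police oficer"` would get a feature vector orthogonal to `"police officer"`; here its encoding stays close to it, which is what exposes the redundancy to the downstream learner. For very high cardinality, `prototypes` can be a subset of the observed categories, as recommended in the abstract.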
Keywords: Dirty data · Categorical variables · Statistical learning · String similarity measures
We would like to acknowledge the excellent feedback from the reviewers. This work was funded by the Wendelin and DirtyData (ANR-17-CE23-0018) grants.