Abstract
Self-supervised and semi-supervised learning (SSL) on tabular data is an understudied topic. Despite some attempts, two major challenges remain: (1) the imbalanced nature of tabular datasets, and (2) the one-hot encoding used in these methods becomes inefficient for high-cardinality categorical features. To address these challenges, we propose SAWTab, which uses a target encoding method, Conditional Probability Representation (CPR), for an efficient input-space representation of categorical features. We improve this representation by incorporating unlabeled samples through pseudo-labels. Furthermore, we propose a Smoothed Adaptive Weighting mechanism in the target encoding to mitigate noisy and biased pseudo-labels. Experimental results on various datasets, and comparisons with existing frameworks, show that SAWTab yields the best test accuracy on all datasets. We find that pseudo-labels can help improve the input-space representation in the SSL setting, which enhances the generalization of the learning algorithm.
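Since the full method is not reproduced on this page, the following Python sketch is only an illustration of the two ideas the abstract names, under stated assumptions: CPR is taken to map each category value to an empirical conditional class distribution P(y | value), and pseudo-labeled rows are assumed to contribute to those statistics with a smooth, confidence-based weight rather than a hard threshold. The function names (smooth_weight, cpr_encode) and all parameter values are hypothetical, not the paper's API.

import numpy as np

def smooth_weight(confidence, threshold=0.7, sharpness=10.0):
    # Hypothetical smooth adaptive weight: a sigmoid centered at
    # `threshold`, so low-confidence pseudo-labels contribute little
    # and high-confidence ones contribute almost fully, avoiding the
    # hard cut-off of standard confidence thresholding.
    return 1.0 / (1.0 + np.exp(-sharpness * (confidence - threshold)))

def cpr_encode(values, labels, weights, n_classes, alpha=1.0):
    # Encode one categorical feature by its weighted empirical
    # conditional class distribution P(y = c | value).
    #   values  : category value per row (labeled + pseudo-labeled)
    #   labels  : class index per row (true or pseudo label)
    #   weights : 1.0 for labeled rows, smooth_weight(...) otherwise
    #   alpha   : Laplace smoothing so rare categories stay well-defined
    # Returns {category value: length-n_classes probability vector}.
    table = {}
    for v, y, w in zip(values, labels, weights):
        counts = table.setdefault(v, np.zeros(n_classes))
        counts[y] += w
    return {v: (c + alpha) / (c.sum() + alpha * n_classes)
            for v, c in table.items()}

# Toy usage: two labeled rows and one pseudo-labeled row (confidence 0.9)
# for a single categorical feature such as "city".
values = np.array(["NY", "NY", "LA"])
labels = np.array([0, 1, 1])
weights = np.array([1.0, 1.0, smooth_weight(0.9)])
encoding = cpr_encode(values, labels, weights, n_classes=2)
print(encoding["NY"])  # length-2 probability vector summing to 1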
Cite this paper
Mohammady Gharasuie, M., Wang, F., Sharif, O., Mukkamala, R. (2024). SAWTab: Smoothed Adaptive Weighting for Tabular Data in Semi-supervised Learning. In: Yang, D.N., Xie, X., Tseng, V.S., Pei, J., Huang, J.W., Lin, J.C.W. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol. 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_24
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4