Embeddings of Categorical Variables for Sequential Data in Fraud Context

  • Yoan Russac
  • Olivier Caelen
  • Liyun He-Guelton
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 723)


In this paper we propose a new generic method for handling categorical variables in sequential data. Our main contributions are: (1) the use of unsupervised methods to extract sequential information, and (2) the generation of embeddings that incorporate this sequential information for categorical variables, using the well-known Word2Vec neural network. Compared with the commonly used one-hot encoding, these embeddings not only reduce memory usage but also improve the capacity of machine learning algorithms to learn from the data. We applied these techniques to a real-world credit card fraud dataset representing more than 400 million transactions over a one-year time window. We reduced memory usage by 50% and improved performance by 3 percentage points while using only a small subset of the features.
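The embedding step summarised above can be sketched in code. The following is a minimal, self-contained illustration of skip-gram Word2Vec with negative sampling, written in plain NumPy and trained on toy sequences of categorical codes (each "sentence" standing in for one card's transaction history). The category names, embedding dimension, window size, and learning rate are illustrative assumptions, not the paper's data or hyperparameters; in practice one would use a full Word2Vec implementation on the real transaction sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each "sentence" is one card's sequence of merchant
# category codes, ordered in time (names are illustrative only).
sequences = [
    ["grocery", "fuel", "grocery", "restaurant"],
    ["fuel", "grocery", "fuel", "hotel"],
    ["restaurant", "hotel", "restaurant", "grocery"],
] * 50  # repeat so the toy model sees each pattern many times

vocab = sorted({c for seq in sequences for c in seq})
idx = {c: i for i, c in enumerate(vocab)}
V, dim, window, lr = len(vocab), 8, 2, 0.05

# Input and output embedding matrices, as in skip-gram Word2Vec.
W_in = rng.normal(scale=0.1, size=(V, dim))
W_out = rng.normal(scale=0.1, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Skip-gram with negative sampling: for each centre code, push its vector
# towards the vectors of codes seen nearby and away from random negatives.
for epoch in range(20):
    for seq in sequences:
        ids = [idx[c] for c in seq]
        for pos, center in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx in ids[lo:pos] + ids[pos + 1:hi]:
                targets = [ctx] + list(rng.integers(0, V, size=3))
                labels = [1.0] + [0.0] * 3
                for t, y in zip(targets, labels):
                    h = W_in[center].copy()
                    score = sigmoid(h @ W_out[t])
                    grad = lr * (y - score)
                    W_in[center] += grad * W_out[t]
                    W_out[t] += grad * h

# Each category is now a dense dim-dimensional vector instead of a
# V-dimensional one-hot vector.
embedding = {c: W_in[idx[c]] for c in vocab}
print(embedding["grocery"].shape)  # → (8,)
```

The memory argument in the abstract follows directly from the shapes: a one-hot encoding of a variable with thousands of levels stores one column per level, whereas the learned embedding stores only `dim` floats per level, with `dim` typically far smaller than the vocabulary size.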


Categorical variable · Word2Vec · Embeddings · Credit card fraud detection · Machine learning



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. ENSAE ParisTech, Paris, France
  2. R&D, Worldline, Brussels, Belgium
  3. R&D, Worldline, Lyon, France
