Automated Product-Attribute Mapping

  • Karamjit Singh
  • Garima Gupta
  • Gautam Shroff
  • Puneet Agarwal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10526)


Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains. Sometimes this is a missing-data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product ‘SKUs’, or follow different norms for categorization. In practice, a tedious manual mapping process is often employed. We focus on improving such a process using machine-learning-driven automation. Record linkage techniques, such as [5], can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate matches. In this paper, we present an ensemble model that combines minimal supervision, using Bayesian network models, with unsupervised textual matching to automate such ‘attribute fusion’. We present results of our approach on a large volume of real-life data from a market-research scenario and compare it with a standard record matching algorithm. Our approach is especially suited for practical implementation since we also provide confidence values for matches, enabling routing of items for human intervention where required.
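
The general idea of blending a supervised match probability with an unsupervised textual similarity score, and routing low-confidence items to a human, can be illustrated with a minimal sketch. This is not the authors' model: the Bayesian network is replaced here by a dictionary of precomputed probabilities (`supervised_probs`), the textual matcher is Python's `difflib.SequenceMatcher`, and the function names, the linear weighting, and the threshold value are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Unsupervised textual similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ensemble_match(description, candidates, supervised_probs,
                   weight=0.5, threshold=0.6):
    """Pick the best global attribute value for a product description.

    supervised_probs is a hypothetical stand-in for the output of a
    supervised model: a dict mapping each candidate to a probability.
    The linear blend and the threshold are illustrative assumptions.
    """
    scores = {}
    for label in candidates:
        sim = text_similarity(description, label)
        prob = supervised_probs.get(label, 0.0)
        # Weighted average of supervised probability and textual similarity.
        scores[label] = weight * prob + (1 - weight) * sim

    best = max(scores, key=scores.get)
    confidence = scores[best]
    # Items below the threshold are flagged for human review.
    needs_review = confidence < threshold
    return best, confidence, needs_review


if __name__ == "__main__":
    result = ensemble_match(
        "dark chocolate bar",
        ["chocolate", "shampoo"],
        {"chocolate": 0.9, "shampoo": 0.1},
    )
    print(result)
```

Returning the confidence alongside the match is what allows uncertain items to be queued for manual mapping while confident ones are accepted automatically.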


  1. Acheson, E., Peto, R.: Record linkage and the identification of long-term environmental hazards [and discussion]. Proc. Roy. Soc. London B Biol. Sci. 205, 165–178 (1979)
  2. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2003)
  3. Brizan, D.G., Tansel, A.U.: A survey of entity resolution and record linkage methodologies. Commun. IIMA 6, 5 (2015)
  4. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14, 462–467 (1968)
  5. Christen, P.: Febrl: a freely available record linkage system with a graphical user interface. In: 2nd Australasian Workshop on Health Data and Knowledge Management, vol. 80. Australian Computer Society, Inc. (2008)
  6. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969)
  7. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using bayesian networks to analyze expression data. J. Comput. Biol. 7, 601–620 (2000)
  8. Getoor, L., Machanavajjhala, A.: Entity resolution: theory, practice & open challenges. Proc. VLDB Endow. 5, 2018–2019 (2012)
  9. Huang, T., Russell, S.: Object identification: a bayesian analysis with application to traffic surveillance. Artif. Intell. 103, 77–93 (1998)
  10. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
  11. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3, 484–493 (2010)
  12. Lam, W., Bacchus, F.: Learning bayesian belief networks: an approach based on the MDL principle. Comput. Intell. 10, 269–293 (1994)
  13. Li, X., Morie, P., Roth, D.: Semantic integration in text: from ambiguous names to identifiable entities. AI Mag. 26, 45 (2005)
  14. Norén, G.N., Orre, R., Bate, A.: A hit-miss model for duplicate detection in the WHO drug safety database. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM (2005)
  15. Poon, S., Poon, J., Lam, M., et al.: An ensemble approach for record matching in data linkage. Stud. Health Technol. Inf. 227, 113–119 (2016)
  16. Shah, A., Woolf, P.: Python environment for bayesian learning: inferring the structure of bayesian networks from knowledge and data. J. Mach. Learn. Res. JMLR 10, 159–162 (2009)
  17. Singh, K., Paneri, et al.: Visual bayesian fusion to navigate a data lake. In: 2016 19th International Conference on Information Fusion (FUSION). ISIF (2016)
  18. Singh, K., Shroff, G., Agarwal, P.: Predictive reliability mining for early warnings in populations of connected machines. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA). 36678 2015. IEEE (2015)
  19. Uebersax, J.: Genetic Counseling and Cancer Risk Modeling: An Application of Bayes Nets. Ravenpack International, Marbella (2004)
  20. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst.
  21. Yadav, S., Shroff, G., Hassan, E., Agarwal, P.: Business data fusion. In: 2015 18th International Conference on Information Fusion (Fusion). IEEE (2015)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Karamjit Singh (1)
  • Garima Gupta (1)
  • Gautam Shroff (1)
  • Puneet Agarwal (1)

  1. TCS Research, Gurgaon, India
