Electronic Markets, Volume 26, Issue 2, pp 143–156

The Ambiguous Identifier Clustering Technique

  • Michael Scholz
  • Markus Franz
  • Oliver Hinz
Research Paper


Investigations of online transaction data often face the problem that entries for identical products cannot be identified as such. For example, online auctions typically have no unique product identifier, retailers make their offers on price comparison sites hard to compare, and online stores often use different identifiers for virtually identical products. When a unique product identifier is not available, existing studies typically restrict their data sets to one or only a few products in order to avoid product heterogeneity. We propose the Ambiguous Identifier Clustering Technique (AICT), which identifies online transaction data that refer to virtually the same product. Based on a data set of eBay auctions, we demonstrate that AICT clusters online transactions for identical products with high accuracy. We further show how researchers benefit from AICT and the reduced product heterogeneity when analyzing data with econometric models.


Keywords

Product heterogeneity, Clustering, Online transaction data, E-commerce

JEL Classification

C18 D44 



Copyright information

© Institute of Applied Informatics at University of Leipzig 2016

Authors and Affiliations

  1. University of Passau, Passau, Germany
