The Ambiguous Identifier Clustering Technique

  • Research Paper
  • Electronic Markets

Abstract

Investigations of online transaction data often face the problem that entries for identical products cannot be identified as such. For example, online auctions typically have no unique product identifier, retailers make their offers on price comparison sites hard to compare, and online stores often use different identifiers for virtually identical products. When a unique product identifier is not available, existing studies typically restrict their data sets to one or only a few products in order to avoid product heterogeneity. We propose the Ambiguous Identifier Clustering Technique (AICT), which identifies online transaction data that refer to virtually the same product. Based on a data set of eBay auctions, we demonstrate that AICT clusters online transactions for identical products with high accuracy. We further show how researchers benefit from AICT and the reduced product heterogeneity when analyzing data with econometric models.

Notes

  1. An n-gram is a contiguous subsequence of n items (characters or words) of a text. The character-based n-grams of size 2 (bi-grams) of the word “text” are “te”, “ex”, “xt”. A detailed example of character-based n-grams can be found in Table 1.

  2. An overview of textual features is presented in Abbasi and Chen 2008 and in Zheng et al. 2006.

  3. We use a very short stop word list containing only one entry for each year between 1999 and 2013, and the following words: album, cd, digipack, dvd, edition, emi, hardback, hardcover, limited, paperback.

  4. We found 942 different bi-grams in our data set.

  5. We used conditional inference trees (see Hothorn et al. 2006 for a detailed description of this method) with 10,000 Monte Carlo replications to test for significant splits of ending time.

  6. PriceGrabber.com, for example, lists more than 120 products when searching for “canon powershot sx 50 hs”. Some of these products are colored variants of the camera, some are bundles including the camera and some are supplements such as bags or tripods.
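
Notes 1 and 3 can be illustrated with a minimal sketch. It extracts character bi-grams from a listing title after removing the stop words named in note 3. The tokenization (lowercased alphanumeric tokens), the choice to build bi-grams within each word rather than across word boundaries, and all function names are assumptions made for illustration; they are not taken from the AICT implementation.

```python
import re

# Stop words as listed in note 3: one entry per year 1999-2013 plus format words.
STOP_WORDS = {str(year) for year in range(1999, 2014)} | {
    "album", "cd", "digipack", "dvd", "edition", "emi",
    "hardback", "hardcover", "limited", "paperback",
}


def remove_stop_words(title):
    """Lowercase a title, split it into alphanumeric tokens, drop stop words.

    The tokenization rule is an assumption, not the paper's preprocessing.
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def char_ngrams(word, n=2):
    """Return the contiguous character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def title_bigrams(title):
    """Collect the set of character bi-grams over all non-stop-word tokens."""
    grams = set()
    for token in remove_stop_words(title):
        grams.update(char_ngrams(token, 2))
    return grams


if __name__ == "__main__":
    print(char_ngrams("text"))  # the example from note 1
    print(sorted(title_bigrams("Thriller Limited Edition CD 2008")))
```

For the word “text”, `char_ngrams` yields the three bi-grams given in note 1. The hypothetical title in the last line loses every token except “thriller” to the stop word list before bi-grams are formed.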

References

  • Abbasi, A., & Chen, H. (2008). CyberGate: a design framework and system for text analysis of computer-mediated communication. MIS Quarterly, 32(4), 811–837.

  • Aliguliyev, R. M. (2009). Performance evaluation of density-based clustering methods. Information Sciences, 179(20), 3583–3602.

  • Ananthanarayanan, R., Chenthamarakshan, V., Deshpande, P. M., & Krishnapuram, R. (2008). Rule based synonyms for entity extraction from noisy text. Proceedings of the 2nd ACM Workshop on Analytics for Noisy Unstructured Text Data (pp. 31–38).

  • Arasu, A. & Kaushik, R. (2009). A grammar-based entity representation framework for data cleaning. Proceedings of the 35th ACM SIGMOD International Conference on Management of Data (pp. 233–244).

  • Arasu, A., Ganti, V., & Kaushik, R. (2006). Efficient exact set-similarity joins. Proceedings of the 32nd International Conference on Very Large Data Bases (pp. 918–929).

  • Bajari, P., & Hortaçsu, A. (2003). The winner’s curse, reserve prices, and endogenous entry: empirical insights from eBay auctions. RAND Journal of Economics, 34(2), 329–355.

  • Bapna, R., Goes, P., & Gupta, A. (2003). Replicating online yankee auctions to analyze auctioneers’ and bidders’ strategies. Information Systems Research, 14(3), 244–268.

  • Bapna, R., Goes, P., Gupta, A., & Jin, Y. (2004). User heterogeneity and its impact on electronic auction market design: an empirical exploration. MIS Quarterly, 28(1), 21–43.

  • Bapna, R., Jank, W., & Shmueli, G. (2008). Consumer surplus in online auctions. Information Systems Research, 19(4), 400–416.

  • Bapna, R., Chang, S. A., Goes, P., & Gupta, A. (2009). Overlapping online auctions: empirical characterization of bidder strategies and auction prices. MIS Quarterly, 33(4), 763–783.

  • Becker, J.-M., Rai, A., Ringle, C. M., & Völckner, F. (2013). Discovering unobserved heterogeneity in structural equation models to avert validity threats. MIS Quarterly, 37(3), 665–694.

  • Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 1–35.

  • Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm Optimization approach to cluster short texts. Information Sciences, 265, 36–49.

  • Cheema, A., Chakravarti, D., & Sinha, A. R. (2012). Bidding behavior in descending and ascending auctions. Marketing Science, 31(5), 779–800.

  • Chehreghani, M. H., Abolhassani, H., & Chehreghani, M. H. (2009). Density link-based methods for clustering web pages. Decision Support Systems, 47(4), 374–382.

  • Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3), 288–321.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1), 1–38.

  • Dey, D. (2003). Record matching in data warehouses: a decision model for data consolidation. Operations Research, 51(2), 240–254.

  • Dey, D., Sarkar, S., & De, P. (2002). A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), 567–582.

  • Duan, W. (2010). Analyzing the impact of intermediaries in electronic markets: an empirical investigation of online consumer-to-consumer (C2C) auctions. Electronic Markets, 20(2), 85–93.

  • Easley, R. F., Wood, C. A., & Barkataki, S. (2010). Bidding patterns, experience, and avoiding the winner’s curse in online auctions. Journal of Management Information Systems, 27(3), 241–268.

  • Einav, L., Kuchler, T., Levin, J., & Sundaresan, N. (2015). Assessing sale strategies in online markets using matched listings. American Economic Journal: Microeconomics, 7(2), 215–247.

  • Elfenbein, D., Fisman, R., & McManus, B. (2012). Charity as a substitute for reputation: evidence from an online marketplace. Review of Economic Studies, 79(4), 1441–1468.

  • Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

  • Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.

  • Forman, C., Ghose, A., & Wiesenfeld, B. (2008). Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets. Information Systems Research, 19(3), 291–313.

  • Frischmann, T., Hinz, O., & Skiera, B. (2012). Retailers’ use of shipping cost strategies: free shipping or partitioned prices? International Journal of Electronic Commerce, 16(3), 65–87.

  • Ghose, A., Telang, R., & Krishnan, R. (2005). Effect of electronic secondary markets on the supply chain. Journal of Management Information Systems, 22(2), 91–120.

  • Gilkeson, J. H., & Reynolds, K. (2003). Determinants of internet auction success and closing price: an exploratory study. Psychology & Marketing, 20(6), 537–566.

  • Goes, P., Tu, Y., & Tung, A. (2013). Seller heterogeneity in electronic marketplaces: a study of new and experienced sellers in eBay. Decision Support Systems, 56, 247–258.

  • Hinz, O., Hill, S., & Kim, J.-Y. (2016). TV’s dirty little secret: the negative effect of popular TV on online auction sales. MIS Quarterly, forthcoming.

  • Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

  • Hou, J., & Blodgett, J. (2010). Market structure and quality uncertainty: a theoretical framework for online auction research. Electronic Markets, 20(1), 21–32.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

  • Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420.

  • Kaufman, L., & Rousseeuw, P. J. (2005). Finding groups in data: An introduction to cluster analysis. Hoboken: Wiley.

  • Kocas, C. (2002). Evolution of prices in electronic markets under diffusion of price-comparison shopping. Journal of Management Information Systems, 19(3), 99–119.

  • Larsen, M. D., & Rubin, D. B. (2001). Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453), 32–41.

  • Li, X., Morie, P., & Roth, D. (2005). Semantic integration in text: from ambiguous names to identifiable entities. AI Magazine, 26(1), 45–58.

  • Li, Y., Luo, C., & Chung, S. M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641–652.

  • Lim, E.-P., Srivastava, J., Prabhakar, S., & Richardson, J. (1993). Entity identification in database integration. Proceedings of the 9th IEEE International Conference on Data Engineering (pp. 294–301).

  • Liu, Y., & Sutanto, J. (2012). Buyers’ purchasing time and herd behavior on deal-of-the-day group-buying websites. Electronic Markets, 22(2), 83–93.

  • Lucking-Reiley, D. (1999). Using field experiments to test equivalence between auction formats: magic on the internet. American Economic Review, 89(5), 1063–1080.

  • Lucking-Reiley, D., Bryan, D., Prasad, N., & Reeves, D. (2007). Pennies from eBay: the determinants of price in online auctions. The Journal of Industrial Economics, 55(2), 223–233.

  • Manning, C. D., Raghavan, P., & Schuetze, H. (2009). Introduction to information retrieval. Cambridge: Cambridge University Press.

  • Mingfeng, L., Lucas, H. C., Jr., & Shmueli, G. (2013). Too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906–917.

  • Mudambi, S. M., & Schuff, D. (2010). What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly, 34(1), 185–200.

  • Mumtaz, A., Coviello, E., Lanckriet, G. R. G., & Chan, A. B. (2007). Clustering dynamic textures with the hierarchical em algorithm for modeling video. IEEE Transactions on Pattern Analysis & Machine Intelligence, 35(7), 1606–1621.

  • Ostrovsky, M., & Schwarz, M. (2009). Reserve prices in internet advertising auctions: A field experiment. Working Paper, http://faculty-gsb.stanford.edu/ostrovsky/papers/rp.pdf.

  • Papapetrou, O., Siberski, W., & Fuhr, N. (2011). Decentralized probabilistic text clustering. IEEE Transactions on Knowledge and Data Engineering, 24(10), 1848–1861.

  • Sadinle, M., & Fienberg, S. E. (2013). A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108(502), 385–397.

  • Sarkar, A., Chakraborty, A., & Chaudhuri, A. (2008). A method of finding predictor genes for a particular disease using a clustering algorithm. Communications in Statistics – Simulation and Computation, 37(1), 203–211.

  • Shimodaira, H. (2002). An approximately unbiased test of phylogenetic tree selection. Systematic Biology, 51(3), 492–508.

  • Shimodaira, H. (2004). Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. The Annals of Statistics, 32(6), 2616–2641.

  • Shiu, J.-L., & Sun, C.-H. D. (2014). The determinants of price in online auctions: more evidence from unbalanced panel data. Journal of Applied Statistics, 41(2), 382–392.

  • Srinivasan, K., & Wang, X. (2010). Bidders’ experience and learning in online auctions: issues and implications. Marketing Science, 29(6), 988–993.

  • Stern, B., & Stafford, M. R. (2006). Individual and social determinants of winning bids in online auctions. Journal of Consumer Behavior, 5(1), 43–55.

  • Tan, C.-H., Teo, H.-H., & Xu, H. (2010). Online auction: the effects of transaction probability and listing price on a seller’s decision-making behavior. Electronic Markets, 20(1), 67–79.

  • Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607–633.

  • Zhao, H., & Ram, S. (2008). Entity matching across heterogeneous data sources: an approach based on constrained cascade generalization. Data & Knowledge Engineering, 66(3), 368–381.

  • Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393.

  • Zhou, W., & Hinz, O. (2015). Determining profit-optimizing return policies—a two-step approach on data from taobao.com. Electronic Markets, forthcoming.

Author information

Corresponding author

Correspondence to Michael Scholz.

Additional information

Responsible Editor: Hans-Dieter Zimmermann

Cite this article

Scholz, M., Franz, M. & Hinz, O. The Ambiguous Identifier Clustering Technique. Electron Markets 26, 143–156 (2016). https://doi.org/10.1007/s12525-016-0217-2

  • DOI: https://doi.org/10.1007/s12525-016-0217-2
