Abstract
Investigations of online transaction data often face the problem that entries for identical products cannot be identified as such. Online auctions, for example, typically lack a unique product identifier; retailers make their offers on price comparison sites hard to compare; and online stores often use different identifiers for virtually identical products. When no unique product identifier is available, existing studies typically restrict their data sets to one or only a few products in order to avoid product heterogeneity. We propose the Ambiguous Identifier Clustering Technique (AICT), which identifies online transaction data that refer to virtually the same product. Using a data set of eBay auctions, we demonstrate that AICT clusters online transactions for identical products with high accuracy. We further show how researchers benefit from AICT and the reduced product heterogeneity when analyzing data with econometric models.
Notes
An n-gram is a contiguous sequence of n items (characters or words) of a text. The character-based n-grams of size 2 (bi-grams) of the word “text” are “te”, “ex”, “xt”. A detailed example of character-based n-grams can be found in Table 1.
We use a very short stop word list containing only one entry for each year between 1999 and 2013, and the following words: album, cd, digipack, dvd, edition, emi, hardback, hardcover, limited, paperback.
We found 942 different bi-grams in our data set.
We used conditional inference trees (see Hothorn et al. 2006 for a detailed description of this method) with 10,000 Monte Carlo replications to test for significant splits of ending time.
PriceGrabber.com, for example, lists more than 120 products when searching for “canon powershot sx50 hs”. Some of these products are color variants of the camera, some are bundles that include the camera, and some are accessories such as bags or tripods.
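The character-based n-grams and the stop word list described in the notes above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lowercasing, the whitespace tokenization, and the example title are assumptions made for illustration only. The stop word entries themselves are taken from the note.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the contiguous character n-grams of `text`, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n characters yields no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Stop words as listed in the notes: one entry per year from 1999 to 2013,
# plus ten format- and edition-related terms.
STOP_WORDS = {str(year) for year in range(1999, 2014)} | {
    "album", "cd", "digipack", "dvd", "edition", "emi",
    "hardback", "hardcover", "limited", "paperback",
}


def remove_stop_words(title: str) -> list[str]:
    """Drop stop words from a title.

    Assumption: titles are lowercased and split on whitespace; the source
    does not specify the tokenization that was actually used.
    """
    return [token for token in title.lower().split() if token not in STOP_WORDS]


# The word "text" is the example from the note; the title is hypothetical.
print(char_ngrams("text"))  # ['te', 'ex', 'xt']
print(remove_stop_words("Greatest Hits Limited Edition CD 2005"))  # ['greatest', 'hits']
```

Storing the stop words in a set keeps each membership test constant-time, which matters little for a 25-entry list but scales to larger vocabularies.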
References
Abbasi, A., & Chen, H. (2008). CyberGate: a design framework and system for text analysis of computer-mediated communication. MIS Quarterly, 32(4), 811–837.
Aliguliyev, R. M. (2009). Performance evaluation of density-based clustering methods. Information Sciences, 179(20), 3583–3602.
Ananthanarayanan, R., Chenthamarakshan, V., Deshpande, P. M., & Krishnapuram, R. (2008). Rule based synonyms for entity extraction from noisy text. Proceedings of the 2nd ACM Workshop on Analytics for Noisy Unstructured Text Data (pp. 31–38).
Arasu, A., & Kaushik, R. (2009). A grammar-based entity representation framework for data cleaning. Proceedings of the 35th ACM SIGMOD International Conference on Management of Data (pp. 233–244).
Arasu, A., Ganti, V., & Kaushik, R. (2006). Efficient exact set-similarity joins. Proceedings of the 32nd International Conference on Very Large Data Bases (pp. 918–929).
Bajari, P., & Hortaçsu, A. (2003). The winner’s curse, reserve prices, and endogenous entry: empirical insights from eBay auctions. RAND Journal of Economics, 34(2), 329–355.
Bapna, R., Goes, P., & Gupta, A. (2003). Replicating online yankee auctions to analyze auctioneers’ and bidders’ strategies. Information Systems Research, 14(3), 244–268.
Bapna, R., Goes, P., Gupta, A., & Jin, Y. (2004). User heterogeneity and its impact on electronic auction market design: an empirical exploration. MIS Quarterly, 28(1), 21–43.
Bapna, R., Jank, W., & Shmueli, G. (2008). Consumer surplus in online auctions. Information Systems Research, 19(4), 400–416.
Bapna, R., Chang, S. A., Goes, P., & Gupta, A. (2009). Overlapping online auctions: empirical characterization of bidder strategies and auction prices. MIS Quarterly, 33(4), 763–783.
Becker, J.-M., Rai, A., Ringle, C. M., & Völckner, F. (2013). Discovering unobserved heterogeneity in structural equation models to avert validity threats. MIS Quarterly, 37(3), 665–694.
Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 1–35.
Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm Optimization approach to cluster short texts. Information Sciences, 265, 36–49.
Cheema, A., Chakravarti, D., & Sinha, A. R. (2012). Bidding behavior in descending and ascending auctions. Marketing Science, 31(5), 779–800.
Chehreghani, M. H., Abolhassani, H., & Chehreghani, M. H. (2009). Density link-based methods for clustering web pages. Decision Support Systems, 47(4), 374–382.
Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3), 288–321.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1), 1–38.
Dey, D. (2003). Record matching in data warehouses: a decision model for data consolidation. Operations Research, 51(2), 240–254.
Dey, D., Sarkar, S., & De, P. (2002). A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), 567–582.
Duan, W. (2010). Analyzing the impact of intermediaries in electronic markets: an empirical investigation of online consumer-to-consumer (C2C) auctions. Electronic Markets, 20(2), 85–93.
Easley, R. F., Wood, C. A., & Barkataki, S. (2010). Bidding patterns, experience, and avoiding the winner’s curse in online auctions. Journal of Management Information Systems, 27(3), 241–268.
Einav, L., Kuchler, T., Levin, J., & Sundaresan, N. (2015). Assessing sale strategies in online markets using matched listings. American Economic Journal: Microeconomics, 7(2), 215–247.
Elfenbein, D., Fisman, R., & McManus, B. (2012). Charity as a substitute for reputation: evidence from an online marketplace. Review of Economic Studies, 79(4), 1441–1468.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
Forman, C., Ghose, A., & Wiesenfeld, B. (2008). Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets. Information Systems Research, 19(3), 291–313.
Frischmann, T., Hinz, O., & Skiera, B. (2012). Retailers’ use of shipping cost strategies: free shipping or partitioned prices? International Journal of Electronic Commerce, 16(3), 65–87.
Ghose, A., Telang, R., & Krishnan, R. (2005). Effect of electronic secondary markets on the supply chain. Journal of Management Information Systems, 22(2), 91–120.
Gilkeson, J. H., & Reynolds, K. (2003). Determinants of internet auction success and closing price: an exploratory study. Psychology & Marketing, 20(6), 537–566.
Goes, P., Tu, Y., & Tung, A. (2013). Seller heterogeneity in electronic marketplaces: a study of new and experienced sellers in eBay. Decision Support Systems, 56, 247–258.
Hinz, O., Hill, S., & Kim, J.-Y. (2016). TV’s dirty little secret: the negative effect of popular tv on online auction sales. MIS Quarterly, forthcoming.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Hou, J., & Blodgett, J. (2010). Market structure and quality uncertainty: a theoretical framework for online auction research. Electronic Markets, 20(1), 21–32.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420.
Kaufman, L., & Rousseeuw, P. J. (2005). Finding groups in data: An introduction to cluster analysis. Hoboken: Wiley.
Kocas, C. (2002). Evolution of prices in electronic markets under diffusion of price-comparison shopping. Journal of Management Information Systems, 19(3), 99–119.
Larsen, M. D., & Rubin, D. B. (2001). Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453), 32–41.
Li, X., Morie, P., & Roth, D. (2005). Semantic integration in text: from ambiguous names to identifiable entities. AI Magazine, 26(1), 45–58.
Li, Y., Luo, C., & Chung, S. M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641–652.
Lim, E.-P., Srivastava, J., Prabhakar, S., & Richardson, J. (1993). Entity identification in database integration. Proceedings of the 9th IEEE International Conference on Data Engineering (pp. 294–301).
Liu, Y., & Sutanto, J. (2012). Buyers’ purchasing time and herd behavior on deal-of-the-day group-buying websites. Electronic Markets, 22(2), 83–93.
Lucking-Reiley, D. (1999). Using field experiments to test equivalence between auction formats: magic on the internet. American Economic Review, 89(5), 1063–1080.
Lucking-Reiley, D., Bryan, D., Prasad, N., & Reeves, D. (2007). Pennies from eBay: the determinants of price in online auctions. The Journal of Industrial Economics, 55(2), 223–233.
Manning, C. D., Raghavan, P., & Schuetze, H. (2009). Introduction to information retrieval. Cambridge: Cambridge University Press.
Lin, M., Lucas, H. C., Jr., & Shmueli, G. (2013). Too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906–917.
Mudambi, S. M., & Schuff, D. (2010). What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly, 34(1), 185–200.
Mumtaz, A., Coviello, E., Lanckriet, G. R. G., & Chan, A. B. (2007). Clustering dynamic textures with the hierarchical EM algorithm for modeling video. IEEE Transactions on Pattern Analysis & Machine Intelligence, 35(7), 1606–1621.
Ostrovsky, M., & Schwarz, M. (2009). Reserve prices in internet advertising auctions: A field experiment. Working Paper, http://faculty-gsb.stanford.edu/ostrovsky/papers/rp.pdf.
Papapetrou, O., Siberski, W., & Fuhr, N. (2011). Decentralized probabilistic text clustering. IEEE Transactions on Knowledge and Data Engineering, 24(10), 1848–1861.
Sadinle, M., & Fienberg, S. E. (2013). A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108(502), 385–397.
Sarkar, A., Chakraborty, A., & Chaudhuri, A. (2008). A method of finding predictor genes for a particular disease using a clustering algorithm. Communications in Statistics – Simulation and Computation, 37(1), 203–211.
Shimodaira, H. (2002). An approximately unbiased test of phylogenetic tree selection. Systematic Biology, 51(3), 492–508.
Shimodaira, H. (2004). Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. The Annals of Statistics, 32(6), 2616–2641.
Shiu, J.-L., & Sun, C.-H. D. (2014). The determinants of price in online auctions: more evidence from unbalanced panel data. Journal of Applied Statistics, 41(2), 382–392.
Srinivasan, K., & Wang, X. (2010). Bidders’ experience and learning in online auctions: issues and implications. Marketing Science, 29(6), 988–993.
Stern, B., & Stafford, M. R. (2006). Individual and social determinants of winning bids in online auctions. Journal of Consumer Behavior, 5(1), 43–55.
Tan, C.-H., Teo, H.-H., & Xu, H. (2010). Online auction: the effects of transaction probability and listing price on a seller’s decision-making behavior. Electronic Markets, 20(1), 67–79.
Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607–633.
Zhao, H., & Ram, S. (2008). Entity matching across heterogeneous data sources: an approach based on constrained cascade generalization. Data & Knowledge Engineering, 66(3), 368–381.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393.
Zhou, W., & Hinz, O. (2015). Determining profit-optimizing return policies—a two-step approach on data from taobao.com. Electronic Markets, forthcoming.
Additional information
Responsible Editor: Hans-Dieter Zimmermann
Cite this article
Scholz, M., Franz, M. & Hinz, O. The Ambiguous Identifier Clustering Technique. Electron Markets 26, 143–156 (2016). https://doi.org/10.1007/s12525-016-0217-2
DOI: https://doi.org/10.1007/s12525-016-0217-2