The Ambiguous Identifier Clustering Technique

Scholz, Michael; Franz, Markus; Hinz, Oliver

doi:10.1007/s12525-016-0217-2

Research Paper
Published: 13 February 2016

Volume 26, pages 143–156 (2016)
Electronic Markets

Abstract

Investigations of online transaction data often face the problem that entries for identical products cannot be identified as such. There is, for example, typically no unique product identifier in online auctions, retailers on price comparison sites present their offers in ways that make them hard to compare, and online stores often use different identifiers for virtually identical products. Existing studies typically restrict their data sets to one or only a few products in order to avoid product heterogeneity when a unique product identifier is not available. We propose the Ambiguous Identifier Clustering Technique (AICT), which identifies online transaction records that refer to virtually the same product. Based on a data set of eBay auctions, we demonstrate that AICT clusters online transactions for identical products with high accuracy. We further show how researchers benefit from AICT and the reduced product heterogeneity when analyzing data with econometric models.

Notes

An n-gram is a contiguous subsequence of n items (characters or words) of a text. The character-based n-grams of size 2 (bi-grams) of the word “text” are “te”, “ex”, and “xt”. A detailed example of character-based n-grams can be found in Table 1.
An overview of textual features is presented in Abbasi and Chen (2008) and in Zheng et al. (2006).
We use a very short stop word list containing only one entry for each year between 1999 and 2013, and the following words: album, cd, digipack, dvd, edition, emi, hardback, hardcover, limited, paperback.
We found 942 different bi-grams in our data set.
We used conditional inference trees (see Hothorn et al. 2006 for a detailed description of this method) with 10,000 Monte Carlo replications to test for significant splits of ending time.
PriceGrabber.com, for example, lists more than 120 products when searching for “canon powershot sx50 hs”. Some of these products are color variants of the camera, some are bundles that include the camera, and some are accessories such as bags or tripods.
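
The notes on character-based bi-grams and on the stop word list can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the lower-casing, the whitespace tokenization, and the choice to build bi-grams per token are all assumptions, and the sample title is invented.

```python
from typing import Iterable, List, Set

# Stop words as listed in the note: one entry per year from 1999 to 2013
# plus the ten listed words.
STOP_WORDS: Set[str] = {str(year) for year in range(1999, 2014)} | {
    "album", "cd", "digipack", "dvd", "edition", "emi",
    "hardback", "hardcover", "limited", "paperback",
}


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all contiguous character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def remove_stop_words(title: str) -> List[str]:
    """Lower-case a title, split on whitespace, and drop stop words.

    Lower-casing and whitespace splitting are assumptions of this sketch.
    """
    return [tok for tok in title.lower().split() if tok not in STOP_WORDS]


def distinct_bigrams(titles: Iterable[str]) -> Set[str]:
    """Collect the distinct bi-grams of all non-stop-word tokens (assumed per token)."""
    grams: Set[str] = set()
    for title in titles:
        for tok in remove_stop_words(title):
            grams.update(char_ngrams(tok, 2))
    return grams


if __name__ == "__main__":
    print(char_ngrams("text"))  # ['te', 'ex', 'xt']
    # Hypothetical listing title, not taken from the paper's data set.
    print(remove_stop_words("Abbey Road CD 2009 Limited Edition"))
```

Applied to a full set of listing titles, `len(distinct_bigrams(titles))` gives the kind of count reported in the note on the number of different bi-grams, although the exact figure depends on preprocessing choices that this sketch only guesses at.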

References

Abbasi, A., & Chen, H. (2008). CyberGate: a design framework and system for text analysis of computer-mediated communication. MIS Quarterly, 32(4), 811–837.
Aliguliyev, R. M. (2009). Performance evaluation of density-based clustering methods. Information Sciences, 179(20), 3583–3602.
Ananthanarayanan, R., Chenthamarakshan, V., Deshpande, P. M., & Krishnapuram, R. (2008). Rule based synonyms for entity extraction from noisy text. Proceedings of the 2nd ACM Workshop on Analytics for Noisy Unstructured Text Data (pp. 31–38).
Arasu, A., & Kaushik, R. (2009). A grammar-based entity representation framework for data cleaning. Proceedings of the 35th ACM SIGMOD International Conference on Management of Data (pp. 233–244).
Arasu, A., Ganti, V., & Kaushik, R. (2006). Efficient exact set-similarity joins. Proceedings of the 32nd International Conference on Very Large Data Bases (pp. 918–929).
Bajari, P., & Hortaçsu, A. (2003). The winner’s curse, reserve prices, and endogenous entry: empirical insights from eBay auctions. RAND Journal of Economics, 34(2), 329–355.
Bapna, R., Goes, P., & Gupta, A. (2003). Replicating online yankee auctions to analyze auctioneers’ and bidders’ strategies. Information Systems Research, 14(3), 244–268.
Bapna, R., Goes, P., Gupta, A., & Jin, Y. (2004). User heterogeneity and its impact on electronic auction market design: an empirical exploration. MIS Quarterly, 28(1), 21–43.
Bapna, R., Jank, W., & Shmueli, G. (2008). Consumer surplus in online auctions. Information Systems Research, 19(4), 400–416.
Bapna, R., Chang, S. A., Goes, P., & Gupta, A. (2009). Overlapping online auctions: empirical characterization of bidder strategies and auction prices. MIS Quarterly, 33(4), 763–783.
Becker, J.-M., Rai, A., Ringle, C. M., & Völckner, F. (2013). Discovering unobserved heterogeneity in structural equation models to avert validity threats. MIS Quarterly, 37(3), 665–694.
Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 1–35.
Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014). An efficient Particle Swarm Optimization approach to cluster short texts. Information Sciences, 265, 36–49.
Cheema, A., Chakravarti, D., & Sinha, A. R. (2012). Bidding behavior in descending and ascending auctions. Marketing Science, 31(5), 779–800.
Chehreghani, M. H., Abolhassani, H., & Chehreghani, M. H. (2009). Density link-based methods for clustering web pages. Decision Support Systems, 47(4), 374–382.
Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3), 288–321.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1), 1–38.
Dey, D. (2003). Record matching in data warehouses: a decision model for data consolidation. Operations Research, 51(2), 240–254.
Dey, D., Sarkar, S., & De, P. (2002). A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), 567–582.
Duan, W. (2010). Analyzing the impact of intermediaries in electronic markets: an empirical investigation of online consumer-to-consumer (C2C) auctions. Electronic Markets, 20(2), 85–93.
Easley, R. F., Wood, C. A., & Barkataki, S. (2010). Bidding patterns, experience, and avoiding the winner’s curse in online auctions. Journal of Management Information Systems, 27(3), 241–268.
Einav, L., Kuchler, T., Levin, J., & Sundaresan, N. (2015). Assessing sale strategies in online markets using matched listings. American Economic Journal: Microeconomics, 7(2), 215–247.
Elfenbein, D., Fisman, R., & McManus, B. (2012). Charity as a substitute for reputation: evidence from an online marketplace. Review of Economic Studies, 79(4), 1441–1468.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
Forman, C., Ghose, A., & Wiesenfeld, B. (2008). Examining the relationship between reviews and sales: the role of reviewer identity disclosure in electronic markets. Information Systems Research, 19(3), 291–313.
Frischmann, T., Hinz, O., & Skiera, B. (2012). Retailers’ use of shipping cost strategies: free shipping or partitioned prices? International Journal of Electronic Commerce, 16(3), 65–87.
Ghose, A., Telang, R., & Krishnan, R. (2005). Effect of electronic secondary markets on the supply chain. Journal of Management Information Systems, 22(2), 91–120.
Gilkeson, J. H., & Reynolds, K. (2003). Determinants of internet auction success and closing price: an exploratory study. Psychology & Marketing, 20(6), 537–566.
Goes, P., Tu, Y., & Tung, A. (2013). Seller heterogeneity in electronic marketplaces: a study of new and experienced sellers in eBay. Decision Support Systems, 56, 247–258.
Hinz, O., Hill, S., & Kim, J.-Y. (2016). TV’s dirty little secret: the negative effect of popular TV on online auction sales. MIS Quarterly, forthcoming.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Hou, J., & Blodgett, J. (2010). Market structure and quality uncertainty: a theoretical framework for online auction research. Electronic Markets, 20(1), 21–32.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420.
Kaufman, L., & Rousseeuw, P. J. (2005). Finding groups in data: An introduction to cluster analysis. Hoboken: Wiley.
Kocas, C. (2002). Evolution of prices in electronic markets under diffusion of price-comparison shopping. Journal of Management Information Systems, 19(3), 99–119.
Larsen, M. D., & Rubin, D. B. (2001). Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453), 32–41.
Li, X., Morie, P., & Roth, D. (2005). Semantic integration in text: from ambiguous names to identifiable entities. AI Magazine, 26(1), 45–58.
Li, Y., Luo, C., & Chung, S. M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641–652.
Lim, E.-P., Srivastava, J., Prabhakar, S., & Richardson, J. (1993). Entity identification in database integration. Proceedings of the 9th IEEE International Conference on Data Engineering (pp. 294–301).
Liu, Y., & Sutanto, J. (2012). Buyers’ purchasing time and herd behavior on deal-of-the-day group-buying websites. Electronic Markets, 22(2), 83–93.
Lucking-Reiley, D. (1999). Using field experiments to test equivalence between auction formats: magic on the internet. American Economic Review, 89(5), 1063–1080.
Lucking-Reiley, D., Bryan, D., Prasad, N., & Reeves, D. (2007). Pennies from eBay: the determinants of price in online auctions. The Journal of Industrial Economics, 55(2), 223–233.
Manning, C. D., Raghavan, P., & Schuetze, H. (2009). Introduction to information retrieval. Cambridge: Cambridge University Press.
Mingfeng, L., Lucas, H. C., Jr., & Shmueli, G. (2013). Too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906–917.
Mudambi, S. M., & Schuff, D. (2010). What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly, 34(1), 185–200.
Mumtaz, A., Coviello, E., Lanckriet, G. R. G., & Chan, A. B. (2007). Clustering dynamic textures with the hierarchical EM algorithm for modeling video. IEEE Transactions on Pattern Analysis & Machine Intelligence, 35(7), 1606–1621.
Ostrovsky, M., & Schwarz, M. (2009). Reserve prices in internet advertising auctions: A field experiment. Working Paper, http://faculty-gsb.stanford.edu/ostrovsky/papers/rp.pdf.
Papapetrou, O., Siberski, W., & Fuhr, N. (2011). Decentralized probabilistic text clustering. IEEE Transactions on Knowledge and Data Engineering, 24(10), 1848–1861.
Sadinle, M., & Fienberg, S. E. (2013). A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108(502), 385–397.
Sarkar, A., Chakraborty, A., & Chaudhuri, A. (2008). A method of finding predictor genes for a particular disease using a clustering algorithm. Communications in Statistics – Simulation and Computation, 37(1), 203–211.
Shimodaira, H. (2002). An approximately unbiased test of phylogenetic tree selection. Systematic Biology, 51(3), 492–508.
Shimodaira, H. (2004). Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. The Annals of Statistics, 32(6), 2616–2641.
Shiu, J.-L., & Sun, C.-H. D. (2014). The determinants of price in online auctions: more evidence from unbalanced panel data. Journal of Applied Statistics, 41(2), 382–392.
Srinivasan, K., & Wang, X. (2010). Bidders’ experience and learning in online auctions: issues and implications. Marketing Science, 29(6), 988–993.
Stern, B., & Stafford, M. R. (2006). Individual and social determinants of winning bids in online auctions. Journal of Consumer Behavior, 5(1), 43–55.
Tan, C.-H., Teo, H.-H., & Xu, H. (2010). Online auction: the effects of transaction probability and listing price on a seller’s decision-making behavior. Electronic Markets, 20(1), 67–79.
Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for information integration. Information Systems, 26(8), 607–633.
Zhao, H., & Ram, S. (2008). Entity matching across heterogeneous data sources: an approach based on constrained cascade generalization. Data & Knowledge Engineering, 66(3), 368–381.
Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3), 378–393.
Zhou, W., & Hinz, O. (2015). Determining profit-optimizing return policies—a two-step approach on data from taobao.com. Electronic Markets, forthcoming.

Author information

Authors and Affiliations

University of Passau, Innstr. 43, 94032 Passau, Germany
Michael Scholz, Markus Franz & Oliver Hinz

Corresponding author

Correspondence to Michael Scholz.

Additional information

Responsible Editor: Hans-Dieter Zimmermann

About this article

Cite this article

Scholz, M., Franz, M. & Hinz, O. The Ambiguous Identifier Clustering Technique. Electron Markets 26, 143–156 (2016). https://doi.org/10.1007/s12525-016-0217-2

Received: 22 July 2015
Accepted: 26 January 2016
Published: 13 February 2016
Issue Date: May 2016
DOI: https://doi.org/10.1007/s12525-016-0217-2
