Abstract
Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of “duplicate” not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so success by a measure may not be meaningful outside the context of the research. These are not merely academic concerns, but are significant problems in the design of research projects. In this paper, the case of the duplicate documents is used to explore whether and when it is reasonable to claim that research is successful.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zobel, J., Bernstein, Y. (2006). The Case of the Duplicate Documents: Measurement, Search, and Science. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_4
DOI: https://doi.org/10.1007/11610113_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)