Abstract
Efficiently detecting near-duplicate resources is an important task when integrating information from various sources and applications. Once detected, near-duplicate resources can be grouped together, merged, or removed to avoid repetition and redundancy and to increase the diversity of the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near-duplicate detection that combines an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis of the correctness of the suggested approach, which allows applications to configure it to satisfy their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
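As a rough illustration of how a similarity-search index can surface near-duplicate candidates, the following is a minimal sketch using MinHash signatures with banding, one common form of locality-sensitive hashing. It is not the indexing scheme or the probabilistic analysis of the paper. It assumes each resource is described by a non-empty set of (predicate, object) pairs taken from its RDF description; the function names, the number of hash functions, and the number of bands are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def feature_hash(feature, seed):
    """Deterministic 64-bit hash of a (predicate, object) pair for a given seed."""
    data = (str(seed) + "|" + "|".join(feature)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(features, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash function.

    Assumes `features` is a non-empty set.
    """
    return tuple(
        min(feature_hash(f, seed) for f in features) for seed in range(num_hashes)
    )


def candidate_pairs(resources, num_hashes=32, bands=8):
    """Return pairs of resource names that share at least one signature band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, features in resources.items():
        sig = minhash_signature(features, num_hashes)
        for b in range(bands):
            # Resources whose signatures agree on a whole band share a bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    return len(a & b) / len(a | b)


# Hypothetical toy descriptions: two articles with identical features, one unrelated.
resources = {
    "article_a": {("mentions", "Berlin"), ("mentions", "ESWC"), ("topic", "semantic web")},
    "article_b": {("mentions", "Berlin"), ("mentions", "ESWC"), ("topic", "semantic web")},
    "article_c": {("mentions", "Tampa"), ("topic", "census")},
}
candidates = candidate_pairs(resources)
```

Under this banding scheme with b bands of r rows each, two resources with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so an application could tune b and r toward a desired trade-off between missed duplicates and false candidates. Candidate pairs would then typically be verified with an exact measure such as `jaccard`.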
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ioannou, E., Papapetrou, O., Skoutas, D., Nejdl, W. (2010). Efficient Semantic-Aware Detection of Near Duplicate Resources. In: Aroyo, L., et al. The Semantic Web: Research and Applications. ESWC 2010. Lecture Notes in Computer Science, vol 6089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13489-0_10
DOI: https://doi.org/10.1007/978-3-642-13489-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13488-3
Online ISBN: 978-3-642-13489-0
eBook Packages: Computer Science, Computer Science (R0)