Efficient Semantic-Aware Detection of Near Duplicate Resources

Ioannou, Ekaterini; Papapetrou, Odysseas; Skoutas, Dimitrios; Nejdl, Wolfgang

doi:10.1007/978-3-642-13489-0_10


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6089)

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed to avoid repetition and redundancy and to increase the diversity of the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection that combines an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis of the correctness of the suggested approach, which allows applications to configure it to satisfy their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
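
The abstract does not say which indexing scheme is used, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes each resource is described by a non-empty set of RDF statements, hashes them with MinHash, and uses banded locality-sensitive hashing so that only resources sharing a bucket are compared. All names and parameters here (`minhash_signature`, `candidate_pairs`, `num_hashes`, `bands`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(statements, num_hashes=32):
    """MinHash signature of a non-empty collection of RDF statements.

    Each statement is a tuple of strings, e.g. (predicate, object).
    """
    items = ["|".join(s) for s in statements]
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def candidate_pairs(resources, num_hashes=32, bands=8):
    """Return pairs of resource names that share at least one LSH bucket.

    Assumption: num_hashes is divisible by bands.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, statements in resources.items():
        sig = minhash_signature(statements, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

In a sketch like this, more bands with fewer rows each make it likelier that moderately similar resources collide (higher recall, more candidates to verify), while fewer bands with more rows do the opposite. That trade-off is the kind of knob a probabilistic analysis would let an application tune.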



Author information

Authors and Affiliations

L3S Research Center/Leibniz Universität Hannover
Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas & Wolfgang Nejdl


Editor information

Editors and Affiliations

Department of Computer Science, Free University Amsterdam, De Boelelaan 1081a, 1081 HV, Amsterdam, The Netherlands
Lora Aroyo
Institute of Computer Science, FORTH and Computer Science Department, University of Crete, P.O. Box 1385, 71110, Heraklion, Greece
Grigoris Antoniou
School of Science and Technology, Department of Media Technology, Aalto University, P.O. Box 15500, 00076, Aalto, Finland
Eero Hyvönen
Department of AI, Free University Amsterdam, De Boelelaan 1081A, 1081HV, Amsterdam, The Netherlands
Annette ten Teije
Institut für Informatik, B6, 26, Universität Mannheim, 68159, Mannheim, Germany
Heiner Stuckenschmidt
Knowledge Media Institute, The Open University, Walton Hall, MK7 6AA, Milton Keynes, UK
Liliana Cabral
Stanford Biomedical Informatics Research Center, 251 Campus Drive, 94305-5479, Stanford, CA, USA
Tania Tudorache


Cite this paper

Ioannou, E., Papapetrou, O., Skoutas, D., Nejdl, W. (2010). Efficient Semantic-Aware Detection of Near Duplicate Resources. In: Aroyo, L., et al. The Semantic Web: Research and Applications. ESWC 2010. Lecture Notes in Computer Science, vol 6089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13489-0_10


DOI: https://doi.org/10.1007/978-3-642-13489-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13488-3
Online ISBN: 978-3-642-13489-0
eBook Packages: Computer Science, Computer Science (R0)
