An Ontology-Based Method for Duplicate Detection in Web Data Tables

  • Patrice Buche
  • Juliette Dibie-Barthélemy
  • Rania Khefifi
  • Fatiha Saïs
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6860)

Abstract

We present a method for detecting duplicates in semantically annotated Web data tables, driven by a domain Termino-Ontological Resource (TOR). The method relies on fuzzy semantic annotations, one of which is automatically associated with each row of a Web data table. Each annotation instantiates a composed concept of the domain TOR, which represents the semantic n-ary relationship between the columns of the table, and contains fuzzy values expressed as fuzzy sets. Our automatic method detects pairs of duplicate fuzzy semantic annotations and relies on (i) knowledge declared in the domain TOR and (ii) similarity measures between fuzzy sets. We define two new similarity measures to compare both symbolic and numerical fuzzy values. The method has been tested on a real application in the domain of chemical risk in food.
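
To make the idea of comparing fuzzy values concrete, here is a minimal sketch in Python. It does not reproduce the two similarity measures defined in the paper, which the abstract does not specify. It uses the classical fuzzy Jaccard index (sum of minimum degrees over sum of maximum degrees) on symbolic fuzzy sets as a stand-in. The names `fuzzy_jaccard` and `are_duplicate_candidates`, the `threshold` value, and the column name `food` are all assumptions made for illustration.

```python
from typing import Dict

# A symbolic fuzzy set: element -> membership degree in [0, 1].
FuzzySet = Dict[str, float]


def fuzzy_jaccard(a: FuzzySet, b: FuzzySet) -> float:
    """Classical fuzzy Jaccard index (a stand-in, not the paper's measure)."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Two empty fuzzy sets are treated as identical.
    return inter / union if union > 0 else 1.0


def are_duplicate_candidates(
    row1: Dict[str, FuzzySet],
    row2: Dict[str, FuzzySet],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Flag two annotated rows when every shared column is similar enough."""
    shared = set(row1) & set(row2)
    return bool(shared) and all(
        fuzzy_jaccard(row1[c], row2[c]) >= threshold for c in shared
    )


# Hypothetical annotated rows with a single symbolic column.
row_a = {"food": {"rice": 1.0, "cereal": 0.5}}
row_b = {"food": {"rice": 1.0, "cereal": 0.5}}
row_c = {"food": {"wheat": 1.0}}

print(are_duplicate_candidates(row_a, row_b))
print(are_duplicate_candidates(row_a, row_c))
```

In this sketch a pair of rows is only a candidate when all shared columns pass the threshold. The method described above additionally exploits knowledge declared in the domain TOR, which this example leaves out.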

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Patrice Buche (1, 2)
  • Juliette Dibie-Barthélemy (3)
  • Rania Khefifi (4)
  • Fatiha Saïs (4)
  1. INRA - UMR IATE, Montpellier Cedex 2, France
  2. LIRMM, CNRS-UM2, Montpellier, France
  3. INRA - Mét@risk & AgroParisTech, Paris Cedex 5, France
  4. LRI (CNRS & Paris-Sud 11 University)/INRIA Saclay, Orsay Cedex, France
