World Wide Web, Volume 14, Issue 2, pp 157–186

Approximate entity extraction in temporal databases

  • Wei Lu
  • Gabriel Pui Cheong Fung
  • Xiaoyong Du
  • Xiaofang Zhou
  • Lijiang Chen
  • Ke Deng

Abstract

We study the problem of efficiently extracting the K entities in a temporal database that are most similar to a given search query. This problem is well studied in relational databases, where each entity is represented as a single record and a variety of methods exist for defining the similarity between a record and the search query. In temporal databases, however, each entity is represented as a sequence of historical records, and how to properly define the similarity of an entity remains an open problem. The main challenge is that a user who issues a search query for an entity is prone to mixing up information about the same entity from different time points. As a result, the record-granularity methods used in relational databases no longer work. Instead, we regard each entity as a set of “virtual records”, where the attribute values of a “virtual record” can come from different records of the same entity. In this paper, we propose a novel evaluation model that effectively quantifies the similarity between each “virtual record” and the query; the maximum similarity over an entity’s “virtual records” is taken as the similarity of that entity. Because the number of “virtual records” of an entity is exponentially large, calculating the entity’s similarity is challenging. We therefore propose a Dominating Tree Algorithm (DTA), based on a bounding-pruning-refining strategy, to efficiently extract the K entities with the greatest similarities. We conduct extensive experiments on both real and synthetic datasets. The results show that our model for defining the similarity between an entity and the search query is effective, and that the proposed DTA improves performance by at least two orders of magnitude compared with the naive approach.
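
To make the notion of “virtual records” concrete, the following is a minimal Python sketch of a brute-force baseline: enumerate every virtual record of an entity, score each against the query, and take the maximum. It is not the evaluation model or the DTA from the paper, neither of which is specified in the abstract. The scoring functions (`attribute_similarity`, `record_similarity`), the helper names, and the sample data are all assumptions chosen for illustration.

```python
from itertools import product

def attribute_similarity(a, b):
    # Hypothetical per-attribute score: Jaccard similarity over word tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def record_similarity(record, query):
    # Hypothetical record score: mean of the per-attribute similarities.
    return sum(attribute_similarity(v, q) for v, q in zip(record, query)) / len(query)

def virtual_records(history):
    # history: list of historical records (tuples) of one entity.
    # A virtual record takes each attribute value from any historical record,
    # so the count is the product of distinct values per attribute.
    columns = [sorted(set(col)) for col in zip(*history)]
    return product(*columns)

def entity_similarity(history, query):
    # Entity similarity = maximum similarity over all its virtual records.
    return max(record_similarity(vr, query) for vr in virtual_records(history))

def top_k(entities, query, k):
    # Brute-force top-K: score every entity, then sort by descending score.
    scored = sorted(
        ((entity_similarity(h, query), name) for name, h in entities.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Illustrative data (assumed): the query mixes values from two time points of e1.
entities = {
    "e1": [("Alice Smith", "Brisbane", "UQ"), ("Alice Jones", "Beijing", "Renmin")],
    "e2": [("Bob Lee", "Tempe", "ASU")],
}
query = ("Alice Jones", "Brisbane", "UQ")
print(top_k(entities, query, 1))
```

With m historical records and n attributes, an entity can have up to m^n virtual records, which is why exhaustive enumeration of this kind does not scale. The placeholder score used here is separable per attribute, so the sketch only illustrates the search space and not the difficulty of the paper's actual model.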

Keywords

temporal databases, approximate entity extraction, bounding-pruning-refining, n-partite graph



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Wei Lu (1, 2)
  • Gabriel Pui Cheong Fung (3)
  • Xiaoyong Du (1, 2)
  • Xiaofang Zhou (1, 2, 4)
  • Lijiang Chen (5)
  • Ke Deng (4)
  1. School of Information, Renmin University of China, Beijing, China
  2. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China
  3. Data Mining and Machine Learning Group, Arizona State University, Tempe, USA
  4. School of ITEE, The University of Queensland, Brisbane, Australia
  5. Department of Computer Science, Peking University, Beijing, China
