Conflict-Aware Historical Data Fusion

  • Vladimir Zadorozhny
  • Ying-Feng Hsu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6929)

Abstract

Historical data reports on numerous events for overlapping time intervals, locations, and names. As a result, it may include severe data conflicts caused by database redundancy that prevent researchers from obtaining the correct answers to queries on an integrated historical database. In this paper, we propose a novel conflict-aware data fusion strategy for historical data sources. We evaluated our approach on a large-scale data warehouse that integrates historical data from approximately 50,000 reports on US epidemiological data for more than 100 years. We demonstrate that our approach significantly reduces data aggregation error in the integrated historical database.

Keywords

Data Fusion, Integrity Constraint, Historical Database, Measle Case, Aggregate Query

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Vladimir Zadorozhny (1)
  • Ying-Feng Hsu (1)
  1. Graduate Program of Information Science and Technology, University of Pittsburgh, Pittsburgh
