Data integration for earthquake disaster using real-world data

  • Chuanzhao Tian
  • Guoqing Li
Research Article - Applied Geophysics


The purpose of entity resolution (ER) is to identify records from different sources that refer to the same real-world entity. Most traditional ER studies work on string-based data, so they rely mostly on string comparison techniques; there is little research on numeric data. Traditional ER approaches are widely used in many domains, such as papers, gene sequencing and restaurants, but they have not been applied to earthquake disasters. In this paper, earthquake disaster event information collected from different websites is represented as numeric data. To solve the ER problem for numeric data, we experiment with three methods. First, we treat numbers as strings and use string-based approaches. Second, we use the Euclidean distance to measure the difference between two records. Third, we combine the two strategies into a comprehensive approach for measuring the distance between two records. We experimentally evaluate our methods on real datasets that represent earthquake disaster event information. The experimental results show that the comprehensive approach can achieve high performance.
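The three strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which string measure is used, how distances are normalized, or how the two strategies are combined. The sketch assumes Levenshtein edit distance for the string-based step, maps the Euclidean distance into (0, 1] with 1 / (1 + d), and combines the two scores with a hypothetical `weight` parameter. The record layout (latitude, longitude, depth, magnitude) and the sample values are also made up for illustration.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(rec_a, rec_b) -> float:
    """Strategy 1: treat each number as a string and average the
    normalized edit similarity over all fields (assumed normalization)."""
    sims = []
    for x, y in zip(rec_a, rec_b):
        sx, sy = str(x), str(y)
        longest = max(len(sx), len(sy)) or 1
        sims.append(1.0 - levenshtein(sx, sy) / longest)
    return sum(sims) / len(sims)


def euclidean_similarity(rec_a, rec_b) -> float:
    """Strategy 2: Euclidean distance between the numeric records,
    mapped into (0, 1] so that identical records score 1.0."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(rec_a, rec_b)))
    return 1.0 / (1.0 + dist)


def combined_similarity(rec_a, rec_b, weight: float = 0.5) -> float:
    """Strategy 3: weighted mix of both scores (weight is hypothetical)."""
    return (weight * string_similarity(rec_a, rec_b)
            + (1.0 - weight) * euclidean_similarity(rec_a, rec_b))


# Made-up records: (latitude, longitude, depth_km, magnitude).
site_a = (10.0, 20.0, 5.0, 6.0)
site_b = (10.1, 20.0, 5.0, 6.0)   # likely the same event, slightly different report
site_c = (40.0, 80.0, 30.0, 7.5)  # a different event

print(combined_similarity(site_a, site_b))
print(combined_similarity(site_a, site_c))
```

In this sketch a pair of records would be declared a match when the combined score exceeds a chosen threshold. In practice the fields would probably need rescaling before the Euclidean step, since coordinates and magnitudes live on different scales.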


Data integration, Earthquake disaster, Numeric data, Entity resolution



The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the article. This research was supported by the National Key Research and Development Program of China (2016YFB0501504).



Copyright information

© Institute of Geophysics, Polish Academy of Sciences & Polish Academy of Sciences 2019

Authors and Affiliations

  1. Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
