Flexible and Efficient Distributed Resolution of Large Entities

  • András J. Molnár
  • András A. Benczúr
  • Csaba István Sidló
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7153)

Abstract

Entity resolution (ER) is a computationally hard problem in data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice, these entities may consist of only a few records from different data sources, containing typos or historical data. In other cases they may contain significantly more records, especially when we search for entities at a higher level of a concept hierarchy than individual records.

In this paper we give a theoretical foundation for a variety of practically important match functions. We show that under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record matching, enabling flexible match function definitions. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
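
To make the setting concrete, the following is a minimal single-machine sketch of the general idea, not the authors' algorithm. It assumes a hypothetical match function under which two records match if they share a phone number or a postal address, emulates a map step (emit one key per attribute value) and a reduce step (group record ids by key), and then merges the groups transitively with union-find. The field names, the normalization, and the sample records are all illustrative assumptions.

```python
from collections import defaultdict


def map_phase(records):
    """Emit ((field, value), record_id) pairs, one per non-empty attribute."""
    for rec_id, rec in records.items():
        for field in ("phone", "address"):  # hypothetical match attributes
            value = rec.get(field)
            if value:
                yield (field, value.strip().lower()), rec_id


def reduce_phase(pairs):
    """Group record ids by key; ids under one key match each other."""
    groups = defaultdict(list)
    for key, rec_id in pairs:
        groups[key].append(rec_id)
    return groups.values()


def resolve(records):
    """Return entities as sorted lists of record ids (transitive closure)."""
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in reduce_phase(map_phase(records)):
        first = find(group[0])
        for other in group[1:]:
            root = find(other)
            if root != first:
                parent[root] = first  # merge the two entities

    entities = defaultdict(set)
    for rec_id in records:
        entities[find(rec_id)].add(rec_id)
    return sorted(sorted(members) for members in entities.values())


# Illustrative data: r1 and r2 share a phone, r2 and r3 share an address.
records = {
    "r1": {"phone": "555-0101", "address": "1 Main St"},
    "r2": {"phone": "555-0101", "address": "9 Oak Ave"},
    "r3": {"phone": "555-0199", "address": "9 Oak Ave"},
    "r4": {"phone": "555-0142", "address": "3 Elm Rd"},
}
entities = resolve(records)
print(entities)
```

In this sketch r1 and r3 share no attribute directly, yet they end up in the same entity through r2. Chains of this kind are one way entities can grow large, for example when grouping client records into households.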

Keywords

Phone Number; Match Function; Partial Algebra; Postal Address; Entity Resolution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • András J. Molnár (1)
  • András A. Benczúr (1)
  • Csaba István Sidló (1)
  1. Data Mining and Web Search Group, Informatics Laboratory, Institute for Computer Science and Control, Hungarian Academy of Sciences, Hungary
