Unifying logic rules and machine learning for entity enhancing

Abstract

This paper proposes a notion of entity enhancing, which unifies entity resolution and conflict resolution, to identify tuples that refer to the same real-world entity and at the same time, correct semantic inconsistencies. We propose to unify rule-based and machine learning (ML) methods for entity enhancing, by embedding ML classifiers as predicates in logic rules. We model entity enhancing by extending the chase. We show that the chase warrants correctness justification and the Church-Rosser property. Moreover, we settle fundamental problems associated with entity enhancing, including the enhancing, consistency, satisfiability, and implication problems, ranging from NP-complete and coNP-complete to $\Pi_2^p$-complete. Taken together, these provide a new theoretical framework for unifying entity resolution and conflict resolution.
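To make the idea of an ML classifier acting as a predicate inside a logic rule more concrete, the following is a minimal sketch and not the formalism of the paper. Everything in it is an assumption made for illustration: the attribute names (`name`, `zip`, `phone`), the two rules, the function names, and the use of a string-similarity score as a stand-in for a trained classifier. The loop applies the rules repeatedly until nothing changes, which is only in the spirit of a chase; it does not handle conflicting non-missing values and makes no claim about the Church-Rosser property.

```python
from difflib import SequenceMatcher


def ml_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for a trained ML classifier used as a Boolean predicate."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find(parent: dict, x: int) -> int:
    """Return the representative of x's entity class (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def chase(tuples: dict) -> tuple:
    """Apply two illustrative rules until a fixpoint is reached.

    Rule 1 (entity resolution): same zip and ML-similar names => same entity.
    Rule 2 (conflict resolution): same entity and one phone missing => copy the known phone.
    """
    tuples = {tid: dict(t) for tid, t in tuples.items()}  # work on a copy
    parent = {tid: tid for tid in tuples}
    ids = sorted(tuples)
    changed = True
    while changed:
        changed = False
        for i in ids:
            for j in ids:
                if i >= j:
                    continue
                s, t = tuples[i], tuples[j]
                ri, rj = find(parent, i), find(parent, j)
                # Rule 1: the ML predicate appears in the rule's precondition.
                if ri != rj and s["zip"] == t["zip"] and ml_similar(s["name"], t["name"]):
                    parent[max(ri, rj)] = min(ri, rj)
                    changed = True
                # Rule 2: fires only for tuples already identified as one entity.
                if find(parent, i) == find(parent, j):
                    if s["phone"] is None and t["phone"] is not None:
                        s["phone"] = t["phone"]
                        changed = True
                    elif t["phone"] is None and s["phone"] is not None:
                        t["phone"] = s["phone"]
                        changed = True
    return {tid: find(parent, tid) for tid in ids}, tuples


# Made-up records, for illustration only.
records = {
    1: {"name": "Robert Smith", "zip": "EH8 9AB", "phone": "555-0101"},
    2: {"name": "Robert Smyth", "zip": "EH8 9AB", "phone": None},
    3: {"name": "Alice Jones", "zip": "EH8 9AB", "phone": None},
}
entities, fixed = chase(records)
```

In this toy run, tuples 1 and 2 end up in the same entity class and the missing phone of tuple 2 is filled from tuple 1, while tuple 3 stays separate and unchanged. The two rules interact: the conflict-resolution rule can fire only after the matching rule has identified the tuples, which is the kind of interplay a unified treatment is meant to capture.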

Acknowledgements

This work was supported in part by the Shenzhen Institute of Computing Sciences, the Beijing Advanced Innovation Center for Big Data and Brain Computing (Beihang University), a Royal Society Wolfson Research Merit Award (Grant No. WRM/R1/180014), the European Research Council (Grant No. 652976), and the Engineering and Physical Sciences Research Council (Grant No. EP/M025268/1).

Author information

Corresponding author

Correspondence to Ping Lu.

Additional information

Professor Wenfei Fan is the chair of web data management at the University of Edinburgh, UK, the chief scientist of Shenzhen Institute of Computing Sciences, and a chief scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, China. He received his Ph.D. from the University of Pennsylvania (USA), and his M.Sc. and B.Sc. from Peking University (China). He joined the University of Edinburgh in 2004; prior to that, he was a member of technical staff at Bell Laboratories in Murray Hill, NJ, USA.

He is a foreign member of the Chinese Academy of Sciences, a fellow of the Royal Society (FRS), a fellow of the Royal Society of Edinburgh (FRSE), a member of the Academy of Europe (MAE), and an ACM Fellow (FACM). He is a recipient of the Royal Society Wolfson Research Merit Award in 2018, an ERC Advanced Fellowship in 2015, the Roger Needham Award in 2008 (UK), Yangtze River Scholar in 2007 (China), the Outstanding Overseas Young Scholar Award in 2003 (China), the Career Award in 2001 (USA), and several Test-of-Time and Best Paper Awards (Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, Best Paper Awards for SIGMOD 2017, VLDB 2010, ICDE 2007, and Computer Networks 2002).

Prof. Fan “has made fundamental contributions to both theory and practice of data management. He has both formalized the problems of querying big data and has developed radically new techniques that overcome the limits associated with conventional database systems. In addition, he has made seminal contributions to data quality, in which he devised new techniques for data cleaning that have found wide commercial adoption. He has also contributed to our understanding of semi-structured data” (cf. the Royal Society, UK). His current research interests include database theory and systems, in particular big data, data quality, data sharing, distributed computation, query languages, and social media marketing.

About this article

Cite this article

Fan, W., Lu, P. & Tian, C. Unifying logic rules and machine learning for entity enhancing. Sci. China Inf. Sci. 63, 172001 (2020). https://doi.org/10.1007/s11432-020-2917-1

  • DOI: https://doi.org/10.1007/s11432-020-2917-1

Keywords

  • logic rules
  • machine learning
  • entity enhancing
  • entity resolution
  • conflict resolution