ERABQS: entity resolution based on active machine learning and balancing query strategy

  • Research
  • Journal of Intelligent Information Systems

Abstract

Entity Resolution (ER) is a crucial process in data management and integration. Its primary goal is to identify different profiles (or records) that refer to the same real-world entity across databases. A key challenge is that labeling a large sample of profiles is expensive and time-consuming. Active Machine Learning (ActiveML) addresses this issue by selecting only the most informative or representative profile pairs to be labeled. Informativeness measures how much labeling an instance reduces the uncertainty of the model, whereas representativeness measures how well a selected instance reflects the overall input patterns of the unlabeled data. Traditional ActiveML techniques typically rely on a single strategy, which may severely restrict the performance of the ActiveML process and lead to slow convergence, especially in ER problems that lack initial training data. In this paper, we overcome this issue with an approach that balances the two strategies. The implemented solution, named EBEES (Epsilon-based Balancing Exploration and Exploitation Strategy), has two variants: Adaptive-\(\epsilon \) and \(\epsilon \)-decreasing. We evaluated EBEES on twelve datasets. Compared against state-of-the-art methods, without any initial training data, EBEES achieved better F1-scores, greater model stability, and faster convergence.
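
The balancing idea described in the abstract can be illustrated with a minimal epsilon-greedy sketch. This is not the authors' EBEES implementation, which is not shown on this page. The function names, the distance-based representativeness measure, the geometric decay schedule, and the mapping of exploration to representativeness and exploitation to uncertainty are all assumptions made for illustration.

```python
import random


def uncertainty(prob):
    """Informativeness of a pair: highest when the match probability is 0.5."""
    return 1.0 - abs(2.0 * prob - 1.0)


def representativeness(idx, features):
    """Mean similarity of one pair to all other unlabeled pairs.

    Hypothetical measure: similarity = 1 / (1 + Euclidean distance).
    """
    x = features[idx]
    others = [f for j, f in enumerate(features) if j != idx]
    if not others:
        return 0.0
    sims = [
        1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(x, f)) ** 0.5)
        for f in others
    ]
    return sum(sims) / len(sims)


def epsilon_decreasing(eps0, decay, step):
    """Assumed schedule: epsilon shrinks geometrically with each query round."""
    return eps0 * (decay ** step)


def select_query(probs, features, eps, rng):
    """Pick the index of the next pair to label.

    With probability eps, choose the most representative pair (exploration);
    otherwise choose the most uncertain pair (exploitation).
    """
    indices = range(len(probs))
    if rng.random() < eps:
        return max(indices, key=lambda i: representativeness(i, features))
    return max(indices, key=lambda i: uncertainty(probs[i]))
```

In a labeling loop, one would call `select_query(probs, features, epsilon_decreasing(1.0, 0.9, t), random.Random(0))` at round `t`, so that early rounds favor representative pairs and later rounds favor uncertain ones. An adaptive variant would instead adjust `eps` from feedback such as the change in validation F1-score; how EBEES does this is not stated in the abstract.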

Availability of supporting data

Datasets are available from GitHub.

Code availability

The code is available upon reasonable request.

Notes

  1. https://scikit-learn.org

  2. https://modal-python.readthedocs.io

  3. https://pandas.pydata.org

  4. https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md

  5. https://github.com/wbsg-uni-mannheim/UnsupervisedBootAL/tree/master/datasets

Funding

The authors did not receive any financial support for this study.

Author information

Contributions

The initial concept was jointly devised by J.M. and H.I. J.M. developed the theory and carried out the computational work. The analytical methods were verified by H.I. and T.H., while R.Y. supervised and reviewed the research outcomes. All authors discussed the results and collaborated on the final manuscript.

Corresponding author

Correspondence to Jabrane Mourad.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Competing interests

The authors declare no competing interests.

Consent for publication

All authors agree with the content of the manuscript and give explicit consent to its submission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mourad, J., Hiba, T., Yassir, R. et al. ERABQS: entity resolution based on active machine learning and balancing query strategy. J Intell Inf Syst (2024). https://doi.org/10.1007/s10844-024-00853-0

  • DOI: https://doi.org/10.1007/s10844-024-00853-0
