A two-stage entity event deduplication method based on graph node selection and node optimization strategy

  • Focus
Soft Computing

Abstract

Entity event deduplication is the task of identifying all duplicate entity events, that is, events that describe the same entity, within a set of events. Traditional entity event deduplication methods face two challenges. First, they usually rely on global comparison to find duplicates: all entity events in the dataset need to be compared with one another, which leads to low performance. Second, when an entity event evolves, they do not identify it reliably, which reduces effectiveness. To address these two problems, we propose a two-stage deduplication method based on a graph node selection and node optimization (TS-NSNO) strategy. In the first stage (TS-NS), a graph node selection strategy transforms the global comparison into a local comparison by selecting leader nodes, which greatly reduces the number of calculations and improves performance. In the second stage (TS-NO), a graph node optimization strategy combines the spatiotemporal distance and the change in entity event importance during event evolution to correct entity events that were judged incorrectly, which improves effectiveness. We conduct extensive experiments on real entity event datasets of different sizes, and the results show that our method achieves better performance and effectiveness.
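
The move from global to local comparison in the first stage can be illustrated with a minimal sketch. This is not the authors' TS-NS algorithm, whose details are not given in the abstract: it assumes each event is a set of tokens, uses Jaccard similarity as a stand-in measure, treats the first event of each group as its leader, and uses an arbitrary threshold of 0.5. The names `jaccard` and `dedup_with_leaders` are hypothetical.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_with_leaders(
    events: List[Set[str]], threshold: float = 0.5
) -> Tuple[List[List[int]], int]:
    """Group events by comparing each one only against group leaders.

    Returns the groups (lists of event indices, leader first) and the
    number of similarity computations that were performed.
    """
    groups: List[List[int]] = []
    comparisons = 0
    for i, event in enumerate(events):
        best_group, best_score = None, threshold
        for group in groups:
            leader = events[group[0]]  # assumption: first member is the leader
            score = jaccard(event, leader)
            comparisons += 1
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            groups.append([i])  # no similar leader: the event becomes a leader
        else:
            best_group.append(i)  # duplicate of an existing leader's event
    return groups, comparisons


events = [
    {"acme", "acquires", "beta", "corp"},
    {"acme", "acquires", "beta"},
    {"storm", "hits", "coast"},
    {"storm", "hits", "northern", "coast"},
]
groups, comparisons = dedup_with_leaders(events, threshold=0.5)
print(groups, comparisons)
```

On these four toy events the sketch performs 4 leader comparisons, whereas comparing every pair would take 6; the gap widens as the number of events per group grows.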

Data Availability

Data will be made available on request.

Notes

  1. The data set and dictionary built for this experiment are available at www.github.com/jiaxu-git/TS-NSNO.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61802444), the Research Foundation of Education Bureau of Hunan Province of China (Grant Nos. 22B0275, 20B625, and 18B196), and Local Community Structure Detection Algorithms in Complex Networks (Grant No. 2020YJ009).

Author information

Contributions

The authors contributed to each part of this paper equally.

Corresponding author

Correspondence to Tao Meng.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ai, W., Xu, J., Shao, H. et al. A two-stage entity event deduplication method based on graph node selection and node optimization strategy. Soft Comput (2024). https://doi.org/10.1007/s00500-023-09623-6

  • DOI: https://doi.org/10.1007/s00500-023-09623-6
