Abstract
Entity event deduplication is the task of identifying all duplicate entity events that describe the same entity within a set of events. However, traditional entity event deduplication methods face two challenges. First, they usually rely on global comparison when searching for duplicate entity events: all entity events in the dataset need to be compared, which leads to low performance. Second, when an entity event evolves, traditional methods do not identify it well, which reduces their effectiveness. To address these two problems and improve both performance and effectiveness, we propose a two-stage deduplication method based on a graph node selection and node optimization (TS-NSNO) strategy. In the first stage (TS-NS), we propose a graph node selection strategy that transforms the global comparison into a local comparison by selecting a leader node, which greatly reduces the number of calculations and improves performance. In the second stage (TS-NO), we propose a graph node optimization strategy that combines the spatiotemporal distance and the change in entity event importance during event evolution to correct entity events that were judged incorrectly, which improves effectiveness. We conduct extensive experiments on real entity event datasets of different sizes, and the results show that our method achieves better performance and effectiveness.
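The idea of replacing global comparison with local comparison against a leader can be illustrated with a minimal sketch. This is not the TS-NS algorithm itself, whose details are not given in the abstract: the token-set Jaccard similarity, the `threshold` value, the function names, and the sample events are all assumptions made for illustration. Each incoming event is compared only with the current group leaders rather than with every other event.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set Jaccard similarity (an assumed stand-in similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_leader(events: List[str], threshold: float = 0.5) -> Dict[int, List[int]]:
    """Assign each event to the first leader it resembles, else make it a leader.

    Returns a mapping from leader index to the indices in its group
    (the leader itself included). The threshold is a hypothetical value.
    """
    token_sets = [set(e.lower().split()) for e in events]
    groups: Dict[int, List[int]] = {}
    for i, tokens in enumerate(token_sets):
        # Local comparison: only leaders are examined, not all earlier events.
        for leader in groups:
            if jaccard(tokens, token_sets[leader]) >= threshold:
                groups[leader].append(i)
                break
        else:
            # No leader was similar enough, so this event starts a new group.
            groups[i] = [i]
    return groups


# Hypothetical sample events, not taken from the paper's datasets.
events = [
    "acme corp acquires beta ltd",
    "acme corp acquires beta ltd today",
    "gamma inc opens new plant",
    "gamma inc opens new plant in ohio",
]
groups = group_by_leader(events)
print(groups)
```

With n events and k leaders, this sketch performs at most n × k similarity computations instead of the roughly n²/2 needed for a full pairwise comparison, which is the kind of saving the first stage aims for.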
Data Availability
Data will be made available on request.
Notes
The dataset and dictionary built for this experiment are available at www.github.com/jiaxu-git/TS-NSNO.
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 61802444), the Research Foundation of Education Bureau of Hunan Province of China (Grant Nos. 22B0275, 20B625, and 18B196), and Local Community Structure Detection Algorithms in Complex Networks (Grant No. 2020YJ009).
Author information
Contributions
The authors contributed to each part of this paper equally.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ai, W., Xu, J., Shao, H. et al. A two-stage entity event deduplication method based on graph node selection and node optimization strategy. Soft Comput (2024). https://doi.org/10.1007/s00500-023-09623-6
DOI: https://doi.org/10.1007/s00500-023-09623-6