Abstract
Data fusion improves accuracy and supports more specific inferences by fusing and aggregating data from different sensors. However, as spatial data become increasingly complex, massive, multi-source, and heterogeneous, existing methods cannot fully meet the requirements for data integrity and fusion accuracy in some specific situations. Taking the geographical properties of spatial data into account, this paper proposes a multi-source heterogeneous spatial big data fusion method based on multiple similarity and voting decision (SDFSV). The method develops a three-step record linking algorithm to improve the quality of entity recognition in the incremental fusion of massive data. A one-time voting algorithm is then introduced so that data conflicts are significantly reduced and the accuracy of data fusion is improved. A relation deduction method based on rules and entity recognition is presented to enhance data integrity. In addition, a data traceability mechanism needs to be constructed to promote the traceability and interpretability of the fusion results. Experiments on data about Beijing medical institutions collected from 10 data sources show that SDFSV achieves improved performance.
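The abstract does not give the details of the record linking or voting algorithms, so the following Python sketch is only an illustration of the general idea, not the SDFSV method itself. It assumes a simple linking rule that combines name similarity with geographic distance, followed by a single-pass majority vote over conflicting attribute values. The function names, thresholds, field names, and sample records are all hypothetical and made up for this example.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def same_entity(a, b, name_threshold=0.8, max_distance_m=200):
    """Assumed linking rule: names are similar AND locations are close.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    return name_sim >= name_threshold and dist <= max_distance_m


def vote(records, field):
    """Single-pass majority vote over the non-missing values of one attribute."""
    values = [r[field] for r in records if r.get(field) is not None]
    return Counter(values).most_common(1)[0][0] if values else None


# Made-up records from three hypothetical sources describing one institution.
records = [
    {"source": "A", "name": "Beijing Example Hospital",
     "lat": 39.9042, "lon": 116.4074, "phone": "010-0000-0001"},
    {"source": "B", "name": "Beijing Example Hospital",
     "lat": 39.9043, "lon": 116.4075, "phone": "010-0000-0001"},
    {"source": "C", "name": "Beijing Example Hospitl",
     "lat": 39.9041, "lon": 116.4073, "phone": "010-0000-0002"},
]

# Link every record to the first one, then resolve the phone conflict by vote.
linked = [r for r in records if same_entity(records[0], r)]
fused_phone = vote(linked, "phone")
```

Using two independent signals (text and location) in the linking rule reflects the abstract's point that geographical properties help entity recognition; a real system would likely weight sources and handle ties explicitly, which this sketch does not.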
Data availability
The data used in this study were generated in the authors' own experiments.
Funding
This research was funded by the National Key Research and Development Program of China (grant number 2016YFB0501805) and the National Development and Reform Commission of China (grant number JZNYYY001).
Author information
Contributions
Each author contributed to this article.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, Z., Zhou, J. & Sun, R. A multi-source heterogeneous spatial big data fusion method based on multiple similarity and voting decision. Soft Comput 27, 2479–2492 (2023). https://doi.org/10.1007/s00500-022-07734-0
DOI: https://doi.org/10.1007/s00500-022-07734-0
Keywords
- Data fusion
- Spatial big data
- Multi-source heterogeneity
- Multiple similarity
- Voting decision