Cluster Computing, Volume 22, Supplement 1, pp 2383–2394

A distributed incremental information acquisition model for large-scale text data

  • Shengtao Sun
  • Jibing Gong
  • Albert Y. Zomaya
  • Aizhi Wu


Timely discovery and acquisition of information from incremental data on the Internet is a hot topic in the big data era. This paper presents a distributed incremental information acquisition model for large-scale text data. To achieve a lower false positive rate and higher efficiency than the traditional Bloom filter, a distributed multidimensional Bloom filter is proposed for deduplicating large-scale Web URL text data. Three Bloom-filter-based methods are compared in terms of false positive rate and response efficiency. The results show that the distributed incremental information acquisition model can achieve a high duplicate removal rate with a lower false positive rate.
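For orientation, the baseline that the abstract compares against can be sketched as a classic single-node Bloom filter used for URL deduplication. This is a minimal illustration only, not the authors' distributed multidimensional design, which the abstract does not specify. The class name `BloomFilter`, the helper `deduplicate`, the sizing parameters, and the SHA-256 double-hashing scheme are all assumptions chosen for the example.

```python
import hashlib
import math


class BloomFilter:
    """Classic single-node Bloom filter (hypothetical sketch, not the paper's variant)."""

    def __init__(self, expected_items: int, target_fp_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(target_fp_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was possibly seen before."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def deduplicate(urls, bloom):
    """Keep URLs the filter has not seen; later repeats are dropped."""
    return [u for u in urls if not bloom.add(u)]


if __name__ == "__main__":
    bloom = BloomFilter(expected_items=1000, target_fp_rate=0.001)
    crawl = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
    print(deduplicate(crawl, bloom))
```

A false positive in this setting means a URL that was never crawled is reported as a duplicate and silently skipped, which is why the false positive rate is the headline metric in the comparison.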


Big data analytics; Deduplication of large-scale text data; Distributed incremental information acquisition model; Distributed multidimensional Bloom filter; False positive rate



This work is supported by the National High Technology Research and Development 863 Program of China (No. 2015AA124102) and the Hebei Natural Science Foundation of China (No. F2015203280). Shengtao Sun also acknowledges the Chinese Scholarship Council (No. 201608130030) for a visiting scholarship at the University of Sydney. The authors thank Lin Zhang, Yi Zhao and Lili Wang of the Knowledge Engineering research group (KEG) at Yanshan University for their work.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, People’s Republic of China
  2. School of Information Technologies, University of Sydney, Sydney, Australia
  3. College of Vehicle and Energy, Yanshan University, Qinhuangdao, People’s Republic of China
  4. Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, and Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao, People’s Republic of China
