Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce

Lv, Peipei; Yang, Peng; Dong, Yong-Qiang; Gu, Liang

doi:10.1007/978-3-030-05051-1_3


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11334)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

With the continuous development of Web technology, many Internet issues have evolved into Big Data problems, characterized by volume, variety, velocity and variability. Among them, how to organize the large number of web pages and retrieve the information needed is a critical one. An important notion is document classification, in which the nearest neighbors query is the key issue to be solved. Most parallel nearest neighbors query methods adopt a Cartesian product between the training set and the testing set, resulting in poor time efficiency. In this paper, two methods are proposed for document nearest neighbors query based on pairwise similarity, i.e., brute-force and pre-filtering. Brute-force consists of two phases (copying and filtering) and is carried out in one MapReduce procedure. In order to obtain the nearest neighbors of each document, each document pair is copied twice and all generated records are shuffled. However, the time efficiency of the shuffle is sensitive to the number of intermediate results. To reduce the intermediate results, pre-filtering is proposed for nearest neighbors query based on pairwise similarity. Since only the top-k neighbors are output for each document, the size of the shuffled records is kept at the same order of magnitude as the input size in pre-filtering. Additionally, a detailed theoretical analysis is provided. The performance of the algorithms is demonstrated by experiments on a real-world dataset.
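The two methods named in the abstract can be illustrated with a small in-memory simulation. This is only a sketch of one plausible reading of the abstract, not the authors' implementation: the input is assumed to be precomputed (doc_a, doc_b, similarity) triples, pre-filtering is assumed to mean keeping a local top-k per document within each input split before the shuffle, and all function names (copy_pair, top_k, brute_force, pre_filtering) are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str, float]   # (doc_a, doc_b, similarity), assumed input format
Neighbor = Tuple[str, float]    # (neighbor_id, similarity)


def copy_pair(pair: Pair) -> List[Tuple[str, Neighbor]]:
    """Copying phase: emit each pair twice, once keyed by each document."""
    a, b, sim = pair
    return [(a, (b, sim)), (b, (a, sim))]


def top_k(neighbors: Iterable[Neighbor], k: int) -> List[Neighbor]:
    """Filtering phase: keep the k most similar neighbors, best first.

    Ties on similarity are broken by neighbor id so the order is deterministic.
    """
    return heapq.nlargest(k, neighbors, key=lambda n: (n[1], n[0]))


def brute_force(pairs: Iterable[Pair], k: int) -> Tuple[Dict[str, List[Neighbor]], int]:
    """Shuffle every copied record, then filter on the reduce side."""
    shuffled: Dict[str, List[Neighbor]] = defaultdict(list)
    n_shuffled = 0
    for pair in pairs:
        for doc, neighbor in copy_pair(pair):
            shuffled[doc].append(neighbor)   # stands in for the shuffle
            n_shuffled += 1
    return {doc: top_k(ns, k) for doc, ns in shuffled.items()}, n_shuffled


def pre_filtering(splits: Iterable[Iterable[Pair]], k: int) -> Tuple[Dict[str, List[Neighbor]], int]:
    """Assumed variant: keep a local top-k per document in each split,
    shuffle only those records, then take the global top-k."""
    shuffled: Dict[str, List[Neighbor]] = defaultdict(list)
    n_shuffled = 0
    for split in splits:
        local: Dict[str, List[Neighbor]] = defaultdict(list)
        for pair in split:
            for doc, neighbor in copy_pair(pair):
                local[doc].append(neighbor)
        for doc, ns in local.items():
            kept = top_k(ns, k)              # at most k records per document per split
            shuffled[doc].extend(kept)
            n_shuffled += len(kept)
    return {doc: top_k(ns, k) for doc, ns in shuffled.items()}, n_shuffled


# Toy input: all six pairs among four documents (made-up similarities).
pairs = [
    ("d1", "d2", 0.9), ("d1", "d3", 0.4), ("d2", "d3", 0.7),
    ("d1", "d4", 0.8), ("d2", "d4", 0.1), ("d3", "d4", 0.5),
]
bf_result, bf_count = brute_force(pairs, k=2)
pf_result, pf_count = pre_filtering([pairs], k=2)
```

In this toy run both functions return the same neighbor lists, while the pre-filtering variant passes fewer records through the simulated shuffle. A document's global top-k is always contained in the union of its per-split top-k lists, which is why the early filtering does not change the result under these assumptions.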

Notes

1. http://www.sogou.com/labs/dl/c.html

Acknowledgment

This work is supported by the National Science Foundation of China under grants No. 61472080, No. 61672155 and No. 61272532, the Consulting Project of Chinese Academy of Engineering under grant 2018-XY-07, the National High Technology Research and Development Program (863 Program) of China under grant No. 2013AA013503, and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, China
Peipei Lv, Peng Yang, Yong-Qiang Dong & Liang Gu
Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education, Nanjing, China
Peipei Lv, Peng Yang, Yong-Qiang Dong & Liang Gu

Corresponding author

Correspondence to Peng Yang.

Editor information

Editors and Affiliations

Rutgers University, Newark, NJ, USA
Jaideep Vaidya
Guangzhou University, Guangzhou, China
Jin Li

Cite this paper

Lv, P., Yang, P., Dong, YQ., Gu, L. (2018). Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11334. Springer, Cham. https://doi.org/10.1007/978-3-030-05051-1_3

DOI: https://doi.org/10.1007/978-3-030-05051-1_3
Published: 07 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05050-4
Online ISBN: 978-3-030-05051-1
eBook Packages: Computer Science, Computer Science (R0)
