A Simhash-Based Generalized Framework for Citation Matching in MapReduce

Wang, Pengsen; Wu, Bin; Li, Xiaoming; Wang, Lin; Wang, Bai

doi:10.1007/978-3-319-25660-3_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9441)

Abstract

Citation matching is the task of identifying a cited paper from the small amount of information contained in a citation. Most existing solutions rely on expensive model processing to achieve good results, and they may not be efficient enough when dealing with millions of citations in large digital libraries. To address this problem, we propose a simhash-based generalized framework for citation matching in MapReduce. The framework implements citation matching with exact title matching and distance-based short-text similarity metrics. To improve accuracy, it supports customized citation fields, citation field weights and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework in Hadoop, a popular parallel computation platform. We run experiments on real datasets from a Chinese Medicine Digital Library and a comparative experiment on the Cora corpus (McCallum’s citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
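
As a rough illustration of the idea described in the abstract, the following sketch computes a field-weighted simhash fingerprint for a citation and compares two citations by exact title match first, then by Hamming distance. It is not the authors' implementation: the field names, the weights in FIELD_WEIGHTS, the MD5-based token hash, the 64-bit fingerprint size and the max_distance threshold are all assumptions chosen for the example, and the MapReduce distribution step is omitted.

```python
import hashlib
import re

# Hypothetical field weights; the paper derives weights heuristically,
# these values are placeholders for illustration only.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "venue": 1.0, "year": 1.0}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_hash(token, bits=64):
    """Map a token to a stable integer of `bits` bits (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(citation, weights=FIELD_WEIGHTS, bits=64):
    """Weighted simhash: each token votes +w or -w on every bit position."""
    totals = [0.0] * bits
    for field, text in citation.items():
        w = weights.get(field, 1.0)
        for tok in tokenize(text):
            h = token_hash(tok, bits)
            for i in range(bits):
                totals[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def normalize_title(title):
    """Case- and punctuation-insensitive form used for exact title matching."""
    return " ".join(tokenize(title))


def is_match(c1, c2, max_distance=8):
    """Exact title match first, then fall back to fingerprint distance."""
    t1, t2 = c1.get("title", ""), c2.get("title", "")
    if t1 and normalize_title(t1) == normalize_title(t2):
        return True
    return hamming(simhash(c1), simhash(c2)) <= max_distance


a = {"title": "A Simhash-Based Framework", "authors": "Wang, P.", "year": "2015"}
b = {"title": "a simhash based framework.", "authors": "P. Wang"}
print(is_match(a, b))
```

In a MapReduce setting, one plausible arrangement would be for mappers to emit fingerprints keyed by a block of fingerprint bits so that reducers only compare citations that share a block; that partitioning scheme is likewise an assumption and is not taken from the abstract.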

Acknowledgments

This work is supported in part by the National Key Basic Research and Development (973) Program of China (No. 2013CB329606) and the Co-construction Project of Beijing Municipal Commission of Education.

Author information

Authors and Affiliations

Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Pengsen Wang, Bin Wu, Xiaoming Li, Lin Wang & Bai Wang

Corresponding author

Correspondence to Pengsen Wang.

Editor information

Editors and Affiliations

Institute of Infocomm Research, Singapore, Singapore
Xiao-Li Li
Ho Chi Minh City University of Tech, Ho Chi Minh City, Vietnam
Tru Cao
School of Information Systems, Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi-shi, Ishikawa, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, China
David Cheung

About this paper

Cite this paper

Wang, P., Wu, B., Li, X., Wang, L., Wang, B. (2015). A Simhash-Based Generalized Framework for Citation Matching in MapReduce. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_7

DOI: https://doi.org/10.1007/978-3-319-25660-3_7
Published: 26 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)
