Abstract
Citation matching is the task of finding the cited paper given only a small amount of information about it. Several approaches have been proposed, but most require expensive model processing to achieve good results, and they may not be efficient enough when dealing with millions of citations in large digital libraries. To address this problem, we propose a simhash-based generalized framework for citation matching in MapReduce. The framework implements citation matching using exact title matching and distance-based short-text similarity metrics. To improve accuracy, it supports customizable citation fields, citation field weights, and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework in Hadoop, a popular parallel computation platform. We conduct experiments on real datasets from a Chinese Medicine Digital Library, together with a comparative experiment on the Cora corpus (McCallum’s citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
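As an illustration of the idea summarized in the abstract, the following is a minimal single-machine sketch of exact title matching combined with a field-weighted simhash and a Hamming-distance threshold. It is not the authors' implementation: the field names, the weights in `FIELD_WEIGHTS`, the use of MD5 as the token hash, whitespace tokenization, and the threshold `max_distance=3` are all assumptions made for this example, and the MapReduce distribution and the heuristic weight calculation are not shown.

```python
import hashlib
import re

# Hypothetical field weights; the paper computes weights heuristically,
# these fixed values are assumptions for illustration only.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "venue": 1.0, "year": 1.0}


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def token_hash(token, bits=64):
    """Map a token to a `bits`-wide integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def weighted_tokens(citation, field_weights=FIELD_WEIGHTS):
    """Give each token the summed weight of the fields it appears in."""
    weights = {}
    for field, text in citation.items():
        w = field_weights.get(field, 1.0)
        for tok in normalize(text).split():
            weights[tok] = weights.get(tok, 0.0) + w
    return weights


def simhash(weights, bits=64):
    """Weighted simhash: add +w or -w per bit, keep the positive positions."""
    acc = [0.0] * bits
    for tok, w in weights.items():
        h = token_hash(tok, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_match(c1, c2, max_distance=3):
    """Exact normalized-title match first, then a simhash distance check."""
    t1 = normalize(c1.get("title", ""))
    t2 = normalize(c2.get("title", ""))
    if t1 and t1 == t2:
        return True
    f1 = simhash(weighted_tokens(c1))
    f2 = simhash(weighted_tokens(c2))
    return hamming_distance(f1, f2) <= max_distance
```

In this sketch, two citation records whose titles differ only in case or punctuation match through the exact-title path, while all other pairs fall back to comparing fingerprints. A suitable threshold depends on the fingerprint width and the data, so the value used here should be treated as a placeholder.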
Acknowledgments
This work is supported in part by the National Key Basic Research and Development (973) Program of China (No. 2013CB329606), and the Co-construction Project of Beijing Municipal Commission of Education.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, P., Wu, B., Li, X., Wang, L., Wang, B. (2015). A Simhash-Based Generalized Framework for Citation Matching in MapReduce. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_7
DOI: https://doi.org/10.1007/978-3-319-25660-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)