A Simhash-Based Generalized Framework for Citation Matching in MapReduce

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9441)

Abstract

Citation matching is the task of identifying the cited paper from only a small amount of information. Most existing solutions rely on expensive model processing to achieve good results, so they may not be efficient enough when dealing with millions of citations in large digital libraries. To address this problem, we propose a simhash-based generalized framework for citation matching in MapReduce. The framework implements citation matching with exact title matching and distance-based short-text similarity metrics. To improve accuracy, it supports customizable citation fields, citation field weights, and word segmentation weights. We also design a heuristic algorithm that automatically calculates the weight of each citation field. To handle large-scale datasets, we implement the framework in Hadoop, a popular parallel computation platform. We run experiments on real datasets from a Chinese Medicine Digital Library, and a comparative experiment on the Cora corpus (McCallum’s citation matching test set). The experimental results confirm the efficiency and effectiveness of our framework.
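The general idea of combining exact title matching with a field-weighted simhash fingerprint can be illustrated with a small single-machine sketch. This is not the authors' implementation: the field names, the weights in `FIELD_WEIGHTS`, the use of MD5 as the per-token hash, the whitespace tokenizer, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration, and the MapReduce distribution and the heuristic weight calculation are omitted.

```python
import hashlib
import re

# Hypothetical field weights; the paper computes weights heuristically.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "venue": 1.0, "year": 1.0}


def normalize(text):
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def token_hash(token, bits=64):
    """Map a token to a `bits`-wide integer (MD5 is an assumption here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_tokens, bits=64):
    """Weighted simhash: each token votes +weight or -weight per bit."""
    counts = [0.0] * bits
    for token, weight in weighted_tokens:
        h = token_hash(token, bits)
        for i in range(bits):
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def citation_fingerprint(citation, field_weights=FIELD_WEIGHTS):
    """Fingerprint a citation dict, weighting tokens by their field."""
    weighted = []
    for field, weight in field_weights.items():
        for tok in normalize(citation.get(field, "")):
            weighted.append((tok, weight))
    return simhash(weighted)


def is_match(a, b, max_distance=3):
    """Exact title match first, then fall back to fingerprint distance."""
    title_a = normalize(a.get("title", ""))
    title_b = normalize(b.get("title", ""))
    if title_a and title_a == title_b:
        return True
    fa, fb = citation_fingerprint(a), citation_fingerprint(b)
    return hamming(fa, fb) <= max_distance


if __name__ == "__main__":
    c1 = {"title": "Detecting near-duplicates for web crawling",
          "authors": "Manku Jain Das Sarma", "venue": "WWW", "year": "2007"}
    c2 = {"title": "Detecting Near-Duplicates for Web Crawling.",
          "authors": "G. S. Manku, A. Jain", "year": "2007"}
    print(is_match(c1, c2))
```

In a MapReduce setting, the fingerprinting step would typically run in the map phase, with candidate pairs grouped by shared fingerprint segments before the distance check; that grouping is left out of this sketch.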

Acknowledgments

This work is supported in part by the National Key Basic Research and Development (973) Program of China (No. 2013CB329606), and the Co-construction Project of Beijing Municipal Commission of Education.

Author information

Corresponding author

Correspondence to Pengsen Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, P., Wu, B., Li, X., Wang, L., Wang, B. (2015). A Simhash-Based Generalized Framework for Citation Matching in MapReduce. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_7

  • DOI: https://doi.org/10.1007/978-3-319-25660-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25659-7

  • Online ISBN: 978-3-319-25660-3

  • eBook Packages: Computer Science, Computer Science (R0)