Large Scale Citation Matching Using Apache Hadoop

  • Mateusz Fedoryszak
  • Dominika Tkaczyk
  • Łukasz Bolikowski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8092)

Abstract

Citation matching is the process of creating links from bibliography entries to the publications they reference. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.
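
The combination of indexing and MapReduce mentioned above can be illustrated with a toy, in-memory sketch. It is not the method presented in the paper: the token-based index, the shared-token count used as a similarity score, and all names (`tokens`, `map_phase`, `shuffle`, `reduce_phase`, `match`, `min_shared`) are assumptions made for illustration. A real system would run the map and reduce steps as Hadoop jobs and would likely use a stronger similarity measure.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercased alphanumeric tokens of at least 3 characters (hypothetical index keys)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 3}


def map_phase(records):
    """Emit (token, (kind, record_id)) pairs; kind is 'doc' or 'cit'."""
    for kind, rec_id, text in records:
        for tok in tokens(text):
            yield tok, (kind, rec_id)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Within each token group, emit one count per (citation, document) pair."""
    for values in groups.values():
        docs = [i for kind, i in values if kind == "doc"]
        cits = [i for kind, i in values if kind == "cit"]
        for c in cits:
            for d in docs:
                yield (c, d), 1


def match(documents, citations, min_shared=2):
    """Link each citation to the document sharing the most tokens with it."""
    records = [("doc", i, t) for i, t in documents.items()]
    records += [("cit", i, t) for i, t in citations.items()]
    shared = defaultdict(int)
    for pair, one in reduce_phase(shuffle(map_phase(records))):
        shared[pair] += one
    best = {}
    for (c, d), n in shared.items():
        # Keep a candidate only if it clears the threshold and beats the current best.
        if n >= min_shared and n > best.get(c, (None, 0))[1]:
            best[c] = (d, n)
    return {c: d for c, (d, _) in best.items()}


documents = {
    "d1": "MapReduce: Simplified Data Processing on Large Clusters",
    "d2": "The Hungarian method for the assignment problem",
}
citations = {
    "c1": "Dean, J., Ghemawat, S.: MapReduce simplified data processing, 2004",
    "c2": "Kuhn H.W. Hungarian method assignment problem 1955",
}
print(match(documents, citations))  # {'c1': 'd1', 'c2': 'd2'}
```

Because only records that share an index key meet in the same reduce group, a citation is never compared against the whole document collection, which is the general reason indexing helps such a task scale.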

Keywords

citation matching, approximate indexing, MapReduce, Hadoop, CRF, SVM

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Mateusz Fedoryszak (1)
  • Dominika Tkaczyk (1)
  • Łukasz Bolikowski (1)
  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland
