Content Analysis of Scientific Articles in Apache Hadoop Ecosystem

Part of the Studies in Computational Intelligence book series (SCI, volume 541)

Abstract

Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization, and citation matching of scientific publications. The size of the input data places these tasks among big data problems, which can be solved efficiently on Hadoop clusters.

Keywords

  • Hadoop
  • Big data
  • Text mining
  • Citation matching
  • Document similarity
  • Document classification
  • CoAnSys

Author information

Correspondence to Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier or Łukasz Bolikowski.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Dendek, P.J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., Bolikowski, Ł. (2014). Content Analysis of Scientific Articles in Apache Hadoop Ecosystem. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_10

  • DOI: https://doi.org/10.1007/978-3-319-04714-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-04713-3

  • Online ISBN: 978-3-319-04714-0

  • eBook Packages: Engineering, Engineering (R0)