Content Analysis of Scientific Articles in Apache Hadoop Ecosystem

Chapter

Abstract

The Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization, and citation matching of scientific publications. The size of the input data places these algorithms among big data problems, which can be solved efficiently on Hadoop clusters.

Keywords

Hadoop, Big data, Text mining, Citation matching, Document similarity, Document classification, CoAnSys

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
