A Novel Clustering Approach Using Hadoop Distributed Environment

  • Nagesh Vadaparthi
  • P. Srinivas Rao
  • Y. Srinivas
  • M. Athmaja
Part of the SpringerBriefs in Applied Sciences and Technology book series (BRIEFSAPPLSCIENCES)


Information retrieval plays a vital role by allowing users to retrieve documents of interest ranked by a relevance score. Such systems can be implemented on either distributed or parallel systems to achieve high throughput. When such a framework is deployed in a cloud, grouping relevant documents is essential for retrieving the documents of interest. Hence, an efficient and scalable clustering method is required to process a huge volume of documents. Apache Hadoop, with its MapReduce programming model, handles large document collections efficiently and provides scalability during processing. Hence, in this paper, a novel approach is proposed that is capable of clustering bulk data with high throughput. This paper also demonstrates the need for a parallel caching approach to obtain effective results.
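To illustrate how a clustering step can be expressed in the MapReduce style mentioned above, the following is a minimal sketch of one k-means iteration written as a map phase and a reduce phase in plain Python. It is not the approach proposed in this chapter, which the abstract does not detail; k-means is swapped in purely as a familiar example, and the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical. On Hadoop, the map and reduce phases would run in parallel across HDFS blocks rather than in one process.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (cluster_index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each cluster into a new centroid."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    # Assumption: a cluster that receives no points keeps its previous centroid.
    new_centroids = list(centroids)
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans_iteration(points, centroids):
    """Run one map step followed by one reduce step."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

Calling `kmeans_iteration` repeatedly until the centroids stop changing gives the usual k-means loop; in a distributed setting each iteration would correspond to one MapReduce job.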


Data clustering, Parallel computing, Hadoop, HDFS, MapReduce



Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Nagesh Vadaparthi (1, email author)
  • P. Srinivas Rao (1)
  • Y. Srinivas (2)
  • M. Athmaja (3)
  1. MVGR College of Engineering, Vizianagaram, India
  2. GIT, GITAM University, Visakhapatnam, India
  3. Tata Consultancy Services, Hyderabad, India
