A Novel Map-Reduce Based Augmented Clustering Algorithm for Big Text Datasets

Kanimozhi, K. V.; Venkatesan, M.

doi:10.1007/978-981-10-3223-3_41


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 542)

Abstract

Text clustering is a well-known technique for improving the quality of information retrieval. Real-world data is rarely organized in the manner required for precise mining, so a large collection of unstructured text documents must first be organized into clusters of related documents. Extracting compact and meaningful insights from such collections remains a contemporary challenge. Although many frequent item mining algorithms have been proposed, most do not scale to “Big Data” and take considerable processing time. This paper presents a highly scalable, fast, and efficient map-reduce based augmented clustering algorithm that uses bivariate n-gram frequent items to reduce high dimensionality and derive high-quality clusters for big text documents. A comparative analysis on sample text datasets shows that the proposed algorithm performs better with stop word removal than without it.
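
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that "bivariate n-gram" means a pair of adjacent words (a bigram), that support is counted as the number of documents containing the bigram, and that the stop word list, the `min_support` threshold, and all function names are hypothetical. The map and reduce phases are simulated in plain Python rather than run on a real cluster.

```python
from itertools import chain

# Assumption: a tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "and", "to", "in", "for"})


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, keep alphabetic tokens only."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def map_bigrams(doc_id, text, remove_stop_words=True):
    """Map phase: emit (bigram, doc_id) once per distinct bigram in a document."""
    tokens = tokenize(text, remove_stop_words)
    return [(bigram, doc_id) for bigram in set(zip(tokens, tokens[1:]))]


def reduce_bigrams(pairs, min_support=2):
    """Reduce phase: group document ids by bigram and keep the frequent ones."""
    postings = {}
    for bigram, doc_id in pairs:
        postings.setdefault(bigram, set()).add(doc_id)
    return {b: ids for b, ids in postings.items() if len(ids) >= min_support}


docs = {
    1: "text clustering of big data",
    2: "the text clustering algorithm",
    3: "map reduce for big data",
}

pairs = chain.from_iterable(map_bigrams(i, t) for i, t in docs.items())
frequent = reduce_bigrams(pairs)

for bigram, doc_ids in sorted(frequent.items()):
    print(bigram, sorted(doc_ids))
```

Each frequent bigram maps to the set of documents that contain it, which can serve as a candidate cluster. With stop word removal enabled, "clustering of big" does not produce bigrams involving "of", so the feature space stays smaller than it would without removal.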

Acknowledgements

We thank the anonymous reviewers for their useful comments and helpful suggestions.

Author information

Authors and Affiliations

School of Computing Science and Engineering, VIT University, Vellore, Tamilnadu, India
K. V. Kanimozhi & M. Venkatesan

Corresponding author

Correspondence to K. V. Kanimozhi.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, Andhra Pradesh, India
Suresh Chandra Satapathy
Shri Ramswaroop Memorial Group of Professional Colleges (SRMGPC), Lucknow, Uttar Pradesh, India
Vikrant Bhateja
Department of Computer Science and Engineering, CMR Technical Campus, Hyderabad, India
K. Srujan Raju
DVR and Dr. HS MIC College of Technology, Kanchikacherla, Andhra Pradesh, India
B. Janakiramaiah

Cite this paper

Kanimozhi, K.V., Venkatesan, M. (2018). A Novel Map-Reduce Based Augmented Clustering Algorithm for Big Text Datasets. In: Satapathy, S., Bhateja, V., Raju, K., Janakiramaiah, B. (eds) Data Engineering and Intelligent Computing. Advances in Intelligent Systems and Computing, vol 542. Springer, Singapore. https://doi.org/10.1007/978-981-10-3223-3_41

DOI: https://doi.org/10.1007/978-981-10-3223-3_41
Published: 01 June 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3222-6
Online ISBN: 978-981-10-3223-3
eBook Packages: Engineering, Engineering (R0)