Abstract
Text clustering is a well-known technique for improving the quality of information retrieval. In today's real world, data is not organized in the manner that precise mining requires; given a large collection of unstructured text documents, it is essential to organize them into clusters of related documents. Extracting compact and meaningful insights from large collections of unstructured text documents is a contemporary challenge. Although many frequent-item mining algorithms have been proposed, most do not scale to “Big Data” and require long processing times. This paper presents a highly scalable, fast, and efficient MapReduce-based augmented clustering algorithm that uses bivariate n-gram frequent items to reduce high dimensionality and derive high-quality clusters for big text documents. A comparative analysis on sample text datasets shows that the proposed algorithm performs better with stop-word removal than without it.
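To make the idea of map-reduce style frequent n-gram mining concrete, the following is a minimal single-machine sketch in Python. It is not the authors' algorithm: it assumes that "bivariate n-gram" can be read as a bigram (a pair of adjacent terms), and the names STOP_WORDS, map_bigrams, reduce_counts, frequent_bigrams, and the min_support parameter are hypothetical, introduced only for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical stop-word list, chosen only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "for"}


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def map_bigrams(doc, remove_stop_words=True):
    """Map step: emit each distinct bigram of one document with count 1."""
    tokens = tokenize(doc, remove_stop_words)
    return Counter(set(zip(tokens, tokens[1:])))


def reduce_counts(left, right):
    """Reduce step: merge two partial document-frequency counters."""
    return left + right


def frequent_bigrams(docs, min_support=2, remove_stop_words=True):
    """Keep bigrams that occur in at least min_support documents."""
    partials = (map_bigrams(d, remove_stop_words) for d in docs)
    totals = reduce(reduce_counts, partials, Counter())
    return {bg: n for bg, n in totals.items() if n >= min_support}


if __name__ == "__main__":
    docs = [
        "the big data mining",
        "big data clustering",
        "text clustering of big data",
    ]
    print(frequent_bigrams(docs, min_support=2))
```

In this sketch, documents that share a frequent bigram could then be grouped into a candidate cluster, and stop-word removal shrinks the set of candidate bigrams before counting. On a real cluster the map and reduce steps would run in a distributed framework rather than through functools.reduce.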
Acknowledgements
We would like to thank the anonymous reviewers for their useful comments and helpful suggestions.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kanimozhi, K.V., Venkatesan, M. (2018). A Novel Map-Reduce Based Augmented Clustering Algorithm for Big Text Datasets. In: Satapathy, S., Bhateja, V., Raju, K., Janakiramaiah, B. (eds) Data Engineering and Intelligent Computing. Advances in Intelligent Systems and Computing, vol 542. Springer, Singapore. https://doi.org/10.1007/978-981-10-3223-3_41
DOI: https://doi.org/10.1007/978-981-10-3223-3_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3222-6
Online ISBN: 978-981-10-3223-3
eBook Packages: Engineering (R0)