Document representation based on probabilistic word clustering in customer-voice classification

Lee, Younghoon; Song, Seokmin; Cho, Sungzoon; Choi, Jinhae

doi:10.1007/s10044-018-00772-1

Industrial and commercial application
Published: 18 January 2019

Volume 22, pages 221–232, (2019)
Pattern Analysis and Applications

Younghoon Lee (ORCID: orcid.org/0000-0003-4199-936X)¹,²,
Seokmin Song¹,
Sungzoon Cho¹ &
…
Jinhae Choi²

Abstract

Customer-voice data play an important role in fields such as marketing, product planning, and quality assurance. However, classifying customer-voice data is problematic because of the manual processes involved. This study focuses on building automatic classifiers for customer-voice data using newly proposed document representation methods based on neural embedding and probabilistic word clustering. Semantically similar terms are grouped into a common cluster: the word vectors produced by neural embedding are clustered according to the membership strength of each word relative to each cluster, derived from a probabilistic clustering method such as fuzzy C-means or a Gaussian mixture model. Because it takes membership strength into account, the proposed method is expected to be suitable for classifying customer-voice data consisting of unstructured text. The results show that the proposed method achieved an accuracy of 89.24% with respect to representational effectiveness and an accuracy of 87.76% with respect to classification performance on customer-voice data consisting of 12 classes. The method also provided an intuitive interpretation of the generated representation.
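
The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation; the full method is not given on this page. The sketch assumes that word vectors are already available (here, synthetic vectors stand in for neural embeddings), soft-clusters them with a plain NumPy fuzzy C-means, and represents a document as the average membership vector of its words. The vocabulary, the averaging step, and the names `fuzzy_c_means` and `document_vector` are all assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy C-means. Returns (centers, U), where U[i, k] is the
    membership strength of point i in cluster k and each row sums to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # random initial memberships
    for _ in range(n_iter):
        W = U ** m                               # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)           # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def document_vector(tokens, vocab_index, U):
    """Assumed representation: mean membership vector of the known words."""
    rows = [vocab_index[t] for t in tokens if t in vocab_index]
    if not rows:
        return np.zeros(U.shape[1])
    return U[rows].mean(axis=0)

# Hypothetical vocabulary; synthetic vectors stand in for neural embeddings.
vocab = ["battery", "charger", "power", "screen", "display", "pixel"]
vocab_index = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(42)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(3, 4)),           # "power" related words
    rng.normal(5.0, 0.1, size=(3, 4)),           # "display" related words
])

centers, U = fuzzy_c_means(embeddings, n_clusters=2)
doc_vec = document_vector(["battery", "charger", "screen"], vocab_index, U)
print(np.round(U, 3))
print(np.round(doc_vec, 3))
```

Each entry of `doc_vec` can be read as how strongly the document leans toward one word cluster, which is one way to see why a membership-based representation is easy to interpret. A Gaussian mixture model could replace the fuzzy C-means step by using posterior cluster probabilities as memberships.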

Acknowledgements

I would like to express my appreciation to LG Electronics, which provided the customer-voice dataset used in the experiments section of our study.

Author information

Authors and Affiliations

Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 151-742, Korea
Younghoon Lee, Seokmin Song & Sungzoon Cho
Data Driven User Experience Team, Mobile Communication Lab, LG Electronics, 56 Digitalro, Geumcheon-gu, Seoul, 153-802, Korea
Younghoon Lee & Jinhae Choi

Corresponding author

Correspondence to Sungzoon Cho.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lee, Y., Song, S., Cho, S. et al. Document representation based on probabilistic word clustering in customer-voice classification. Pattern Anal Applic 22, 221–232 (2019). https://doi.org/10.1007/s10044-018-00772-1

Received: 08 June 2017
Accepted: 26 December 2018
Published: 18 January 2019
Issue Date: 05 February 2019
DOI: https://doi.org/10.1007/s10044-018-00772-1
