Abstract
In question datasets, the same question often appears several times in different wordings, because natural language allows a question to be phrased in many ways. Identifying such related questions manually is time-consuming. Computational methods are therefore needed to group similar questions into clusters based on their semantic similarity. The task remains challenging, however, because the information contained in a single question may be insufficient to cluster it reliably. In this research, canopy clustering is employed as a preliminary step for K-means clustering, and the result is compared with a hierarchical clustering approach. The Quora questions dataset is used in the experiments to identify pairs of similar questions. In terms of F1 score and the Rand statistic, the results demonstrate that the Hierarchical-K-means approach yields better cluster validity measures than the Canopy-K-means approach. In addition to identifying matches, the Canopy approach provides the top related questions that share the same intent, grouped in the same cluster across several canopies.
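As a rough illustration of the Canopy-K-means idea described above, the following minimal sketch uses canopy centers as the initial centroids for K-means. It is not the authors' implementation: the function names, the thresholds `t1` and `t2`, the Euclidean distance, and the toy 2-D vectors (standing in for question embeddings) are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_centers(points, t1, t2):
    """Build canopies with a loose threshold t1 and a tight threshold t2.

    Returns a list of (center_index, member_indices). Points closer than
    t2 to a center are removed from the candidate list; points between
    t2 and t1 stay candidates, so canopies may overlap.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # hypothetical choice: first remaining point
        members = [i for i in remaining
                   if euclidean(points[center], points[i]) < t1]
        canopies.append((center, members))
        remaining = [i for i in remaining
                     if euclidean(points[center], points[i]) >= t2]
    return canopies

def kmeans(points, centroids, iterations=10):
    """Plain K-means starting from the given initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: euclidean(p, centroids[k]))
                  for p in points]
        # Update step: each centroid becomes the mean of its points.
        for k in range(len(centroids)):
            assigned = [p for p, lab in zip(points, labels) if lab == k]
            if assigned:
                centroids[k] = [sum(col) / len(assigned)
                                for col in zip(*assigned)]
    return labels

# Toy vectors (assumed data, not taken from the Quora dataset).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]

canopies = canopy_centers(points, t1=2.0, t2=1.0)
initial = [points[center] for center, _ in canopies]
labels = kmeans(points, initial)
print(len(canopies), labels)
```

In this sketch the number of canopies also determines K, which is one common reason for using canopy clustering as a pre-clustering step; the thresholds would need tuning for real question vectors.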
Data availability statement
The data is available upon request from the authors.
Funding
No funding was received.
Author information
Contributions
MA: Data gathering, writing paper, programming, results discussion. GA-N: Supervision, review of paper, advice for programming.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Alian, M., Al-Naymat, G. Questions clustering using canopy-K-means and hierarchical-K-means clustering. Int. j. inf. tecnol. 14, 3793–3802 (2022). https://doi.org/10.1007/s41870-022-01012-w
DOI: https://doi.org/10.1007/s41870-022-01012-w