Questions clustering using canopy-K-means and hierarchical-K-means clustering

Alian, Marwah; Al-Naymat, Ghazi

doi:10.1007/s41870-022-01012-w

Original Research
Published: 22 June 2022

Volume 14, pages 3793–3802 (2022)

International Journal of Information Technology

193 Accesses
3 Citations

Abstract

Question datasets often contain duplicates, because the flexibility of natural language allows the same question to be written in many different forms. Extracting related questions manually is time-consuming, so computational methods are needed to group similar questions into clusters based on their semantic similarity. The task remains challenging because the information contained in a single question may be insufficient to cluster it effectively. In this research, canopy clustering is employed as a preliminary step before K-means clustering, and the result is compared with a hierarchical clustering approach. The Quora questions dataset is used in the experiments to identify similar question pairs. In terms of F1 score and the Rand statistic, the results show that the Hierarchical-K-means approach yields better cluster validity measures than the Canopy-K-means approach. In addition to identifying matches, the Canopy approach returns the top related questions that share the same intent, grouping them in the same cluster across several canopies.
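The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes questions have already been mapped to numeric vectors (the toy 2-D points below are made up for illustration), uses Euclidean distance, and picks canopy centers deterministically rather than at random. The function names, the thresholds `t1` and `t2`, and the data are all hypothetical. It shows canopy clustering supplying initial centroids for K-means, followed by pair-counting versions of the Rand statistic and F1 score.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def canopy_clustering(points, t1, t2):
    """Build canopies with a loose threshold t1 and a tight threshold t2 (t1 > t2).

    Returns a list of (center_index, member_indices). A point may belong to
    several canopies, but points within t2 of a center cannot become centers.
    """
    assert t1 > t2 >= 0
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        c = remaining[0]  # assumption: first remaining point instead of a random pick
        members = [i for i in range(len(points)) if dist(points[c], points[i]) <= t1]
        canopies.append((c, members))
        remaining = [i for i in remaining if dist(points[c], points[i]) > t2]
    return canopies


def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        for k in range(len(centroids)):
            cluster = [p for p, lab in zip(points, labels) if lab == k]
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return labels, centroids


def pair_scores(labels, truth):
    """Pair-counting Rand statistic and F1 over all pairs of items."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_pred = labels[i] == labels[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    rand = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return rand, f1


# Toy stand-ins for question vectors: two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
truth = [0, 0, 0, 1, 1, 1]

canopies = canopy_clustering(points, t1=2.0, t2=1.0)
initial = [points[center] for center, _ in canopies]  # canopy centers seed K-means
labels, centroids = kmeans(points, initial)
rand, f1 = pair_scores(labels, truth)
print(len(canopies), labels, rand, f1)
```

In this sketch the number of canopies also fixes K, which is one common reason to run canopy clustering first; whether the paper chooses K this way is not stated in the abstract.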




Data availability statement

The data is available upon request from the authors.


Funding

No funding was received.

Author information

Authors and Affiliations

Basic Sciences Department, Faculty of Science, The Hashemite University, P.O. Box 330127, Zarqa, 13133, Jordan
Marwah Alian
Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, United Arab Emirates
Ghazi Al-Naymat


Contributions

MA: Data gathering, writing paper, programming, results discussion. GA-N: Supervision, review of paper, advice for programming.

Corresponding author

Correspondence to Marwah Alian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.


About this article

Cite this article

Alian, M., Al-Naymat, G. Questions clustering using canopy-K-means and hierarchical-K-means clustering. Int. j. inf. tecnol. 14, 3793–3802 (2022). https://doi.org/10.1007/s41870-022-01012-w


Received: 18 January 2022
Accepted: 30 May 2022
Published: 22 June 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s41870-022-01012-w
