Skip to main content
Log in

Questions clustering using canopy-K-means and hierarchical-K-means clustering

  • Original Research
  • Published:
International Journal of Information Technology Aims and scope Submit manuscript

Abstract

In questions datasets, several questions could produce duplicates since they are similar questions due to the ability to write a question in different forms based on the flexibility of Natural Language. However, extracting relevant questions is time-consuming if it is performed manually. Therefore, the computational power of computers is necessary to group similar questions into clusters based on their semantic similarity but still the information included within a question may be insufficient to efficiently cluster the questions making it a challenging task. In this research, canopy clustering is employed as a previous step for K-means clustering, then it is compared to the Hierarchical Clustering approach. Quora questions dataset is used in the experiments to identify question pairs that are similar. In terms of F1 score and rand statistic measure, the results demonstrate that the Hierarchical-K-means approach provides better validity clustering measures than the Canopy-K-means approach. In addition to identifying matches, the Canopy approach serves with the top related questions that have the same intent in the same cluster in several canopies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

Data availability statement

The data is available upon request from the authors.

References

  1. Lalitha SY, Govardhan A (2015) Improved text clustering with neighbors. Int J Data Min Knowl Manag Process (IJDKP) 5(2):23–37

    Article  Google Scholar 

  2. Alian M, Awajan A, Al-Hasan A, Akuzhia R (2021) Building Arabic paraphrasing benchmark based on transformation rules. ACM Trans Asian Low-Resour Lang Inf Process 20(4):1–17

    Article  Google Scholar 

  3. Christen P (2012) Data matching: concepts and techniques for record link-age, entity resolution, and duplicate detection. Springer

    Book  Google Scholar 

  4. Alian M, Awajan A (2020) Factors affecting sentence similarity and paraphrasing identification. Int J Speech Technol 23:851–859

  5. Blooma MJ, Chua A, Goh D (2011) Quadripartite graph-based clustering of questions. In: Eighth International Conference on information technology: new generations,11:591–596

  6. Sharma L, Graesser L, Nangia N, Evci U (2019) Natural language understanding with the Quora question pairs dataset. arXiv

  7. Paranjpe D (2007) Clustering semantically similar and related questions. Stanford University, Research Report, https://nlp.stanford.edu/courses/cs224n/2007/fp/paranjpe.pdf

  8. Cătălina M, Olaru A, Florea AM (2011) Semantic clustering of questions Mocanu. AI-Mas group University POLITEHNICA of Bucharest, Research Project

    Google Scholar 

  9. Mishra RB, Modi NK, Shah RR (2014) Performance analysis of single and complete link during agglomerative clustering of question papers by tagging the questions and trend analysis using single link. In: 2014 IEEE International Conference on advanced communication control and computing technologies (lCACCCT), 11:616–618

  10. Suhaimi NS, Kamaliah SN, Arbin N, Othman Z (2015) Optimizing cluster of questions by using dynamic mutation in genetic algorithm. In: Third International Conference on artificial intelligence, modelling and simulation, 11:15–18

  11. Nguyen NV, Boucher A, Ogier J, Tabbone S (2010) Clusters-based relevance feedback for CBIR: a combination of query movement and query expansion. In: Computing and communication technologies, research, innovation, and vision for the future (RIVF), 11:1–6

  12. Kumar A, Ingle YS, Pande A, Dhule P (2014) Canopy clustering: a review on pre-clustering approach to K-means clustering. Int J Innov Adv Comput Sci 3(5):22–29

    Google Scholar 

  13. Irfan D, Xiaofei X, Shengchun D, He Z, Yunming Y (2009) S-Canopy: a feature-based clustering algorithm for supplier categorization. In: 4th IEEE Conference on industrial electronics and applications (ICIEA 2009), 11:677–681

  14. Liu Y et al (2020) An integrated retrieval framework for similar questions: word-semantic embedded label clustering—LDA with question life cycle. Inf Sci 537:227–245

    Article  MathSciNet  Google Scholar 

  15. Hoogeveen D, Verspoor KM, Baldwin T (2015) CQADupStack: a benchmark data set for community question-answering research. In: Australasian Document Computing Symposium (ADCS), Parramatta, NSW, Australia, 11:1–8

  16. Piernik M, Morzy T (2021) A study on using data clustering for feature extraction to improve the quality of classification. Knowl Inf Syst 63:1771–1805

    Article  Google Scholar 

  17. Cohen W, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: The Eighth ACM SIGKDD International Conference on knowledge discovery and data mining-ACM SIGKDD, Edmonton, 11:475–480

  18. McCallum A, Nigam K, Ungar LH (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: The Second Annual International Conference on knowledge discovery in data- ACM SIGKDD , Boston, 11:169–178

  19. Tan PN, Steinbach M, Karpatne A, Kumar V (2014) Introduction to data mining, 1st edn. Pearson Education Limited

    Google Scholar 

  20. Awad FH, Hamad MM (2022) Improved k-means clustering algorithm for big data based on distributed smartphoneneural engine processor. Electronics 11:883

    Article  Google Scholar 

  21. Jeon Y, Yoo J, Lee J, Yoon S (2017) NC-link: a new linkage method for efficient hierarchical clustering of large-scale data. IEEE Access 5:5594–5608

    Google Scholar 

  22. Gev CM, Vries S, Trotman A (2012) Document clustering evaluation: divergence from a random baseline. In: CoRR, abs, 2012, p. 1208.5654

  23. Alian M, Awajan A (2020) Paraphrasing identification techniques in English and Arabic Texts. In: The 11th International Conference on information and communication systems, Irbid, Jordan, 11:155–160

  24. Christen P, Goiser K (2007) "Quality and complexity measures for data linkage and deduplication. In: Hamilton HJ, Guillet FJ (eds) Quality measures in data mining, vol 43. Springer, pp 127–151

    Chapter  Google Scholar 

  25. Alian M, Al-Naymat G, Ramadan B (2020) Arabic real time entity resolution using inverted indexing. Lang Resour Eval 54:921–941

    Article  Google Scholar 

Download references

Funding

Funding information is not applicable/No funding was received.

Author information

Authors and Affiliations

Authors

Contributions

MA: Data gathering, writing paper, programming, results discussion. GA-N: Supervision, review of paper, advice for programming.

Corresponding author

Correspondence to Marwah Alian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Alian, M., Al-Naymat, G. Questions clustering using canopy-K-means and hierarchical-K-means clustering. Int. j. inf. tecnol. 14, 3793–3802 (2022). https://doi.org/10.1007/s41870-022-01012-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s41870-022-01012-w

Keywords

Navigation