
Extractive text summarization using clustering-based topic modeling


Text summarization is the process of condensing an input document into a shorter form while preserving its overall meaning. It is achieved primarily in two ways: abstractive and extractive. Extractive summarizers select a few of the best sentences from the input document, whereas abstractive methods may modify sentence structure or introduce new sentences. The proposed approach is an extractive text summarization technique in which we extend topic modeling so that it can be applied to multiple lower-level specialized entities (i.e., groups) embedded in a single document. Our goal is to overcome the lack of coherence found in existing summarization techniques. Topic modeling was initially proposed to model text data at the multi-document and word levels, without considering sentence modeling. It has subsequently been applied at the sentence level and used for document summarization, but with certain limitations: topic modeling does not perform as expected when applied to a single document at the sentence level. To address this shortcoming, we propose a summarization approach that operates at the level of the individual document and its clusters (instead of the sentence level). We aim to choose the best sentence from each group (containing sentences of the same kind) found in the given text. We select the most suitable topic by evaluating the probability distribution of the words and their respective topics at the cluster level. The method is evaluated on two standard datasets and shows significant performance gains over existing text summarization techniques. Compared with other text summarization techniques, the ROUGE metrics for automatic evaluation show a considerable improvement in the F-measure, precision, and recall of the generated summary. Furthermore, a manual evaluation demonstrates that the proposed approach outperforms current state-of-the-art text summarization approaches.
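The idea of grouping similar sentences and then picking one representative sentence per group can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain k-means clustering over bag-of-words vectors seeded with the first k sentences, and it replaces the topic model with a simple unigram word distribution per cluster, so each sentence is scored by the average probability of its words within its own cluster. The function names (`tokenize`, `cluster_sentences`, `summarize`), the stopword list, and the toy document are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "on", "for", "it", "as"}

def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_sentences(bags, k, iterations=10):
    """Plain k-means on cosine similarity; the first k sentences seed the centroids."""
    k = min(k, len(bags))
    centroids = [Counter(b) for b in bags[:k]]
    labels = [0] * len(bags)
    for _ in range(iterations):
        # Assign every sentence to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(bag, centroids[c])) for bag in bags]
        # Recompute each centroid as the summed word counts of its members
        # (cosine similarity is scale invariant, so a sum works like a mean).
        for c in range(k):
            members = [bags[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = sum(members, Counter())
    return labels

def summarize(sentences, k=2):
    """Return one sentence per cluster, kept in original document order."""
    bags = [Counter(tokenize(s)) for s in sentences]
    labels = cluster_sentences(bags, k)
    chosen = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        # Stand-in for a topic: the unigram distribution of the cluster.
        counts = sum((bags[i] for i in idx), Counter())
        total = sum(counts.values())

        def score(i):
            words = list(bags[i].elements())
            return sum(counts[w] / total for w in words) / len(words) if words else 0.0

        chosen.append(max(idx, key=score))
    return [sentences[i] for i in sorted(chosen)]

example = [
    "Cats are small domestic animals.",
    "Worried investors quickly sold shares as the stock market fell.",
    "Many cats and dogs are kept as domestic animals.",
    "The stock market fell sharply today.",
]
print(summarize(example, k=2))
```

On this toy input the sketch returns one sentence for the group about pets and one for the group about the stock market. A fuller version would swap the unigram distribution for a trained topic model and use a more robust clustering initialization.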

Data availability

Enquiries about data availability should be directed to the authors.





No funding was received for this work.

Author information


RCB and AG conceived this research and designed experiments. SR participated in editing and drafting the article. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ramesh Chandra Belwal.

Ethics declarations

Conflict of interest

There is no conflict of interest associated with this publication. The authors declare that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable. This article does not contain any studies with human participants or animals performed by the authors. Formal consent is not required.

Informed consent

Not applicable. No individual or personal data were used.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Belwal, R.C., Rai, S. & Gupta, A. Extractive text summarization using clustering-based topic modeling. Soft Comput 27, 3965–3982 (2023).


  • Extractive summarization
  • Topic modeling
  • Clustering
  • Semantic measure