An Efficient Training Dataset Generation Method for Extractive Text Summarization

  • Esther Hannah
  • Saswati Mukherjee
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 236)


This work presents a method for automatically generating a training dataset for text summarization using feature extraction. The goal is to design a dataset that helps a system summarize documents much as a human would. A document summary is a text produced from one or more source texts that conveys the important information in the originals. The proposed system comprises pre-processing, feature extraction, and training dataset generation. To implement the system, 50 test documents from DUC2002 are used. Each document is cleaned using pre-processing techniques: sentence segmentation, tokenization, stop-word removal, and word stemming. Eight important features are extracted for each sentence and converted into attributes of the training dataset. A high-quality training dataset is needed to achieve good summarization quality, and the proposed system aims to generate a well-defined training dataset that is sufficiently large and noise-free for text summarization. The training dataset uses a set of common features that can be applied to all subtasks of data mining. An initial subjective evaluation shows that our training dataset is effective and efficient, and that the performance of the system is promising.
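The pipeline described in the abstract (pre-processing, then per-sentence feature extraction into dataset attributes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the eight features, so the three computed here (position, relative length, title overlap) are hypothetical placeholders, and the stop-word list, the `crude_stem` suffix stripper, and all function names are likewise assumptions standing in for whatever the paper actually uses.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in",
              "and", "that", "it", "for", "on", "with"}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def crude_stem(word):
    """Stand-in for a real stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Tokenize, lowercase, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_rows(title, text):
    """Return one row of (hypothetical) feature attributes per sentence."""
    title_terms = set(preprocess(title))
    sentences = split_sentences(text)
    term_lists = [preprocess(s) for s in sentences]
    # Guard against division by zero when every sentence is empty after cleaning.
    max_len = max((len(t) for t in term_lists), default=0) or 1
    rows = []
    for i, terms in enumerate(term_lists):
        overlap = len(title_terms & set(terms))
        rows.append({
            "sentence_index": i,
            # Earlier sentences score higher (hypothetical position feature).
            "position": 1.0 - i / len(sentences),
            # Term count relative to the longest sentence in the document.
            "relative_length": len(terms) / max_len,
            # Fraction of title terms that also appear in the sentence.
            "title_overlap": overlap / len(title_terms) if title_terms else 0.0,
        })
    return rows
```

Calling `build_rows(title, text)` on a document yields a list of dictionaries, one per sentence, which could be written out as rows of a training table. Any labelling of sentences as summary or non-summary would be a separate step that the abstract does not describe.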


Keywords: Feature extraction, Dataset, Summarization, Pre-processing



Copyright information

© Springer India 2014

Authors and Affiliations

  1. Department of Computer Science, Anna University, Chennai, India
  2. Department of Information Science and Technology, College of Engineering, Anna University, Chennai, India
