Automatic summarization of scientific publications using a feature selection approach

International Journal on Digital Libraries

Abstract

Feature Maximization is a feature selection method that deals efficiently with textual data: it makes it possible to design systems that are language-agnostic, parameter-free and require no additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first test this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that, despite the unstructured nature of the corpus, our system performs above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of the SciSumm challenge, which is dedicated to the summarization of research publications. These publications are structured in sections, so our class-based approach is relevant. More specifically, we focus on the task that aims to summarize papers using the papers that cite them. We consider and evaluate several systems that use our approach with specific bag-of-words representations. Furthermore, in these systems, we also evaluate cosine and graph-based distances for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than those known so far and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.

Notes

  1. The 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016), http://wing.comp.nus.edu.sg/cl-scisumm2016/

  2. In this paper, we always consider only one reference summary, although there may be several, created for example by distinct human annotators.

  3. The choice of the weighting scheme is not really constrained by the approach, apart from the requirement to produce positive values. Such a scheme is supposed to capture the significance (i.e., semantics and importance) of the feature for the data. Feature recall is a scale-independent measure but feature predominance is not. We have, however, shown experimentally that the F-measure, which is a combination of these two measures, is only weakly influenced by feature scaling. Nevertheless, to guarantee fully scale-independent behavior for this measure, data may be standardized.

  4. The Document Understanding Conference.

  5. Query-focused summarization.
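
Note 3 contrasts the scale behavior of feature recall and feature predominance. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes definitions that this page does not state: feature recall as the share of a feature's total weight that falls in a class, feature predominance as the share of a class's total weight carried by the feature, and the F-measure as their harmonic mean. The function names, the toy weight matrix, and the class labels are all hypothetical.

```python
def feature_recall(weights, labels, feature, cls):
    """Assumed definition: share of the feature's total weight found in class `cls`."""
    in_class = sum(row[feature] for row, lab in zip(weights, labels) if lab == cls)
    total = sum(row[feature] for row in weights)
    return in_class / total if total else 0.0


def feature_predominance(weights, labels, feature, cls):
    """Assumed definition: share of the total weight of class `cls` carried by the feature."""
    rows = [row for row, lab in zip(weights, labels) if lab == cls]
    in_class = sum(row[feature] for row in rows)
    total = sum(sum(row) for row in rows)
    return in_class / total if total else 0.0


def feature_f_measure(weights, labels, feature, cls):
    """Assumed combination: harmonic mean of feature recall and feature predominance."""
    r = feature_recall(weights, labels, feature, cls)
    p = feature_predominance(weights, labels, feature, cls)
    return 2 * r * p / (r + p) if (r + p) else 0.0


def scale_feature(weights, feature, factor):
    """Return a copy of the matrix with one feature column multiplied by `factor`."""
    return [[w * factor if j == feature else w for j, w in enumerate(row)]
            for row in weights]


# Hypothetical toy data: 4 sentences (rows), 3 positive-valued features (columns),
# each sentence labeled with the section (class) it belongs to.
weights = [
    [2.0, 1.0, 0.0],
    [3.0, 0.0, 1.0],
    [0.0, 4.0, 1.0],
    [1.0, 3.0, 0.0],
]
labels = ["intro", "intro", "method", "method"]

# Rescale feature 0 by a factor of 10 and compare both measures before and after.
scaled = scale_feature(weights, 0, 10.0)
recall_before = feature_recall(weights, labels, 0, "intro")
recall_after = feature_recall(scaled, labels, 0, "intro")
predominance_before = feature_predominance(weights, labels, 0, "intro")
predominance_after = feature_predominance(scaled, labels, 0, "intro")
```

Under these assumed definitions, rescaling a single feature multiplies the numerator and denominator of its recall by the same factor, so recall is unchanged. Predominance shifts because its denominator mixes the rescaled feature with the untouched ones.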

Corresponding author

Correspondence to Nicolas Dugué.

Cite this article

Al Saied, H., Dugué, N. & Lamirel, JC. Automatic summarization of scientific publications using a feature selection approach. Int J Digit Libr 19, 203–215 (2018). https://doi.org/10.1007/s00799-017-0214-x

  • DOI: https://doi.org/10.1007/s00799-017-0214-x
