Abstract
Feature Maximization is a feature selection method that deals efficiently with textual data: it makes it possible to design systems that are language-agnostic, parameter-free and require no additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first experiment with this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that, despite the unstructured nature of the corpus, our system performs above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of the SciSumm challenge, which is dedicated to the summarization of research publications. These publications are structured in sections, and our class-based approach is thus relevant. We focus more specifically on the task that aims to summarize papers using those that refer to them. We consider and evaluate several systems that apply our approach to specific bags of words. Furthermore, in these systems, we also evaluate cosine and graph-based distances for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than those known so far and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.
Notes
The 2nd Computational Linguistics Scientific Document Summarization Shared Task (CL-SciSumm 2016), http://wing.comp.nus.edu.sg/cl-scisumm2016/
In this paper, we always consider only one reference summary, although there may be several, created for example by distinct human annotators.
The choice of the weighting scheme is not really constrained by the approach, apart from the requirement that it produce positive values. Such a scheme is supposed to capture the significance (i.e., semantics and importance) of the feature for the data. Feature recall is a scale-independent measure, but feature predominance is not. We have, however, shown experimentally that the F-measure, which combines these two measures, is only weakly influenced by feature scaling. Nevertheless, to guarantee fully scale-independent behavior for this measure, data may be standardized.
The Document Understanding Conference.
Query-focused summarization.
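The note on weighting schemes refers to feature recall, feature predominance and their F-measure without restating them here. The following is a minimal Python sketch, not the authors' implementation, under assumed definitions: recall is the weight of a feature in a class divided by its total weight over all classes, predominance is the weight of a feature in a class divided by the total weight of all features in that class, and the F-measure is their harmonic mean. The function name `feature_scores`, the toy weights and the class labels are all hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, List[Tuple[float, float, float]]]


def feature_scores(weights: List[List[float]], labels: List[str]) -> Scores:
    """Return, per class, a (recall, predominance, F-measure) triple per feature.

    Assumed definitions (not taken from the text above):
      recall       = weight of feature in class / weight of feature in all classes
      predominance = weight of feature in class / weight of all features in class
      F-measure    = harmonic mean of recall and predominance
    """
    n_features = len(weights[0])
    classes = sorted(set(labels))

    # class_sum[c][f]: total weight of feature f over the items of class c
    class_sum = {c: [0.0] * n_features for c in classes}
    for row, c in zip(weights, labels):
        for f, w in enumerate(row):
            class_sum[c][f] += w

    # Total weight of each feature over all classes
    feature_total = [sum(class_sum[c][f] for c in classes) for f in range(n_features)]

    scores: Scores = {}
    for c in classes:
        class_total = sum(class_sum[c])
        scores[c] = []
        for f in range(n_features):
            s = class_sum[c][f]
            recall = s / feature_total[f] if feature_total[f] else 0.0
            predominance = s / class_total if class_total else 0.0
            denom = recall + predominance
            f_measure = 2 * recall * predominance / denom if denom else 0.0
            scores[c].append((recall, predominance, f_measure))
    return scores


# Hypothetical toy data: 3 sentences, 3 features, 2 classes (e.g. paper sections)
weights = [
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
]
labels = ["intro", "intro", "method"]
scores = feature_scores(weights, labels)

# Rescale feature 0 by a constant factor and recompute the scores
scaled = [[row[0] * 10.0] + row[1:] for row in weights]
scaled_scores = feature_scores(scaled, labels)

for c in scores:
    for f, (before, after) in enumerate(zip(scores[c], scaled_scores[c])):
        print(c, f, before, after)
```

Under these assumed definitions, rescaling one feature leaves its recall unchanged in every class but shifts predominance values, which is the asymmetry the note describes. The toy example is too small to say anything about how strongly the F-measure reacts to scaling in practice.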
Cite this article
Al Saied, H., Dugué, N. & Lamirel, JC. Automatic summarization of scientific publications using a feature selection approach. Int J Digit Libr 19, 203–215 (2018). https://doi.org/10.1007/s00799-017-0214-x
DOI: https://doi.org/10.1007/s00799-017-0214-x