Abstract
A document known as notitia criminis (NC) is used by the Brazilian Federal Police as the starting point of a criminal investigation. An NC reports a summary of investigative activities and therefore contains all relevant information about an alleged crime. To manage an inquiry and correlate similar investigations, the Federal Police usually needs to extract essential information from an NC document. Manual extraction (reading and understanding the entire content) can be mentally exhausting because of the size and complexity of the documents. In this light, natural language processing (NLP) techniques are commonly used for automatic information extraction from textual documents. Deep neural networks have been successfully applied to many different NLP tasks. One neural network model that improved results across a wide range of NLP tasks is BERT (Bidirectional Encoder Representations from Transformers). In this article, we propose approaches based on the BERT model to extract relevant information from textual documents using automatic text summarization techniques. In other words, we aim to analyze the feasibility of using the BERT model to extract and synthesize the most essential information of an NC document. We evaluate the performance of the proposed approaches using two real-world datasets: the Federal Police dataset (a private domain dataset) and the Brazilian WikiHow dataset (a public domain dataset). Experimental results using different variants of the ROUGE metric show that our approaches can significantly increase extractive text summarization effectiveness without sacrificing efficiency.
Notes
Term Frequency-Inverse Document Frequency (TF-IDF) is a technique for text vectorization based on the bag-of-words (BoW) model.
The elbow method is a common heuristic in mathematical optimization. In clustering, it means choosing the number of clusters at the point where adding another cluster no longer provides a substantially better model of the data.
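The two notes above can be illustrated with a minimal Python sketch. It is not the implementation used in the article: the function names `tf_idf` and `elbow`, the whitespace tokenization, the unsmoothed logarithmic IDF, and the 10% relative-gain threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumptions (illustrative only): lowercase whitespace tokenization,
    term frequency normalized by document length, and unsmoothed
    idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def elbow(inertias, min_gain=0.10):
    """Pick a number of clusters from a list of clustering costs.

    inertias[i] is the cost (e.g. within-cluster sum of squares) obtained
    with k = i + 1 clusters. Returns the smallest k for which adding one
    more cluster reduces the cost by less than min_gain (relative).
    The threshold is a hypothetical choice, not taken from the article.
    """
    for i in range(len(inertias) - 1):
        if inertias[i] == 0:
            return i + 1  # cost cannot improve any further
        gain = (inertias[i] - inertias[i + 1]) / inertias[i]
        if gain < min_gain:
            return i + 1
    return len(inertias)
```

With the two toy documents "police report crime" and "police report theft", the terms shared by both documents receive weight 0, while "crime" and "theft" receive a positive weight. For costs `[100, 40, 15, 14, 13.5]`, `elbow` returns 3, because moving from three to four clusters reduces the cost by less than 10%.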
Acknowledgements
This work was supported by the Federal Police and the Federal University of Campina Grande under the Epol project 08200.01128/2019-72. We thank them for providing all the data used to carry out this research.
About this article
Cite this article
Barros, T.S., Pires, C.E.S. & Nascimento, D.C. Leveraging BERT for extractive text summarization on federal police documents. Knowl Inf Syst 65, 4873–4903 (2023). https://doi.org/10.1007/s10115-023-01912-8
DOI: https://doi.org/10.1007/s10115-023-01912-8