Abstract
The amount of news generated on the internet has increased significantly in recent years. Consequently, text data has attracted attention from industry, government, academia, and the financial market. This information is potentially valuable for assisting domain experts in decision making, and machine learning applications that exploit it have become widely available in several areas of knowledge. However, for supervised learning tasks, obtaining annotated texts in sufficient quantity and quality is a recurring problem. This work proposes a time-series-driven approach to labeling chronologically arranged documents. Our proposal categorizes short texts from a particular domain according to the level and trend patterns of a given time series. We use the resulting weak labels with the understanding that they are imperfect but still useful for building predictive text models. Documents and agribusiness commodity price series were used to assess performance in four classification scenarios. The experimental evaluation considered nine textual representations and different learning paradigms. Neural language-based models achieved better classification performance than traditional ones. The results indicate that the proposed approach can be an alternative for automatically labeling large volumes of news.
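The labeling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `weak_labels`, the median threshold for the level label, the next-observation comparison for the trend label, and the toy prices are all assumptions chosen only to make the notion of level and trend labels concrete.

```python
from datetime import date
from statistics import median


def weak_labels(doc_dates, series):
    """Assign a (level, trend) weak label to each document date.

    series: chronologically sorted list of (date, price) tuples.
    Assumed rules (illustrative only, not taken from the paper):
      level -> "high" if the price on the document's date is above the
               series median, otherwise "low"
      trend -> "up" if the next observation is higher, otherwise "down"
    Documents with no matching or no following observation get None.
    """
    prices = dict(series)
    ordered = [d for d, _ in series]
    mid = median(prices.values())
    labels = []
    for d in doc_dates:
        if d not in prices:
            labels.append(None)  # no price observed on that date
            continue
        i = ordered.index(d)
        if i + 1 >= len(ordered):
            labels.append(None)  # no later observation to define a trend
            continue
        level = "high" if prices[d] > mid else "low"
        trend = "up" if prices[ordered[i + 1]] > prices[d] else "down"
        labels.append((level, trend))
    return labels


# Toy commodity price series (hypothetical values).
series = [
    (date(2022, 1, 3), 100.0),
    (date(2022, 1, 4), 104.0),
    (date(2022, 1, 5), 101.0),
    (date(2022, 1, 6), 98.0),
]
# Publication dates of four hypothetical short texts.
doc_dates = [date(2022, 1, 3), date(2022, 1, 4), date(2022, 1, 6), date(2022, 1, 8)]

labels = weak_labels(doc_dates, series)
print(labels)
```

Labels produced this way are noisy by construction, which matches the abstract's description of them as imperfect but still useful for training text classifiers.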
Acknowledgements
This work was carried out at the Center for Artificial Intelligence (C4AI-USP) and partially supported by the São Paulo Research Foundation (FAPESP) (grant #2019/07665-4) and the IBM Corporation. The authors thank FAPESP (process 2019/25010-5) and the National Council for Scientific and Technological Development (CNPq) (process 309575/2021-4). The corresponding author thanks the Minas Gerais State Research Support Foundation (FAPEMIG) (process PCRH BPG-00054-210).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Filho, I.J.R., Martins, L.H.D., Parmezan, A.R.S., Marcacini, R.M., Rezende, S.O. (2022). Sequential Short-Text Classification from Multiple Textual Representations with Weak Supervision. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science(), vol 13653. Springer, Cham. https://doi.org/10.1007/978-3-031-21686-2_12
DOI: https://doi.org/10.1007/978-3-031-21686-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21685-5
Online ISBN: 978-3-031-21686-2
eBook Packages: Computer Science, Computer Science (R0)