Abstract
Learning knowledge from text is becoming increasingly important as the amount of unstructured content on the Web rapidly grows. Despite recent breakthroughs in natural language understanding, explaining phenomena from textual documents remains a difficult and poorly addressed problem. Moreover, current NLP solutions often require labeled data, are domain-dependent, and rely on black-box models. In this paper, we introduce POIROT, a new descriptive text mining methodology for phenomena explanation from document corpora. POIROT is designed to provide accurate and interpretable results in unsupervised settings and to quantify them by their statistical significance. We evaluated POIROT on a medical case study, with the aim of learning the “voice of patients” from short social posts. Taking Esophageal Achalasia as a reference, we automatically derived scientific correlations with a 79% F1 score and built useful explanations of the patients’ viewpoint on topics such as symptoms, treatments, drugs, and foods. We make the source code and experiment details publicly available (https://github.com/unibodatascience/POIROT).
Notes
1. NLU Kaggle competition on COVID-19. https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.
2. Folding is the process of adding new vectors to a space after its construction, without rebuilding it.
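The folding operation defined in note 2 can be illustrated with the standard LSA "folding-in" projection. This is only a minimal sketch under stated assumptions: the page does not say how POIROT itself implements folding, and the function names (`build_space`, `fold_in_document`) and the toy term-document matrix are hypothetical. Given a truncated SVD of a term-document matrix, a new document vector d is mapped to d^T U_k S_k^{-1}, so the existing space is reused rather than recomputed.

```python
import numpy as np


def build_space(term_doc, k):
    """Truncated SVD of a term-document matrix (rows: terms, columns: documents)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k strongest dimensions: term basis, singular values, document coordinates.
    return u[:, :k], s[:k], vt[:k, :].T


def fold_in_document(doc_vector, u_k, s_k):
    """Project a new document (term weights) into the existing k-dimensional space.

    Standard LSA folding-in: d_hat = d^T U_k S_k^{-1}. The SVD is not recomputed.
    """
    return (doc_vector @ u_k) / s_k


# Hypothetical toy corpus: 5 terms x 4 documents.
term_doc = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 0.0, 1.0],
])

u_k, s_k, docs_k = build_space(term_doc, k=2)

# A document that was not part of the original matrix.
new_doc = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
folded = fold_in_document(new_doc, u_k, s_k)
```

Folding in a document that was already in the matrix returns its original coordinates, which is a convenient sanity check. Because the basis is not updated, folded vectors can drift from what a full rebuild would give once many new documents have been added.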
References
Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.I.: Applying data mining techniques for descriptive phrase extraction in digital document collections. In: IEEE ADL 1998, pp. 2–11 (1998)
Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York (2003)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017)
Bos, J.: A survey of computational semantics: Representation, inference and knowledge in wide-coverage text understanding. Lang. Linguistics Compass 5(6), 336–366 (2011). https://doi.org/10.1111/j.1749-818X.2011.00284.x
Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Carbonaro, A.: Interlinking e-learning resources and the web of data for improving student experience. J. e-Learn. Knowl. Soc. 8(2), 33–44 (2012)
Carbonaro, A., Piccinini, F., Reda, R.: Integrating heterogeneous data of healthcare devices to enable domain data management. J. e-Learn. Knowl. Soc. 14 (2018)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262 (2004)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: KDIR 2014 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Rome, Italy, pp. 107–116. SciTePress (2014). https://doi.org/10.5220/0005087801070116
Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Cross-organism learning method to discover new gene functionalities. Comput. Methods Programs Biomed. 126, 20–34 (2016). https://doi.org/10.1016/j.cmpb.2015.12.002
Domeniconi, G., Moro, G., Pagliarani, A., Pasini, K., Pasolini, R.: Job recommendation from semantic similarity of LinkedIn users’ skills. In: Marsico, M.D., di Baja, G.S., Fred, A.L.N. (eds.) Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2016, Rome, Italy, 24–26 February 2016, pp. 270–277. SciTePress (2016). https://doi.org/10.5220/0005702302700277
Domeniconi, G., Moro, G., Pagliarani, A., Pasolini, R.: On deep learning in cross-domain sentiment classification. In: Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - (Volume 1), Funchal, Madeira, Portugal, 2017, pp. 50–60. SciTePress (2017). https://doi.org/10.5220/0006488100500060
Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Cross-domain text classification through iterative refining of target categories representations. In: Fred, A.L.N., Filipe, J. (eds.) KDIR 2014 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Rome, Italy, 21–24 October 2014, pp. 31–42. SciTePress (2014). https://doi.org/10.5220/0005069400310042
Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Iterative refining of category profiles for nearest centroid cross-domain text classification. In: Fred, A., Dietz, J.L.G., Aveiro, D., Liu, K., Filipe, J. (eds.) IC3K 2014. CCIS, vol. 553, pp. 50–67. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25840-9_4
Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: A comparison of term weighting schemes for text classification and sentiment analysis with a supervised variant of tf.idf. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds.) DATA 2015. CCIS, vol. 584, pp. 39–58. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30162-4_4
Domeniconi, G., Semertzidis, K., Lopez, V., Daly, E.M., Kotoulas, S., et al.: A novel method for unsupervised and supervised conversational message thread detection. In: DATA, pp. 43–54 (2016)
Domeniconi, G., Semertzidis, K., Moro, G., Lopez, V., Kotoulas, S., Daly, E.M.: Identifying conversational message threads by integrating classification and data clustering. In: Francalanci, C., Helfert, M. (eds.) DATA 2016. CCIS, vol. 737, pp. 25–46. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62911-7_2
Frisoni, G., Moro, G., Carbonaro, A.: Learning interpretable and statistically significant knowledge from unlabeled corpora of social text messages: a novel methodology of descriptive text mining. In: Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 121–132. INSTICC, SciTePress (2020). https://doi.org/10.5220/0009892001210132
Frisoni, G., Moro, G., Carbonaro, A.: Unsupervised descriptive text mining for knowledge graph learning. In: Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, pp. 316–324. INSTICC, SciTePress (2020). https://doi.org/10.5220/0010153603160324
Girolami, M., Kabán, A.: On an equivalence between PLSI and LDA. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 433–434 (2003)
Gunning, D.: Explainable Artificial Intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017)
Gyawali, B., Shimorina, A., Gardent, C., Cruz-Lara, S., Mahfoudh, M.: Mapping natural language to description logic. In: Blomqvist, E., Maynard, D., Gangemi, A., Hoekstra, R., Hitzler, P., Hartig, O. (eds.) ESWC 2017, Part I. LNCS, vol. 10249, pp. 273–288. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58068-5_17
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Hofmann, T.: Probabilistic latent semantic analysis. arXiv preprint arXiv:1301.6705 (2013)
Jia, R., Liang, P.: Adversarial examples for evaluating reading comprehension systems. arXiv:1707.07328 (2017)
Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104(2), 211 (1997)
Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)
Li, J., Sun, A., Han, J., et al.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. (2020)
Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 415–463. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_13
Liu, H., Yin, Q., Wang, W.Y.: Towards explainable NLP: a generative explanation framework for text classification. arXiv:1811.00196 (2018)
Liu, T., Moore, A.W., Yang, K., Gray, A.G.: An investigation of practical approximate nearest neighbor algorithms. In: Advances in Neural Information Processing Systems, pp. 825–832 (2005)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Mathews, S.M.: Explainable artificial intelligence applications in NLP, biomedical, and malware classification: a literature review. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) CompCom 2019. AISC, vol. 998, pp. 1269–1292. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22868-2_90
Microsoft: Turing-NLG: A 17-billion parameter language model by Microsoft, February 2020
Moro, G., Pagliarani, A., Pasolini, R., Sartori, C.: Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2018, vol. 1, pp. 125–136. KDIR, Seville (2018). https://doi.org/10.5220/0007239101270138
Pagliarani, A., Moro, G., Pasolini, R., Domeniconi, G.: Transfer learning in sentiment classification with deep neural networks. In: Fred, A., et al. (eds.) IC3K 2017. CCIS, vol. 976, pp. 3–25. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15640-4_1
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., et al.: The limitations of deep learning in adversarial settings. In: EuroS&P, pp. 372–387 (2016)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019)
Ren, X., He, W., Qu, M., et al.: AFET: automatic fine-grained entity typing by hierarchical partial-label embedding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1369–1378 (2016)
Riccucci, S., Carbonaro, A., Casadei, G.: Knowledge acquisition in intelligent tutoring system: a data mining approach. In: Gelbukh, A., Kuri Morales, Á.F. (eds.) MICAI 2007. LNCS (LNAI), vol. 4827, pp. 1195–1205. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76631-5_114
Safavian, S.R., Landgrebe, D.A.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21, 660–674 (1991)
Sarlos, T.: Improved approximation algorithms for large matrices via random projections. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 143–152. IEEE (2006)
Suzuki, R., Shimodaira, H.: Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22(12), 1540–1542 (2006)
Weiss, S.M., Indurkhya, N., Zhang, T.: Fundamentals of Predictive Text Mining. Springer, London (2015). https://doi.org/10.1007/978-1-4471-6750-1
Acknowledgments
Gianluca Moro developed this research and is the author of this methodology and of its mathematical and algorithmic solutions, preliminarily reported in [20, 21]. The methodology has been part of his text mining course at the University of Bologna since the 2014/15 academic year and has been applied to discover the factors that contribute to aircraft accidents (https://unibodatascience.github.io/textmining/) from the raw textual reports collected by the US National Transportation Safety Board (NTSB).
Giacomo Frisoni successfully applied this methodology to the real medical case study included in this work and in [20, 21], and contributed to this work by adding the LDA and pLSA techniques and by applying and comparing them in the case study.
We thank Cristina Lanni (https://www.researchgate.net/profile/Cristina_Lanni/research) (Researcher at the University of Pavia) and Celeste Napolitano (President of AMAE and National Secretary of the Italian Society of Narrative Medicine) for their valuable help in building the dataset and in creating the Achalasia gold standards.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Frisoni, G., Moro, G. (2021). Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_14
DOI: https://doi.org/10.1007/978-3-030-83014-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83013-7
Online ISBN: 978-3-030-83014-4
eBook Packages: Computer Science, Computer Science (R0)