Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge

  • Conference paper
Data Management Technologies and Applications (DATA 2020)

Abstract

Learning knowledge from text is becoming increasingly important as the amount of unstructured content on the Web rapidly grows. Despite recent breakthroughs in natural language understanding, explaining phenomena from textual documents remains a difficult and poorly addressed problem. Additionally, current NLP solutions often require labeled data, are domain-dependent, and are based on black-box models. In this paper, we introduce POIROT, a new descriptive text mining methodology for phenomena explanation from document corpora. POIROT is designed to provide accurate and interpretable results in unsupervised settings, quantifying them based on their statistical significance. We evaluated POIROT on a medical case study, with the aim of learning the “voice of patients” from short social posts. Taking Esophageal Achalasia as a reference, we automatically derived scientific correlations with a 79% F1 score and built useful explanations of the patients’ viewpoint on topics such as symptoms, treatments, drugs, and foods. We make the source code and experiment details publicly available (https://github.com/unibodatascience/POIROT).

Notes

  1. NLU Kaggle competition on COVID-19. https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.

  2. Folding is the process of adding new vectors to a space after its construction, without rebuilding it.

  3. https://www.amae.it/.

  4. https://www.orpha.net/consor/cgi-bin/SupportGroup_Search.php?lng=EN&data_id=106412.

  5. https://www.facebook.com/groups/36705181245/.

  6. https://www.textrazor.com/.

  7. https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon.

  8. https://github.com/disi-unibo-nlu/POIROT.

References

  1. Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.I.: Applying data mining techniques for descriptive phrase extraction in digital document collections. In: IEEE ADL 1998, pp. 2–11 (1998)

  2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York (2003)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017)

  5. Bos, J.: A survey of computational semantics: Representation, inference and knowledge in wide-coverage text understanding. Lang. Linguistics Compass 5(6), 336–366 (2011). https://doi.org/10.1111/j.1749-818X.2011.00284.x

  6. Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)

  7. Carbonaro, A.: Interlinking e-learning resources and the web of data for improving student experience. J. e-Learn. Knowl. Soc. 8(2), 33–44 (2012)

  8. Carbonaro, A., Piccinini, F., Reda, R.: Integrating heterogeneous data of healthcare devices to enable domain data management. J. e-Learn. Knowl. Soc. 14 (2018)

  9. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262 (2004)

  10. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805

  11. Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Discovering new gene functionalities from random perturbations of known gene ontological annotations. In: KDIR 2014 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Rome, Italy, pp. 107–116. SciTePress (2014). https://doi.org/10.5220/0005087801070116

  12. Domeniconi, G., Masseroli, M., Moro, G., Pinoli, P.: Cross-organism learning method to discover new gene functionalities. Comput. Methods Programs Biomed. 126, 20–34 (2016). https://doi.org/10.1016/j.cmpb.2015.12.002

  13. Domeniconi, G., Moro, G., Pagliarani, A., Pasini, K., Pasolini, R.: Job recommendation from semantic similarity of LinkedIn users’ skills. In: Marsico, M.D., di Baja, G.S., Fred, A.L.N. (eds.) Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, ICPRAM 2016, Rome, Italy, 24–26 February 2016, pp. 270–277. SciTePress (2016). https://doi.org/10.5220/0005702302700277

  14. Domeniconi, G., Moro, G., Pagliarani, A., Pasolini, R.: On deep learning in cross-domain sentiment classification. In: Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - (Volume 1), Funchal, Madeira, Portugal, 2017, pp. 50–60. SciTePress (2017). https://doi.org/10.5220/0006488100500060

  15. Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Cross-domain text classification through iterative refining of target categories representations. In: Fred, A.L.N., Filipe, J. (eds.) KDIR 2014 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Rome, Italy, 21–24 October 2014, pp. 31–42. SciTePress (2014). https://doi.org/10.5220/0005069400310042

  16. Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: Iterative refining of category profiles for nearest centroid cross-domain text classification. In: Fred, A., Dietz, J.L.G., Aveiro, D., Liu, K., Filipe, J. (eds.) IC3K 2014. CCIS, vol. 553, pp. 50–67. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25840-9_4

  17. Domeniconi, G., Moro, G., Pasolini, R., Sartori, C.: A comparison of term weighting schemes for text classification and sentiment analysis with a supervised variant of tf.idf. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds.) DATA 2015. CCIS, vol. 584, pp. 39–58. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30162-4_4

  18. Domeniconi, G., Semertzidis, K., Lopez, V., Daly, E.M., Kotoulas, S., et al.: A novel method for unsupervised and supervised conversational message thread detection. In: DATA, pp. 43–54 (2016)

  19. Domeniconi, G., Semertzidis, K., Moro, G., Lopez, V., Kotoulas, S., Daly, E.M.: Identifying conversational message threads by integrating classification and data clustering. In: Francalanci, C., Helfert, M. (eds.) DATA 2016. CCIS, vol. 737, pp. 25–46. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62911-7_2

  20. Frisoni, G., Moro, G., Carbonaro, A.: Learning interpretable and statistically significant knowledge from unlabeled corpora of social text messages: a novel methodology of descriptive text mining. In: Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 121–132. INSTICC, SciTePress (2020). https://doi.org/10.5220/0009892001210132

  21. Frisoni, G., Moro, G., Carbonaro, A.: Unsupervised descriptive text mining for knowledge graph learning. In: Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, pp. 316–324. INSTICC, SciTePress (2020). https://doi.org/10.5220/0010153603160324

  22. Girolami, M., Kabán, A.: On an equivalence between PLSI and LDA. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 433–434 (2003)

  23. Gunning, D.: Explainable Artificial Intelligence (XAI). Defense Advanced Research Projects Agency (DARPA) (2017)

  24. Gyawali, B., Shimorina, A., Gardent, C., Cruz-Lara, S., Mahfoudh, M.: Mapping natural language to description logic. In: Blomqvist, E., Maynard, D., Gangemi, A., Hoekstra, R., Hitzler, P., Hartig, O. (eds.) ESWC 2017, Part I. LNCS, vol. 10249, pp. 273–288. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58068-5_17

  25. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)

  26. Hofmann, T.: Probabilistic latent semantic analysis. arXiv preprint arXiv:1301.6705 (2013)

  27. Jia, R., Liang, P.: Adversarial examples for evaluating reading comprehension systems. arXiv:1707.07328 (2017)

  28. Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104(2), 211 (1997)

  29. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)

  30. Li, J., Sun, A., Han, J., et al.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. (2020)

  31. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 415–463. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_13

  32. Liu, H., Yin, Q., Wang, W.Y.: Towards explainable NLP: a generative explanation framework for text classification. arXiv:1811.00196 (2018)

  33. Liu, T., Moore, A.W., Yang, K., Gray, A.G.: An investigation of practical approximate nearest neighbor algorithms. In: Advances in Neural Information Processing Systems, pp. 825–832 (2005)

  34. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  35. Mathews, S.M.: Explainable artificial intelligence applications in NLP, biomedical, and malware classification: a literature review. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) CompCom 2019. AISC, vol. 998, pp. 1269–1292. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22868-2_90

  36. Microsoft: Turing-NLG: A 17-billion parameter language model by Microsoft, February 2020

  37. Moro, G., Pagliarani, A., Pasolini, R., Sartori, C.: Cross-domain & in-domain sentiment analysis with memory-based deep neural networks. In: Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2018, vol. 1, pp. 125–136. KDIR, Seville (2018). https://doi.org/10.5220/0007239101270138

  38. Pagliarani, A., Moro, G., Pasolini, R., Domeniconi, G.: Transfer learning in sentiment classification with deep neural networks. In: Fred, A., et al. (eds.) IC3K 2017. CCIS, vol. 976, pp. 3–25. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15640-4_1

  39. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., et al.: The limitations of deep learning in adversarial settings. In: EuroS&P, pp. 372–387 (2016)

  40. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)

  41. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 (2019)

  42. Ren, X., He, W., Qu, M., et al.: AFET: automatic fine-grained entity typing by hierarchical partial-label embedding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1369–1378 (2016)

  43. Riccucci, S., Carbonaro, A., Casadei, G.: Knowledge acquisition in intelligent tutoring system: a data mining approach. In: Gelbukh, A., Kuri Morales, Á.F. (eds.) MICAI 2007. LNCS (LNAI), vol. 4827, pp. 1195–1205. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76631-5_114

  44. Safavian, S.R., Landgrebe, D.A.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21, 660–674 (1991)

  45. Sarlos, T.: Improved approximation algorithms for large matrices via random projections. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 143–152. IEEE (2006)

  46. Suzuki, R., Shimodaira, H.: Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22(12), 1540–1542 (2006)

  47. Weiss, S.M., Indurkhya, N., Zhang, T.: Fundamentals of Predictive Text Mining. Springer, London (2015). https://doi.org/10.1007/978-1-4471-6750-1

Acknowledgments

Gianluca Moro developed this research and is the author of the methodology and of its mathematical and algorithmic solutions, which were preliminarily reported in [20, 21]. These solutions have been part of his text mining course at the University of Bologna since the 2014/15 academic year and have been applied to discover the factors that contribute to aircraft accidents (https://unibodatascience.github.io/textmining/) from the raw textual reports collected by the US National Transportation Safety Board (NTSB).

Giacomo Frisoni applied this methodology to the real medical case study included in this work and in [20, 21]. In this work, he also added the LDA and pLSA techniques and applied and compared them in the case study.

We thank Cristina Lanni (https://www.researchgate.net/profile/Cristina_Lanni/research) (Researcher at the University of Pavia) and Celeste Napolitano (President of AMAE and National Secretary of the Italian Society of Narrative Medicine) for their valuable help in building the dataset and for taking part in the creation of the Achalasia gold standards.

Corresponding author

Correspondence to Gianluca Moro.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Frisoni, G., Moro, G. (2021). Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge. In: Hammoudi, S., Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2020. Communications in Computer and Information Science, vol 1446. Springer, Cham. https://doi.org/10.1007/978-3-030-83014-4_14

  • DOI: https://doi.org/10.1007/978-3-030-83014-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83013-7

  • Online ISBN: 978-3-030-83014-4

  • eBook Packages: Computer Science, Computer Science (R0)
