Text Segmentation with Topic Modeling and Entity Coherence

  • Adebayo Kolawole John
  • Luigi Di Caro
  • Guido Boella
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 552)


This paper describes a system that uses entity and topic coherence to improve Text Segmentation (TS) accuracy. First, the Latent Dirichlet Allocation (LDA) algorithm was used to obtain topics for the sentences in a document. We then performed entity mapping across a window of sentences to discover how entities transition from one sentence to the next. The information obtained was used to support our LDA-based boundary detection by adjusting the detected boundaries. We report the significance of the entity coherence approach as well as the superiority of our algorithm over existing works.
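The abstract does not give the details of the boundary adjustment step, so the following is only a minimal sketch of the general idea, under stated assumptions: each sentence already has a single topic ID (for example, its most probable LDA topic) and a set of extracted entities. The function names, the one-sentence adjustment range, and the overlap score are all hypothetical and are not taken from the paper.

```python
from typing import List, Set


def topic_boundaries(topics: List[int]) -> List[int]:
    """Positions i where sentence i has a different topic than sentence i - 1."""
    return [i for i in range(1, len(topics)) if topics[i] != topics[i - 1]]


def entity_overlap(entities: List[Set[str]], pos: int, window: int) -> int:
    """Entities shared by the `window` sentences before `pos` and the `window` from `pos` on."""
    left = set().union(*entities[max(0, pos - window):pos])
    right = set().union(*entities[pos:pos + window])
    return len(left & right)


def adjust_boundaries(topics: List[int], entities: List[Set[str]], window: int = 2) -> List[int]:
    """Shift each topic-based boundary by at most one sentence (an assumption)
    to the nearby position where the fewest entities carry over."""
    n = len(topics)
    adjusted = set()
    for b in topic_boundaries(topics):
        candidates = [p for p in (b - 1, b, b + 1) if 0 < p < n]
        # Lowest entity overlap wins; ties favour the position closest to b.
        best = min(candidates, key=lambda p: (entity_overlap(entities, p, window), abs(p - b)))
        adjusted.add(best)
    return sorted(adjusted)


# Toy input (hypothetical): topic IDs and entity sets for six sentences.
topics = [0, 0, 0, 1, 1, 1]
entities = [{"court"}, {"court", "judge"}, {"judge"}, {"judge"}, {"tax"}, {"tax"}]

candidate = topic_boundaries(topics)           # [3]
adjusted = adjust_boundaries(topics, entities)  # [4]
```

In this toy input the topic change suggests a break before sentence 3, but the entity "judge" still links sentences 2 and 3, so the sketch moves the break to position 4, where no entity carries over.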


Text segmentation, Entity coherence, Latent Dirichlet allocation, Topic modeling



Kolawole J. Adebayo has received funding from the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. Luigi Di Caro and Guido Boella have received funding from the European Union’s H2020 research and innovation programme under the grant agreement No 690974 for the project “MIREL: MIning and REasoning with Legal texts”.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Adebayo Kolawole John (1, corresponding author)
  • Luigi Di Caro (1)
  • Guido Boella (1)
  1. Dipartimento di Informatica, Università di Torino, Torino, Italy
