Text Segmentation with Topic Modeling and Entity Coherence

  • Conference paper
Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 552)

Abstract

This paper describes a system that uses entity and topic coherence to improve Text Segmentation (TS) accuracy. First, the Latent Dirichlet Allocation (LDA) algorithm was used to obtain topics for the sentences in a document. We then performed entity mapping across a window of sentences to discover how entities transition from one sentence to the next. We used this information to support our LDA-based boundary detection by adjusting the candidate boundaries. We report the significance of the entity coherence approach as well as the superiority of our algorithm over existing works.
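
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the full method is not shown on this page, so the cosine-based coherence score, the one-gap adjustment rule, and all function names (`coherence_scores`, `local_minima`, `adjust_boundary`) are assumptions made for illustration. The per-sentence topic distributions and entity sets are toy data; in practice they would come from a trained LDA model and a POS tagger. The default window of 3 sentences and the zero-based sentence indices follow the notes below.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def block_vector(topic_dists, start, end):
    """Sum the topic distributions of sentences start..end-1."""
    n_topics = len(topic_dists[0])
    return [sum(d[k] for d in topic_dists[start:end]) for k in range(n_topics)]

def coherence_scores(topic_dists, window=3):
    """Score gap i (between sentence i and i+1) by comparing the window
    of sentences before the gap with the window after it (assumed scoring)."""
    scores = []
    for gap in range(len(topic_dists) - 1):
        left = block_vector(topic_dists, max(0, gap - window + 1), gap + 1)
        right = block_vector(topic_dists, gap + 1, gap + 1 + window)
        scores.append(cosine(left, right))
    return scores

def local_minima(scores):
    """Gap indices whose score is lower than both neighbouring gaps."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]

def entity_overlap(entities, gap, window=3):
    """Number of entities shared by the windows on either side of a gap."""
    left = set().union(*entities[max(0, gap - window + 1):gap + 1])
    right = set().union(*entities[gap + 1:gap + 1 + window])
    return len(left & right)

def adjust_boundary(entities, gap, window=3):
    """Move a candidate boundary by at most one gap toward the position
    with the least entity overlap; ties keep the original gap (assumed rule)."""
    candidates = [g for g in (gap, gap - 1, gap + 1)
                  if 0 <= g < len(entities) - 1]
    return min(candidates, key=lambda g: entity_overlap(entities, g, window))

# Toy input: six sentences, two topics, topic shift after sentence index 2.
topic_dists = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1],
               [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
entities = [{"court"}, {"court", "judge"}, {"judge"},
            {"tax"}, {"tax", "rate"}, {"rate"}]

scores = coherence_scores(topic_dists)
minima = local_minima(scores)
boundaries = [adjust_boundary(entities, g) for g in minima]
print(boundaries)  # [2]: a boundary between sentence index 2 and 3
```

On this toy input the topic-based minimum and the entity evidence agree. If the topic step had proposed gap 1 instead, `adjust_boundary(entities, 1)` would shift it to gap 2, where no entities are shared across the boundary.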

Notes

  1. Our system is being developed in the context of our bigger project Eunomos [6, 7].

  2. Otherwise called coherence score.

  3. We used the Stanford POS tagger. It is available at http://nlp.stanford.edu/software/tagger.shtml.

  4. Following our previous parameter \( w_{n} \), we use a window of 3 sentences as default.

  5. We use index here to mean the unique ID of a sentence, e.g., sentence 1 will have index 0, sentence 2 will have index 1, etc.

  6. I.e., the vector index which corresponds to the index of each sentence in the local minima.

  7. The Wikipedia dump was downloaded on July 30, 2015. It is accessible at https://dumps.wikimedia.org/enwiki/.

  8. It is available at https://radimrehurek.com/gensim/.

References

  1. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Advances in Automatic Text Summarization, pp. 111–121 (1999)

  2. Barzilay, R., Lapata, M.: Modeling local coherence: an entity-based approach. Comput. Linguist. 34(1), 1–34 (2008)

  3. Beeferman, D., Berger, A., Lafferty, J.: Statistical models for text segmentation. Mach. Learn. 34(1–3), 177–210 (1999)

  4. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. ACM (2006)

  5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  6. Boella, G., Di Caro, L., Humphreys, L., Robaldo, L., Rossi, R., van der Torre, L.: Eunomos, a legal document and knowledge management system for the web to provide relevant, reliable and up-to-date information on the law. Artif. Intell. Law 24(3), 245–283 (2016)

  7. Boella, G., Di Caro, L., Ruggeri, A., Robaldo, L.: Learning from syntax generalizations for automatic semantic annotation. J. Intell. Inf. Syst. 43(2), 231–246 (2014)

  8. Choi, F.Y.Y.: Advances in domain independent linear text segmentation. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp. 26–33. Association for Computational Linguistics (2000)

  9. Choi, F.Y.Y., Wiemer-Hastings, P., Moore, J.: Latent semantic analysis for text segmentation. In: Proceedings of EMNLP. Citeseer (2001)

  10. Dias, G., Alves, E., Lopes, J.G.P.: Topic segmentation algorithms for text summarization and passage retrieval: an exhaustive evaluation. In: AAAI, vol. 7, pp. 1334–1339 (2007)

  11. Du, L., Pate, J.K., Johnson, M.: Topic segmentation in an ordering-based topic model (2015)

  12. Eisenstein, J.: Hierarchical text segmentation from multi-scale lexical cohesion. In: Proceedings of Human Language Technologies: the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 353–361. Association for Computational Linguistics (2009)

  13. Grosz, B.J., Weinstein, S., Joshi, A.K.: Centering: a framework for modeling the local coherence of discourse. Comput. Linguist. 21(2), 203–225 (1995)

  14. Halliday, M.A.K., Hasan, R.: Cohesion in English. Routledge (2014)

  15. Hearst, M.A.: Texttiling: a quantitative approach to discourse segmentation. Technical report. Citeseer (1993)

  16. Hearst, M.A.: Texttiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist. 23(1), 33–64 (1997)

  17. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (1999)

  18. Kaufmann, S.: Cohesion and collocation: using context vectors in text segmentation. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 591–595. Association for Computational Linguistics (1999)

  19. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)

  20. Li, Y., McLean, D., Bandar, Z.A., O’Shea, J.D., Crockett, K.: Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18(8), 1138–1150 (2006)

  21. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text-Interdiscip. J. Study Discourse 8(3), 243–281 (1988)

  22. Misra, H., Yvon, F., Cappé, O., Jose, J.: Text segmentation: a topic modeling perspective. Inf. Process. Manage. 47(4), 528–544 (2011)

  23. Misra, H., Yvon, F., Jose, J.M., Cappé, O.: Text segmentation via topic modeling: an analytical study. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1553–1556. ACM (2009)

  24. Passonneau, R.J., Litman, D.J.: Discourse segmentation by human and automated means. Comput. Linguist. 23(1), 103–139 (1997)

  25. Pevzner, L., Hearst, M.A.: A critique and improvement of an evaluation metric for text segmentation. Comput. Linguist. 28(1), 19–36 (2002)

  26. Reynar, J.C.: Statistical models for topic segmentation. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 357–364. Association for Computational Linguistics (1999)

  27. Riedl, M., Biemann, C.: Text segmentation with topic models. J. Lang. Technol. Comput. Linguist. 27(1), 47–69 (2012)

  28. Riedl, M., Biemann, C.: Topictiling: a text segmentation algorithm based on LDA. In: Proceedings of ACL 2012 Student Research Workshop, pp. 37–42. Association for Computational Linguistics (2012)

  29. Utiyama, M., Isahara, H.: A statistical model for domain-independent text segmentation. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 499–506. Association for Computational Linguistics (2001)

Acknowledgments

Kolawole J. Adebayo has received funding from the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. Luigi Di Caro and Guido Boella have received funding from the European Union’s H2020 research and innovation programme under the grant agreement No 690974 for the project “MIREL: MIning and REasoning with Legal texts”.

Author information

Corresponding author

Correspondence to Adebayo Kolawole John.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

John, A.K., Di Caro, L., Boella, G. (2017). Text Segmentation with Topic Modeling and Entity Coherence. In: Abraham, A., Haqiq, A., Alimi, A., Mezzour, G., Rokbani, N., Muda, A. (eds) Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS 2016). HIS 2016. Advances in Intelligent Systems and Computing, vol 552. Springer, Cham. https://doi.org/10.1007/978-3-319-52941-7_18

  • DOI: https://doi.org/10.1007/978-3-319-52941-7_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52940-0

  • Online ISBN: 978-3-319-52941-7

  • eBook Packages: Engineering, Engineering (R0)
