Text Segmentation Techniques: A Critical Review

Innovative Computing, Optimization and Its Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 741)

Abstract

Text segmentation is a method of splitting a document into smaller parts, usually called segments, and is widely used in text processing. Each segment carries its own meaning. Segments may be words, sentences, topics, phrases, or any other information unit, depending on the text analysis task. This study presents the various reasons for using text segmentation in different analysis approaches. We categorize the types of documents and languages used. The main contributions of this study are a summary of 50 research papers and an overview of the past decade (January 2007 to January 2017) of research that applied text segmentation as the main approach for analysing text. The results reveal the popularity of text segmentation for analysing different languages. In addition, the word appears to be the most practical and usable segment, as it is a smaller unit than the phrase, sentence, or line.
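
As a rough illustration of the segment types named in the abstract, the following minimal Python sketch splits a short English text into sentence segments and then into word segments. It is not taken from the chapter: the function names `segment_sentences` and `segment_words`, the sample text, and the naive punctuation and whitespace rules are all assumptions made for illustration, and they would not suit languages without explicit word delimiters.

```python
import re


def segment_sentences(text):
    """Naively split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment_words(text):
    """Naively split text into word segments: runs of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


# Hypothetical sample text, chosen only to exercise both functions.
sample = "Text segmentation splits a document. Each segment has meaning!"
sentences = segment_sentences(sample)
words = segment_words(sentences[0])
```

Here `sentences` holds the two sentence segments of `sample`, and `words` holds the word segments of the first sentence. A rule of this kind relies on whitespace and punctuation, so it only hints at why word segmentation needs other techniques for text that lacks such delimiters.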

Acknowledgements

We would like to thank the First EAI International Conference on Computer Science and Engineering for the opportunity to present our paper and to extend it further. This research was partially supported by Sunway University Internal Research Grant No. INT-FST-IS-0114-07 and Sunway-Lancaster Grant SGSSL-FST-DCIS-0115-11.

Author information

Correspondence to Irina Pak.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Pak, I., Teh, P.L. (2018). Text Segmentation Techniques: A Critical Review. In: Zelinka, I., Vasant, P., Duy, V., Dao, T. (eds) Innovative Computing, Optimization and Its Applications. Studies in Computational Intelligence, vol 741. Springer, Cham. https://doi.org/10.1007/978-3-319-66984-7_10

  • DOI: https://doi.org/10.1007/978-3-319-66984-7_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66983-0

  • Online ISBN: 978-3-319-66984-7

  • eBook Packages: Engineering, Engineering (R0)
