Recognising Summary Articles

  • Mark Fisher
  • Dyaa Albakour
  • Udo Kruschwitz
  • Miguel Martinez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)

Abstract

Online content providers process massive streams of text to supply topics and entities of interest to their customers. In doing so, they face several information overload problems. Apart from identifying topically relevant articles, these include identifying duplicates and filtering out summary articles that consist of disparate topical sections. From a media monitoring perspective such summary articles are noise; an end user, however, might be interested in just those articles. In this paper, we introduce the recognition of summary articles as a novel task and present theoretical and experimental work towards addressing the problem. Rather than treating this as a single-step binary classification task, we propose a two-step framework of boundary detection followed by classification. Boundary detection is achieved with a bi-directional LSTM sequence learner. Structural features are then extracted using the boundaries and clusters derived from the output of this LSTM. A range of classifiers is then applied to recognise summaries, including a convolutional neural network (CNN) in which we treat articles as 1-dimensional structural ‘images’. A corpus of natural summary articles is collected for evaluation using the Signal 1M news dataset. To assess how well our framework generalises, we also investigate its performance on synthetic summaries. We show that our structural features sustain their performance under generalisation, in comparison to baseline bag-of-words and word2vec classifiers.
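
The two-step idea (boundaries first, structural features second) can be illustrated with a minimal sketch. It is not the authors' implementation: the per-sentence boundary probabilities are assumed to come from some sequence tagger, and the function names, the 0.5 threshold and the three features are hypothetical choices made purely for illustration.

```python
from statistics import mean


def segments_from_boundaries(boundary_probs, threshold=0.5):
    """Split sentence indices into half-open (start, end) segments.

    boundary_probs[i] is an ASSUMED probability that sentence i ends a
    segment, e.g. the per-sentence output of a sequence tagger.
    """
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        is_last = i == len(boundary_probs) - 1
        if p >= threshold or is_last:
            segments.append((start, i + 1))
            start = i + 1
    return segments


def structural_features(segments):
    """Compute a few illustrative (hypothetical) structural features."""
    lengths = [end - start for start, end in segments]
    if not lengths:
        return {"n_segments": 0, "mean_segment_len": 0.0,
                "largest_segment_share": 0.0}
    return {
        "n_segments": len(lengths),
        "mean_segment_len": mean(lengths),
        # Share of the article covered by its longest segment.
        "largest_segment_share": max(lengths) / sum(lengths),
    }


# Toy article of six sentences with two confident boundaries.
probs = [0.1, 0.9, 0.2, 0.1, 0.8, 0.3]
segs = segments_from_boundaries(probs)
feats = structural_features(segs)
print(segs)
print(feats)
```

An article with many short segments and no dominant one would plausibly look more like a summary article than one covered by a single long segment; a downstream classifier would take a feature vector of this kind as input.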

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mark Fisher (1, 2)
  • Dyaa Albakour (2), email author
  • Udo Kruschwitz (1)
  • Miguel Martinez (2)

  1. School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
  2. Signal, London, UK
