Comparative Analysis of the Informativeness and Encyclopedic Style of the Popular Web Information Sources

Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 320)


Decision making today often relies on information found in various Internet sources. Texts in the encyclopedic style, which contain mostly factual information, are preferred. We propose combining the logical-linguistic model with the Universal Dependencies treebank to extract facts of various quality levels from texts. Using Random Forest as the classification algorithm, we identify the types of facts and the types of words that most affect the encyclopedic style of a text. We evaluate our approach on four corpora based on Wikipedia, social media, and mass media texts. Our classifier achieves an F-measure of over 90%.
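
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F-measure can be computed for a binary classifier. It assumes the balanced variant (F1, the harmonic mean of precision and recall); the abstract does not state which variant was used. The function name `f_measure` and the toy labels are hypothetical and are not data from the paper.

```python
def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """F-beta score for one positive class; beta=1 gives the balanced F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy labels (assumption): 1 = encyclopedic-style text, 0 = other.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = f_measure(y_true, y_pred)
print(round(score, 3))
```

In this toy case precision and recall are both 3/4, so the score is 0.75; a score above 0.9 requires both precision and recall to be high at the same time.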


Encyclopedic, Informativeness, Universal dependency, Random Forest, Facts extraction, Wikipedia, Mass media



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine
  2. Poznań University of Economics and Business, Poznań, Poland
  3. Institute of Information and Computational Technologies, Almaty, Kazakhstan
  4. Al-Farabi Kazakh National University, Almaty, Kazakhstan