NgramPOS: a bigram-based linguistic and statistical feature process model for unstructured text classification

Wireless Networks

Abstract

Research in the financial domain has shown that the sentiment of stock news has a profound impact on trading volume, volatility, stock prices and firm earnings. In-depth analysis of stock news now draws on financial reviews from various social networking and marketing sites to help improve decision making. Such reviews, however, take the form of unstructured text, which requires natural language processing (NLP) to extract the sentiments. Accordingly, in this study we investigate the use of NLP tasks in an effort to improve the performance of sentiment classification when evaluating the information content of financial news as an instrument in an investment decision support system. At present, feature extraction approaches are mainly based on the occurrence frequency of words. Therefore, low-frequency linguistic features that could be critical in sentiment classification are typically ignored. In this research, we attempt to improve current sentiment analysis approaches for financial news classification by focusing on low-frequency but informative linguistic expressions. Our proposed combination of low- and high-frequency linguistic expressions contributes a novel set of features for sentiment classification. The experimental results show that an optimal Ngram feature selection (a combination of optimal unigram and bigram features) enhances sentiment classification accuracy compared with other types of feature sets.
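The following is a minimal sketch of the point the abstract makes about frequency-based feature extraction, not the authors' NgramPOS procedure. It counts unigrams and bigrams over a few toy headlines and applies a plain frequency cutoff, showing how a potentially informative bigram can be discarded. The headlines, the helper names `ngrams` and `ngram_counts`, and the `min_count` threshold are all assumptions introduced for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(docs, n):
    """Count n-grams over whitespace-tokenised, lower-cased documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    return counts


# Hypothetical toy headlines, invented for this example only.
docs = [
    "shares rise after profit beats forecast",
    "shares fall after profit warning",
    "shares rise on strong earnings",
]

unigrams = ngram_counts(docs, 1)
bigrams = ngram_counts(docs, 2)

# Combined unigram + bigram feature space (keys are 1-tuples and 2-tuples).
combined = unigrams + bigrams

# Frequency-only selection: keep features seen at least `min_count` times.
min_count = 2  # assumed threshold, not taken from the paper
frequent = {gram for gram, count in combined.items() if count >= min_count}

# Features that a pure frequency cutoff would throw away.
dropped = set(combined) - frequent

print(sorted(frequent, key=lambda g: (len(g), g)))
print(("profit", "warning") in dropped)
```

In this toy data the bigram `("profit", "warning")` occurs only once, so the cutoff removes it even though it plausibly carries negative sentiment. A selection step that scores features by something other than raw frequency would be needed to retain it.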

Notes

  1. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

  2. http://nlp.stanford.edu:8080/corenlp/process.

Acknowledgements

This work is supported in part by Universiti Tun Hussein Onn Malaysia.

Corresponding author

Correspondence to Sepideh Foroozan Yazdani.

Cite this article

Foroozan Yazdani, S., Tan, Z., Kakavand, M. et al. NgramPOS: a bigram-based linguistic and statistical feature process model for unstructured text classification. Wireless Netw 28, 1251–1261 (2022). https://doi.org/10.1007/s11276-018-01909-0

  • DOI: https://doi.org/10.1007/s11276-018-01909-0
