Document Representation for Text Analytics in Finance

Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 345)


The automated analysis of unstructured data that is directly or indirectly relevant to developments on financial markets has attracted attention from researchers and practitioners alike. Recent advances in natural language processing enable a richer representation of textual data with respect to its semantic and syntactic characteristics. In particular, distributed representations of words and documents, commonly referred to as embeddings, are a promising alternative to traditional count-based representations. Consequently, this paper investigates the use of these approaches for text analytics in finance. To this end, we synthesize traditional and more recent text representation techniques into a coherent framework and explain the methods it comprises. Building on this distinction, we systematically analyze how these methods have been used in the financial domain to date. The results indicate a surprisingly rare application of the outlined techniques. For precisely this reason, this paper aims to connect finance and natural language processing research and may therefore help in applying new methods at the intersection of the two fields.
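The contrast the abstract draws, between traditional count-based text representations and distributional ones, can be sketched in a few lines. The following toy example is our own illustration, not taken from the paper; the corpus and function names are hypothetical. It builds a sparse bag-of-words vector and simple window-based co-occurrence vectors, the count-based precursor to learned embeddings:

```python
from collections import Counter

def bag_of_words(doc, vocab):
    """Sparse count vector: one dimension per vocabulary term."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def cooccurrence_vectors(docs, vocab, window=2):
    """Distributional vectors: each word is represented by how often
    it co-occurs with every other vocabulary word within a window."""
    index = {w: i for i, w in enumerate(vocab)}
    vecs = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if i != j:
                    vecs[index[w]][index[doc[j]]] += 1
    return vecs

# Tiny hypothetical corpus of tokenized financial headlines.
docs = [["stock", "prices", "rise"], ["stock", "prices", "fall"]]
vocab = sorted({w for d in docs for w in d})  # ['fall', 'prices', 'rise', 'stock']

bow = bag_of_words(docs[0], vocab)
cooc = cooccurrence_vectors(docs, vocab)
# 'rise' and 'fall' share the same contexts (stock, prices) and thus
# receive identical co-occurrence vectors -- the distributional intuition
# that embeddings exploit, which a bag-of-words vector cannot capture.
```

Learned embeddings (e.g. word2vec [11] or paragraph vectors [16]) replace such raw counts with dense, low-dimensional vectors optimized from prediction tasks, but the underlying idea of representing a word by its contexts is the same.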


Keywords: Document representation · Text mining · Word embeddings · Conceptual framework · Literature review



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Chair of Electronic Finance and Digital Markets, University of Goettingen, Goettingen, Germany
