Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing

  • Lukas Borke
  • Wolfgang K. Härdle
Chapter
Part of the Springer Handbooks of Computational Statistics book series (SHCS)

Abstract

QuantNet is an integrated web-based environment that hosts different types of statistics-related documents and program code. Its goal is to foster reproducibility and to offer a platform for sharing validated knowledge native to the social web. Increasing information retrieval (IR) efficiency requires incorporating semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM), and latent semantic analysis (LSA). LSA has been used successfully for IR as a technique that captures semantic relations between terms and incorporates them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations combined with hierarchical clustering yield good results under M3 evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization are available at http://quantlet.de. The driving technology behind it is Q3-D3-LSA, which combines the “GitHub API based QuantNet Mining infrastructure in R”, LSA, and a D3 implementation.
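
To make the role of semantic information in a similarity measure concrete, the following is a minimal sketch in Python, not the authors' R implementation. It contrasts plain VSM cosine similarity with one common GVSM variant that weights terms by their co-occurrence matrix D Dᵀ. The toy term-by-document matrix, the function names, and the choice of this particular GVSM variant are all assumptions made for illustration; LSA, which would additionally require a truncated singular value decomposition, is not shown.

```python
import math

# Hypothetical toy data: rows are terms, columns are documents.
# doc 0 = {regression}, doc 1 = {linear}, doc 2 = {regression, linear}, doc 3 = {clustering}
D = [
    [1, 0, 1, 0],  # regression
    [0, 1, 1, 0],  # linear
    [0, 0, 0, 1],  # clustering
]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, v):
    """Matrix-vector product."""
    return [dot(row, v) for row in matrix]


def cosine_vsm(matrix, i, j):
    """Plain VSM: cosine of the angle between document vectors i and j."""
    u, v = column(matrix, i), column(matrix, j)
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def term_term(matrix):
    """Term-by-term co-occurrence matrix D D^T (assumed GVSM variant)."""
    return [[dot(r1, r2) for r2 in matrix] for r1 in matrix]


def cosine_gvsm(matrix, i, j):
    """GVSM sketch: cosine under the bilinear form u^T (D D^T) v."""
    t = term_term(matrix)
    u, v = column(matrix, i), column(matrix, j)
    numerator = dot(u, mat_vec(t, v))
    denominator = math.sqrt(dot(u, mat_vec(t, u)) * dot(v, mat_vec(t, v)))
    return numerator / denominator


vsm_01 = cosine_vsm(D, 0, 1)    # documents 0 and 1 share no term
gvsm_01 = cosine_gvsm(D, 0, 1)  # but their terms co-occur in document 2
gvsm_03 = cosine_gvsm(D, 0, 3)  # no co-occurrence link between these terms
print(vsm_01, gvsm_01, gvsm_03)
```

In this toy setting, documents 0 and 1 have zero similarity under the plain VSM because they share no term, while the GVSM sketch assigns them a positive similarity because "regression" and "linear" co-occur in document 2. Documents 0 and 3 stay unrelated under both measures.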

Keywords

  • Computational statistics
  • Transparency
  • Dissemination or quantlets
  • Quantlets

Notes

Acknowledgements

Financial support from the Deutsche Forschungsgemeinschaft via CRC “Economic Risk” and IRTG 1792 “High Dimensional Non Stationary Time Series,” Humboldt-Universität zu Berlin, is gratefully acknowledged.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Humboldt-Universität zu Berlin, R.D.C - Research Data Center, SFB 649 “Economic Risk”, Berlin, Germany
  2. Humboldt-Universität zu Berlin, C.A.S.E. - Center for Applied Statistics and Economics, Berlin, Germany
  3. School of Business, Singapore Management University, Singapore, Singapore