
Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing

Handbook of Big Data Analytics

Part of the book series: Springer Handbooks of Computational Statistics (SHCS)

Abstract

QuantNet is an integrated web-based environment that hosts different types of statistics-related documents and program code. Its goal is to foster reproducibility and to offer a platform for sharing validated knowledge native to the social web. Increasing information retrieval (IR) efficiency requires incorporating semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM), and latent semantic analysis (LSA). LSA has been used successfully for IR as a technique for capturing semantic relations between terms and inserting them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations together with hierarchical clustering yield good results under M³ evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization are available at http://quantlet.de. The driving technology behind it is Q3-D3-LSA, the combination of “GitHub API based QuantNet Mining infrastructure in R”, LSA, and a D3 implementation.
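To make the difference between the three models concrete, the following is a minimal sketch in Python with NumPy, not the chapter's R implementation. It assumes a common formulation in which each model projects the term-by-document matrix D before taking cosine similarities: no projection for the VSM, projection by D^T for the GVSM (term-term correlations), and projection onto the first k left singular vectors for LSA. The toy matrix, the term labels, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy term-by-document matrix D (rows: terms, columns: documents).
TERMS = ["volatility", "variance", "cluster"]
D = np.array([
    [1, 0, 1, 0, 0],  # volatility
    [0, 1, 1, 0, 0],  # variance
    [0, 0, 0, 1, 1],  # cluster
], dtype=float)


def cosine_matrix(projected):
    """Cosine similarity between the columns of a (projected) document matrix."""
    norms = np.linalg.norm(projected, axis=0)
    norms[norms == 0] = 1.0  # all-zero columns keep zero similarity
    unit = projected / norms
    return unit.T @ unit


def vsm_similarity(d):
    """VSM: documents are compared directly in term space."""
    return cosine_matrix(d)


def gvsm_similarity(d):
    """GVSM (assumed term-term variant): unnormalized similarity D^T (D D^T) D."""
    return cosine_matrix(d.T @ d)


def lsa_similarity(d, k):
    """LSA: documents are compared in the rank-k latent space of the SVD."""
    u, _, _ = np.linalg.svd(d, full_matrices=False)
    return cosine_matrix(u[:, :k].T @ d)


# Documents 0 and 1 share no term, but document 2 uses both of their terms.
for name, sim in [
    ("VSM", vsm_similarity(D)),
    ("GVSM", gvsm_similarity(D)),
    ("LSA (k=2)", lsa_similarity(D, 2)),
]:
    print(f"{name}: sim(doc0, doc1) = {sim[0, 1]:.2f}")
```

On this toy matrix, documents 0 and 1 have no term in common, so the plain VSM treats them as unrelated. The GVSM and LSA variants assign them a positive similarity because a third document uses both terms together, which is the kind of semantic relation the abstract refers to.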



Acknowledgements

Financial support from the Deutsche Forschungsgemeinschaft via CRC “Economic Risk” and IRTG 1792 “High Dimensional Non Stationary Time Series,” Humboldt-Universität zu Berlin, is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Wolfgang K. Härdle.

Appendix

See Figs. 16.26, 16.27, 16.28, 16.29, 16.30, and 16.31.

Fig. 16.29

Dendrogram created by HC (Ward method) in the LSA model, cut into 6 clusters and 30 subclusters; 137 Gestalten, a subset from the books SFE and SFS and the project IBT

Fig. 16.30

Gestalt “SFEGBMProcess”, which simulates geometric Brownian motion, comprises three Quantlets in three programming languages: R, Matlab, and SAS

Fig. 16.31

Back end view: Quantlet organization on GitHub

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Borke, L., Härdle, W.K. (2018). Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing. In: Härdle, W., Lu, HS., Shen, X. (eds) Handbook of Big Data Analytics. Springer Handbooks of Computational Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-18284-1_16
