Abstract
QuantNet is an integrated web-based environment that hosts different types of statistics-related documents and program code. Its goal is to foster reproducibility and to offer a platform for sharing validated knowledge native to the social web. Increasing information retrieval (IR) efficiency requires incorporating semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM), and latent semantic analysis (LSA). LSA has been used successfully for IR purposes as a technique for capturing semantic relations between terms and inserting them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations together with hierarchical clustering reveal good results under M³ evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization can be found and applied under http://quantlet.de. The driving technology behind it is Q3-D3-LSA, which combines the “GitHub API based QuantNet Mining infrastructure in R”, LSA, and a D3 implementation.
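To make the idea of inserting term relations into the document similarity measure concrete, the following minimal sketch contrasts the plain VSM with one common GVSM formulation. It is an illustration under stated assumptions, not the chapter's implementation: the three-document corpus is invented, raw term counts are used without any weighting, and the GVSM variant shown maps each document d to Dᵀd, where D is the term-by-document matrix, so that document similarity is based on d_iᵀ(DDᵀ)d_j. LSA is left out because it additionally needs a truncated singular value decomposition. All function names are hypothetical.

```python
import math

# Toy corpus, invented for illustration (not QuantNet data).
DOCS = [
    "option pricing model",
    "option volatility smile",
    "volatility estimation garch",
]


def term_document_matrix(texts):
    """Build a raw-count term-by-document matrix D (rows: terms, columns: documents)."""
    vocab = sorted({word for text in texts for word in text.split()})
    return [[text.split().count(word) for text in texts] for word in vocab]


def column(matrix, j):
    """Return document j as a vector over the vocabulary."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vsm_similarity(matrix, i, j):
    """Plain VSM: cosine between the raw term vectors of documents i and j."""
    return cosine(column(matrix, i), column(matrix, j))


def gvsm_similarity(matrix, i, j):
    """Assumed GVSM variant: cosine after mapping each document d to D^T d.

    The mapped vector holds the inner products of d with every document in
    the corpus, so two documents can be similar through shared neighbours.
    """
    n_docs = len(matrix[0])

    def mapped(k):
        d_k = column(matrix, k)
        return [
            sum(a * b for a, b in zip(column(matrix, m), d_k))
            for m in range(n_docs)
        ]

    return cosine(mapped(i), mapped(j))


D = term_document_matrix(DOCS)
print("VSM  sim(doc0, doc2):", vsm_similarity(D, 0, 2))
print("GVSM sim(doc0, doc2):", gvsm_similarity(D, 0, 2))
```

In this toy corpus the first and third documents share no term, so their VSM similarity is zero. Both overlap with the second document, and the GVSM mapping picks up that indirect link and yields a small positive similarity.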
Acknowledgements
Financial support from the Deutsche Forschungsgemeinschaft via CRC “Economic Risk” and IRTG 1792 “High Dimensional Non Stationary Time Series,” Humboldt-Universität zu Berlin, is gratefully acknowledged.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Borke, L., Härdle, W.K. (2018). Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing. In: Härdle, W., Lu, HS., Shen, X. (eds) Handbook of Big Data Analytics. Springer Handbooks of Computational Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-18284-1_16
DOI: https://doi.org/10.1007/978-3-319-18284-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18283-4
Online ISBN: 978-3-319-18284-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)