Abstract
In this article, we study the scale-dependent dimensionality and overall structure of text data using a method that measures the correlation dimension at different scales. As experimental results, we present analyses of text data sets drawn from the Reuters and Europarl corpora and compare them with artificially generated point sets and with speech data. The results reflect some typical properties of the data, and we discuss how the method could be used to improve various data analysis applications.
This work has been supported by the Academy of Finland and a grant from the Department of Mathematics and Statistics at the University of Helsinki (IK).
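To make the notion of a scale-dependent correlation dimension concrete, the following is a minimal sketch in the spirit of the Grassberger-Procaccia correlation integral C(r), the fraction of point pairs lying within distance r of each other. It is not the implementation used in the paper: the function names `correlation_integral` and `local_dimension`, the Euclidean metric, and the synthetic test data are all assumptions made for illustration. The local slope of log C(r) against log r serves as a dimension estimate at scale r.

```python
import numpy as np


def correlation_integral(points, radii):
    """Fraction of distinct point pairs with Euclidean distance <= r, for each r.

    Hypothetical helper for illustration; not taken from the paper.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # Squared pairwise distances via |x|^2 + |y|^2 - 2 x.y
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    np.maximum(d2, 0.0, out=d2)  # clip small negative rounding errors
    # Keep each unordered pair once (strict upper triangle)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])
    dists.sort()
    # searchsorted with side="right" counts distances <= r
    counts = np.searchsorted(dists, np.asarray(radii, dtype=float), side="right")
    return counts / dists.size


def local_dimension(radii, c):
    """Slope of log C(r) versus log r between consecutive radii.

    Assumes every C(r) is strictly positive, otherwise the logarithm diverges.
    """
    log_r = np.log(np.asarray(radii, dtype=float))
    log_c = np.log(np.asarray(c, dtype=float))
    return np.diff(log_c) / np.diff(log_r)


if __name__ == "__main__":
    # Synthetic example (an assumption, not data from the paper):
    # a flat two-dimensional sheet embedded in ten-dimensional space.
    rng = np.random.default_rng(0)
    pts = np.zeros((1000, 10))
    pts[:, :2] = rng.random((1000, 2))
    radii = np.array([0.05, 0.1, 0.2])
    c = correlation_integral(pts, radii)
    print(local_dimension(radii, c))  # slopes somewhat below 2 due to edge effects
```

For a point set that fills a d-dimensional region uniformly, C(r) grows roughly like r^d at small r, so the slopes approach d. Data whose slope changes with r has a different apparent dimensionality at different scales, which is the kind of behaviour the abstract refers to.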
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kivimäki, I., Lagus, K., Nieminen, I.T., Väyrynen, J.J., Honkela, T. (2010). Using Correlation Dimension for Analysing Text Data. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds) Artificial Neural Networks – ICANN 2010. ICANN 2010. Lecture Notes in Computer Science, vol 6352. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15819-3_49
DOI: https://doi.org/10.1007/978-3-642-15819-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15818-6
Online ISBN: 978-3-642-15819-3
eBook Packages: Computer Science (R0)