Using Correlation Dimension for Analysing Text Data

Kivimäki, Ilkka; Lagus, Krista; Nieminen, Ilari T.; Väyrynen, Jaakko J.; Honkela, Timo

doi:10.1007/978-3-642-15819-3_49

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6352)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

In this article, we study the scale-dependent dimensionality properties and overall structure of text data with a method that measures correlation dimension at different scales. As experimental results, we present an analysis of text data sets drawn from the Reuters and Europarl corpora, which are compared to artificially generated point sets. A comparison is also made with speech data. The results reflect some of the typical properties of the data, and we discuss how our method could be used to improve various data analysis applications.
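To make the idea of a scale-dependent correlation dimension concrete, the following is a minimal sketch, assuming the standard Grassberger-Procaccia correlation integral C(r) (the fraction of point pairs closer than r) and taking the local slope of log C(r) against log r as the dimension estimate at each scale. The function names, the Euclidean metric, and the choice of radii are illustrative assumptions and are not taken from the paper; the authors' actual procedure may differ.

```python
import numpy as np


def correlation_integral(points, radii):
    """Return C(r): the fraction of distinct point pairs with distance < r.

    `points` is an (n, d) array; `radii` is a 1-D array of scales.
    Euclidean distance is an assumption made for this sketch.
    """
    points = np.asarray(points, dtype=float)
    radii = np.asarray(radii, dtype=float)
    n = len(points)
    # Squared pairwise distances via |x|^2 + |y|^2 - 2 x.y
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (points @ points.T)
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu_indices(n, k=1)
    dists = np.sqrt(np.maximum(d2[upper], 0.0))
    dists.sort()
    # For sorted distances, searchsorted gives the count below each radius.
    counts = np.searchsorted(dists, radii, side="left")
    return counts / dists.size


def local_correlation_dimension(points, radii):
    """Estimate the dimension between consecutive radii.

    Returns the finite-difference slope of log C(r) versus log r,
    one value per interval (len(radii) - 1 values). The radii must be
    large enough that C(r) > 0, otherwise the logarithm is undefined.
    """
    radii = np.asarray(radii, dtype=float)
    c = correlation_integral(points, radii)
    return np.diff(np.log(c)) / np.diff(np.log(radii))
```

For example, points sampled uniformly from a two-dimensional square embedded in a higher-dimensional space should give local slopes close to 2 at small radii, regardless of the embedding dimension. Plotting the slopes against the radii shows how the estimate changes with scale.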

This work has been supported by the Academy of Finland and a grant from the Department of Mathematics and Statistics at the University of Helsinki (IK).

Author information

Authors and Affiliations

Adaptive Informatics Research Centre, Aalto University School of Science and Technology,
Ilkka Kivimäki, Krista Lagus, Ilari T. Nieminen, Jaakko J. Väyrynen & Timo Honkela

Editor information

Editors and Affiliations

Department of Informatics, TEI of Thessaloniki, 57400, Sindos, Greece
Konstantinos Diamantaras
School of Physics, Astronomy, and Informatics, Department of Informatics, Nicolaus Copernicus University, ul. Grudziadzka 5, 87-100, Torun, Poland
Wlodek Duch
Department of Forestry and Management of the Environment and Natural Resources, Democritus University of Thrace, Pantazidou 193, 68200, Orestiada, Thrace, Greece
Lazaros S. Iliadis

About this paper

Cite this paper

Kivimäki, I., Lagus, K., Nieminen, I.T., Väyrynen, J.J., Honkela, T. (2010). Using Correlation Dimension for Analysing Text Data. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds) Artificial Neural Networks – ICANN 2010. ICANN 2010. Lecture Notes in Computer Science, vol 6352. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15819-3_49

DOI: https://doi.org/10.1007/978-3-642-15819-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15818-6
Online ISBN: 978-3-642-15819-3
eBook Packages: Computer Science, Computer Science (R0)
