Robustness Issues in Text Mining

  • Marco Turchi
  • Domenico Perrotta
  • Marco Riani
  • Andrea Cerioli
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 190)

Abstract

We extend the Forward Search approach for robust data analysis to address problems in text mining. In this domain, a dataset is a collection of an arbitrary number of documents, each represented as a vector of thousands of elements according to the vector space model. When the number of variables v is this large and the dataset size n is smaller by orders of magnitude, the traditional Mahalanobis metric cannot be used as a distance between documents. We show that, by monitoring the cosine (dis)similarity measure with the Forward Search, it is possible to perform robust estimation for a document collection and to order the documents so that the most dissimilar ones (possible outliers for that collection) are left at the end. We also show that the presence of several groups of documents in the collection is clearly detected with multiple starts of the Forward Search.
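The idea summarised above can be illustrated with a small sketch. It is not the authors' implementation (the abstract gives no algorithmic detail); it is a simplified forward search under stated assumptions: the fitted "model" at each step is just the centroid of the current subset, the distance is one minus the cosine similarity to that centroid, and the function names `cosine_dissimilarity` and `forward_search` and the toy matrix `docs` are hypothetical. As background, with v much larger than n the sample covariance matrix is singular, which is why a Mahalanobis distance is not available and a cosine-based measure is used instead.

```python
import numpy as np


def cosine_dissimilarity(X, c):
    """Return 1 - cosine similarity between each row of X and the vector c."""
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(c)
    # Guard against zero-length vectors: they get dissimilarity 1.
    safe = np.where(norms == 0, 1.0, norms)
    return 1.0 - (X @ c) / safe


def forward_search(X, start):
    """Simplified forward search driven by cosine dissimilarity.

    X     : (n, v) array, one document vector per row.
    start : indices of the initial subset (assumed outlier-free).

    Returns the order in which documents first enter the subset and the
    monitored statistic (minimum dissimilarity among documents still
    outside the subset) at each step.
    """
    n = X.shape[0]
    subset = np.array(sorted(start))
    entry_order = [int(i) for i in subset]
    monitored = []
    for m in range(len(subset), n):
        # Assumption: the centroid of the current subset plays the role
        # of the fitted model.
        centroid = X[subset].mean(axis=0)
        d = cosine_dissimilarity(X, centroid)
        outside = np.setdiff1d(np.arange(n), subset)
        monitored.append(float(d[outside].min()))
        # The next subset holds the m + 1 documents closest to the centroid.
        subset = np.argsort(d, kind="stable")[: m + 1]
        for i in subset:
            if int(i) not in entry_order:
                entry_order.append(int(i))
    return entry_order, monitored


# Toy term-count matrix (hypothetical): five similar documents and one
# document (row 5) that uses a different vocabulary.
docs = np.array(
    [
        [5, 1, 0, 0],
        [4, 1, 0, 0],
        [6, 2, 0, 0],
        [5, 0, 1, 0],
        [4, 2, 0, 0],
        [0, 0, 1, 6],
    ],
    dtype=float,
)

order, min_dist = forward_search(docs, start=[0, 1])
print("entry order:", order)
print("monitored minimum dissimilarity:", np.round(min_dist, 3))
```

In this toy run the dissimilar document enters last and the monitored value jumps at the final step, which is the kind of signal the abstract refers to. Repeating the call with different `start` subsets would be one simple way to mimic multiple starts, although the chapter's actual procedure may differ.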

Keywords

Cosine similarity, document classification, forward search

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Marco Turchi (1)
  • Domenico Perrotta (1)
  • Marco Riani (2)
  • Andrea Cerioli (2)

  1. European Commission, Joint Research Centre, Brussels, Belgium
  2. Department of Economics, University of Parma, Parma, Italy
