Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

Article

Abstract

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed with Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions to represent document content in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, whereas traditional kernels slightly outperformed wavelet-based ones on full-text documents, an outcome attributed to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representing specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse thematic coverage in digital repositories.
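To make the contrast between a wavelet-based kernel and a traditional kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes the translation-invariant Morlet-type wavelet kernel commonly used with wavelet support vector machines, K(x, y) = prod_i h((x_i - y_i) / a) with h(u) = cos(1.75u) exp(-u^2 / 2). The toy term-weight vectors, the dilation value and all function names are hypothetical and chosen only for illustration.

```python
import math


def mother_wavelet(u):
    """Morlet-type mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2)."""
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)


def wavelet_kernel(x, y, dilation=1.0):
    """Translation-invariant wavelet kernel: product over dimensions
    of h((x_i - y_i) / dilation). The dilation is an assumed parameter."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    value = 1.0
    for xi, yi in zip(x, y):
        value *= mother_wavelet((xi - yi) / dilation)
    return value


def linear_kernel(x, y):
    """Traditional linear kernel: the plain dot product."""
    return sum(xi * yi for xi, yi in zip(x, y))


def gram_matrix(docs, kernel):
    """Pairwise kernel values for a list of document vectors."""
    return [[kernel(a, b) for b in docs] for a in docs]


# Hypothetical toy term-weight vectors: the first two documents share
# a dominant term, the third is about something else.
docs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]

wavelet_gram = gram_matrix(docs, wavelet_kernel)
linear_gram = gram_matrix(docs, linear_kernel)
```

Either Gram matrix could then be handed to a support vector machine implementation that accepts precomputed kernels, which keeps the classifier itself unchanged while the kernel is swapped.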

Keywords

Digital libraries, Text categorization, Machine learning, Support vector machines, Analogical information representation, Wavelet analysis

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Sándor Darányi (1)
  • Peter Wittek (1)
  • Milena Dobreva (2)
  1. Swedish School of Library and Information Science, University of Borås, Borås, Sweden
  2. Centre for Digital Library Research, University of Strathclyde, Glasgow, UK
