Text Mining in the SOMLib Digital Library System: The Representation of Topics and Genres

Applied Intelligence

Abstract

As the amount of textual information available in electronic form grows, more powerful methods for exploring, searching, and organizing it are needed. This paper presents the SOMLib digital library system, which is built on neural networks to provide text mining capabilities. At its foundation, the Self-Organizing Map provides content-based clustering of documents. An extended model, the Growing Hierarchical Self-Organizing Map, further detects subject hierarchies in a document collection: the neural network automatically adapts its size and structure during unsupervised training to reflect the topical hierarchy. By mining the weight vector structure of the trained maps, the system selects keywords that describe the various topical clusters. Text mining, however, has to incorporate more than the mere analysis of content: structural and genre information is also key to organizing and locating information. Using color-coding techniques, we integrate a structural analysis of documents, also based on Self-Organizing Maps, into the subject-based clustering, relying on metaphor graphics for intuitive visualization. We demonstrate the capabilities of the SOMLib system using collections of articles from various newspapers and magazines.
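As a rough illustration of the content-based clustering and keyword selection summarized in the abstract, the following sketch trains a deliberately tiny Self-Organizing Map on four toy document vectors and then reads keywords off the trained weight vectors. It is not the SOMLib implementation: the vocabulary, the document vectors, the function names (`train_som`, `best_matching_unit`, `top_terms`), and all parameter values are assumptions made for this example, and ranking terms by weight is a simplified stand-in for the system's keyword selection. The hierarchical growth of the Growing Hierarchical Self-Organizing Map is not shown.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights,
               key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))


def train_som(vectors, rows, cols, steps=300, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for step in range(steps):
        frac = step / steps
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)    # shrinking neighborhood
        x = vectors[rng.randrange(len(vectors))]     # pick one document
        bmu = best_matching_unit(weights, x)
        for unit, w in weights.items():
            grid_dist2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
            h = math.exp(-grid_dist2 / (2.0 * radius * radius))
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])       # pull unit toward x
    return weights


def top_terms(weight, vocab, k=2):
    """Rank vocabulary terms by their value in a unit's weight vector."""
    ranked = sorted(range(len(vocab)), key=lambda i: weight[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Hypothetical term-weight vectors: two "politics" and two "sports" articles.
vocab = ["election", "parliament", "goal", "league"]
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.9, 1.0],
]

weights = train_som(docs, rows=1, cols=2)
units = [best_matching_unit(weights, d) for d in docs]
for doc_index, unit in enumerate(units):
    print(doc_index, unit, top_terms(weights[unit], vocab))
```

On a realistic collection the map would have many more units and the vectors would come from a term-weighting scheme over the full vocabulary; the two-unit map here only shows that documents with similar term vectors land on the same unit and that the unit's weight vector can be inspected for descriptive terms.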




About this article

Cite this article

Rauber, A., Merkl, D. Text Mining in the SOMLib Digital Library System: The Representation of Topics and Genres. Applied Intelligence 18, 271–293 (2003). https://doi.org/10.1023/A:1023297920966

  • DOI: https://doi.org/10.1023/A:1023297920966
