Abstract
As the volume of text available in electronic form grows, more powerful methods for exploring, searching, and organizing it are needed. This paper presents the SOMLib digital library system, which builds on neural networks to provide text mining capabilities. At its foundation, the Self-Organizing Map provides content-based clustering of documents. An extended model, the Growing Hierarchical Self-Organizing Map, further detects subject hierarchies in a document collection: the neural network adapts its size and structure automatically during unsupervised training to reflect the topical hierarchy. By mining the weight vector structure of the trained maps, the system selects keywords that describe the various topical clusters. Text mining, however, has to incorporate more than the mere analysis of content; structural and genre information are also key to organizing and locating information. Using color-coding techniques, we integrate a structural analysis of documents based on Self-Organizing Maps into the subject-based clustering, relying on metaphor graphics for intuitive visualization. We demonstrate the capabilities of the SOMLib system using collections of articles from various newspapers and magazines.
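To make the two core ideas of the abstract concrete (content-based clustering with a Self-Organizing Map, and keyword selection from the trained weight vectors), the following is a minimal sketch and not the SOMLib implementation. The function names, the toy vocabulary, the document vectors, and all parameter values are assumptions chosen for illustration. The labeling step simply ranks terms by their weight in a unit, which is a simplification and not necessarily the labeling method used by the system.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda u: sum((w - xi) ** 2 for w, xi in zip(weights[u], x)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map on equal-length vectors (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per map unit, initialised with random values in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius both shrink as training proceeds.
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        sigma = max(sigma0 * (1.0 - frac), 0.5)
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Units near the winner on the grid are pulled along with it.
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def top_terms(weights, unit, vocabulary, k=2):
    """Label a unit with the k terms that carry the largest weight (simplified)."""
    ranked = sorted(range(len(vocabulary)),
                    key=lambda i: weights[unit][i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]


# Toy term-weight vectors over a four-word vocabulary (illustrative data only).
vocabulary = ["goal", "match", "vote", "election"]
documents = {
    "sports_1": [1.0, 1.0, 0.0, 0.0],
    "sports_2": [1.0, 0.5, 0.0, 0.0],
    "politics_1": [0.0, 0.0, 1.0, 1.0],
    "politics_2": [0.0, 0.0, 0.5, 1.0],
}

som = train_som(list(documents.values()), rows=1, cols=2)
placement = {name: best_matching_unit(som, vec)
             for name, vec in documents.items()}
labels = {unit: top_terms(som, unit, vocabulary) for unit in som}
```

Here `placement` maps each document to a map unit and `labels` lists candidate keywords per unit. A hierarchical variant in the spirit of the Growing Hierarchical Self-Organizing Map would additionally grow the map and train child maps for units that represent their documents poorly; that step is omitted from this sketch.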
Cite this article
Rauber, A., Merkl, D. Text Mining in the SOMLib Digital Library System: The Representation of Topics and Genres. Applied Intelligence 18, 271–293 (2003). https://doi.org/10.1023/A:1023297920966
DOI: https://doi.org/10.1023/A:1023297920966