Abstract
This paper describes our work on developing a language-independent technique for discovering implicit knowledge from multilingual information sources. Text mining has been gaining popularity in the knowledge discovery field, particularly with the increasing availability of digital documents in various languages from all around the world. However, most current text mining tools focus on processing monolingual documents, particularly English ones; little attention has been paid to applying these techniques to documents in Asian languages, or to extending the mining algorithms to support multilingual information sources. In this work, we attempt to develop a language-neutral method to tackle the linguistic difficulties in the text mining process. Using a variation of automatic clustering techniques based on a neural network approach, namely Self-Organizing Maps (SOM), we conducted several experiments to uncover associated documents in a Chinese corpus, Chinese-English bilingual parallel corpora, and a hybrid Chinese-English corpus. The experiments show some interesting results and suggest a couple of potential paths for future work in the field of multilingual information discovery. In addition, this work is expected to act as a starting point for exploring the impact of a machine-learning approach on linguistic issues in mining sensible linguistic elements from multilingual text collections.
Cite this article
Lee, CH., Yang, HC. A Multilingual Text Mining Approach Based on Self-Organizing Maps. Applied Intelligence 18, 295–310 (2003). https://doi.org/10.1023/A:1023250105036
DOI: https://doi.org/10.1023/A:1023250105036