A Toolkit for Development of the Domain-Oriented Dictionaries for Structuring Document Flows
An approach to thematic document classification, clustering, and the investigation of document flows and collections based on domain-oriented dictionaries (DODs) is considered. It is simple enough to be used by, say, a secretary who frequently needs to classify and search large numbers of documents. For good results, however, the approach requires a solid technology for constructing and maintaining the DODs; this task is to be performed by experts or advanced users. A DOD represents a specific subject topic and is constructed by analyzing a collection of documents that represent this topic and have been selected by a group of experts. The toolkit facilitates the development of a hierarchical system of DODs by applying a set of heuristic criteria for selecting keywords from such a document collection representing one subject domain. The paper illustrates, with examples, the use of DODs developed with the toolkit for information retrieval.
Keywords: Gini Index, Document Image, Text Corpus, Moscow City, General Lexicon
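The abstract does not spell out the heuristic criteria used for keyword selection. As a rough illustration only, the sketch below assumes two plausible criteria: a term must occur in a minimum number of documents of the domain collection, and its relative frequency there must exceed its frequency in a general lexicon by a given factor. The function name `select_keywords`, the parameters `min_docs`, `min_ratio`, and `floor`, and the format of `general_freq` are all hypothetical and are not taken from the toolkit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_keywords(domain_docs, general_freq, min_docs=2, min_ratio=3.0, floor=1e-6):
    """Pick candidate DOD keywords from one domain collection.

    Assumed criteria (illustrative, not the toolkit's actual ones):
      * the term occurs in at least `min_docs` documents of the collection;
      * its relative frequency in the collection is at least `min_ratio`
        times its relative frequency in the general lexicon.
    `general_freq` maps a term to its relative frequency in general text;
    unseen terms are treated as having frequency `floor`.
    """
    tokens_per_doc = [tokenize(doc) for doc in domain_docs]
    total = sum(len(tokens) for tokens in tokens_per_doc)
    if total == 0:
        return []

    # Occurrences of each term across the whole collection.
    term_count = Counter(tok for tokens in tokens_per_doc for tok in tokens)
    # Number of documents in which each term appears at least once.
    doc_count = Counter(tok for tokens in tokens_per_doc for tok in set(tokens))

    keywords = []
    for term, n in term_count.items():
        relative = n / total
        ratio = relative / max(general_freq.get(term, 0.0), floor)
        if doc_count[term] >= min_docs and ratio >= min_ratio:
            keywords.append(term)
    return sorted(keywords)
```

With a small collection about city transport and a general lexicon in which function words are frequent, the function words are filtered out by the ratio test and single-document terms by the document-count test, leaving only terms that are both widespread in the collection and unusual in general text.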