Information-Theoretic Method for Classification of Texts
We consider a method for automatic text classification (i.e., classification without human intervention) based on universal source coding (“data compression”). We show that, under certain restrictions, the proposed method is consistent, i.e., the classification error tends to zero as the text length increases. As an example of practical use, we consider the classification of scientific texts (research papers, books, etc.). Experiments show the proposed method to be highly efficient.
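The general idea of compression-based classification can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes zlib as a stand-in for a universal code, and the names `compressed_len`, `extra_bytes`, and `classify`, as well as the toy corpora, are hypothetical. A text is assigned to the class whose sample corpus lets the compressor encode the text with the fewest additional bytes.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: str, text: str) -> int:
    """Additional compressed bytes needed to encode `text` after `corpus`.

    A small value suggests that `text` resembles `corpus` statistically.
    Note: zlib only looks back 32 KiB, so this sketch suits short corpora.
    """
    c = corpus.encode("utf-8")
    t = text.encode("utf-8")
    return compressed_len(c + t) - compressed_len(c)


def classify(text: str, corpora: dict) -> str:
    """Return the label whose corpus yields the smallest extra compressed length."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], text))


# Toy sample corpora (illustrative only, not data from the paper).
corpora = {
    "physics": "the electron orbits the nucleus and the photon carries "
               "energy between quantum states " * 5,
    "biology": "the cell membrane surrounds the nucleus and the enzyme "
               "binds the protein in the ribosome " * 5,
}

label = classify("the photon carries energy between quantum states", corpora)
```

In practice the choice of compressor matters: a compressor with a larger context than zlib would be needed for book-length training samples.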