Abstract
This paper introduces a new method for clustering documents written in a language that has evolved over different historical periods, using the Italian language as an example. In the first phase, the text is transformed into a string of numerical codes taking four possible values, derived from the energy profile of each letter, which reflects the height of the letter and its position in the text line. Each code represents a gray level, so the text is encoded as a 1-D image. In the second phase, texture features are extracted from this image to create document feature vectors. A new clustering algorithm is then applied to the feature vectors to discriminate documents from different historical periods of the language. Experiments are performed on a database of Italian documents written in Italian Vulgar and modern Italian. The results demonstrate that the proposed method perfectly identifies the historical period of the language of the documents, outperforming other well-known clustering algorithms generally adopted for document categorization, as well as other state-of-the-art text-based language models.
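The first two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The letter-to-class table (`ASCENDERS`, `DESCENDERS`, `FULLS`), the rule that capital letters count as ascenders, and the choice of co-occurrence energy and entropy as texture features are all assumptions made here for illustration; the abstract only states that four codes are derived from letter height and position and that texture features are extracted. The clustering step is not covered.

```python
import math

# Four gray-level codes, one per typographic class (assumed labelling).
BASE, ASCENDER, DESCENDER, FULL = 0, 1, 2, 3

# Hypothetical letter classes for lowercase Latin letters; the paper's
# actual table may differ.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULLS = set("j")


def encode(text):
    """Map each letter to one of four codes, giving a 1-D gray-level image."""
    codes = []
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        if ch.isupper() or ch in ASCENDERS:
            codes.append(ASCENDER)  # assumption: capitals behave as ascenders
        elif ch in DESCENDERS:
            codes.append(DESCENDER)
        elif ch in FULLS:
            codes.append(FULL)
        else:
            codes.append(BASE)
    return codes


def cooccurrence(codes, levels=4):
    """Normalized co-occurrence matrix of horizontally adjacent codes."""
    counts = [[0] * levels for _ in range(levels)]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def texture_features(codes):
    """Two illustrative texture descriptors: energy and entropy."""
    flat = [v for row in cooccurrence(codes) for v in row]
    energy = sum(v * v for v in flat)
    entropy = -sum(v * math.log2(v) for v in flat if v > 0)
    return [energy, entropy]


if __name__ == "__main__":
    line = "Nel mezzo del cammin di nostra vita"
    print(encode(line))
    print(texture_features(encode(line)))
```

Each document would yield one such feature vector, and the vectors would then be passed to a clustering algorithm of choice.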
Notes
The database is freely available at https://sites.google.com/site/documentanalysis2015/italian-italian-vulgar-database.
Acknowledgments
The authors are very grateful to Ms. Zagorka Brodić, professor of French and Serbo-Croatian languages, for the helpful discussions about the Italian language, and to Ms. Janet Newell, a native-speaking professor of English, for her valuable editing support.
Ethics declarations
Conflict of interests
Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest. Author Zoran N. Milivojević declares that he has no conflict of interest.
Funding
This study was partially funded by a grant of the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of the project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.
Additional information
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Brodić, D., Amelio, A. & Milivojević, Z.N. Clustering documents in evolving languages by image texture analysis. Appl Intell 46, 916–933 (2017). https://doi.org/10.1007/s10489-016-0878-8
DOI: https://doi.org/10.1007/s10489-016-0878-8