Clustering documents in evolving languages by image texture analysis

Brodić, Darko; Amelio, Alessia; Milivojević, Zoran N.

doi:10.1007/s10489-016-0878-8

Published: 26 December 2016

Applied Intelligence, volume 46, pages 916–933 (2017)

Darko Brodić¹,
Alessia Amelio² &
Zoran N. Milivojević³

Abstract

This paper introduces a new method for clustering documents written in a language that has evolved across different historical periods, using Italian as an example. In the first phase, the text is transformed into a string of four numerical codes derived from the energy profile of each letter, which defines the height of the letter and its location in the text line. Each code represents a gray level, so the text is encoded as a 1-D image. In the second phase, texture features are extracted from this image to create document feature vectors. A new clustering algorithm is then applied to the feature vectors to discriminate documents from different historical periods of the language. Experiments are performed on a database of Italian documents written in Italian Vulgar and modern Italian. The results show that the proposed method perfectly identifies the historical periods of the language of the documents, outperforming other well-known clustering algorithms commonly adopted for document categorization as well as other state-of-the-art text-based language models.
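
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the letter classes (`ASCENDERS`, `DESCENDERS`, `FULL`), the assignment of codes 0 to 3, and the choice of a co-occurrence matrix with energy and contrast as texture features are all assumptions made for illustration, since the abstract specifies neither the exact mapping nor the feature set. The clustering step is omitted.

```python
# Hypothetical letter classes; the abstract does not give the paper's exact mapping.
ASCENDERS = set("bdfhklt")   # assumed: letters reaching above the middle zone
DESCENDERS = set("gpqy")     # assumed: letters reaching below the baseline
FULL = set("j")              # assumed: letters spanning all vertical zones


def letter_code(ch):
    """Return an assumed gray-level code in 0..3, or None for non-letters."""
    if not ch.isalpha():
        return None
    if ch.isupper() or ch in ASCENDERS:
        return 1
    if ch in DESCENDERS:
        return 2
    if ch in FULL:
        return 3
    return 0  # everything else is treated as a base letter (a simplification)


def encode(text):
    """Phase 1: turn text into a 1-D 'image' of gray-level codes."""
    return [c for c in map(letter_code, text) if c is not None]


def cooccurrence(codes, levels=4):
    """Normalized co-occurrence matrix of adjacent code pairs."""
    counts = [[0] * levels for _ in range(levels)]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[n / total for n in row] for row in counts]


def texture_features(p):
    """Phase 2 (illustrative): energy and contrast of the matrix."""
    energy = sum(v * v for row in p for v in row)
    contrast = sum((i - j) ** 2 * v
                   for i, row in enumerate(p)
                   for j, v in enumerate(row))
    return energy, contrast


if __name__ == "__main__":
    codes = encode("Nel mezzo del cammin di nostra vita")
    print(codes)
    print(texture_features(cooccurrence(codes)))
```

In this sketch each document would be reduced to a short feature vector, which is the kind of input a clustering algorithm would then group by historical period.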

Notes

The database is freely available at https://sites.google.com/site/documentanalysis2015/italian-italian-vulgar-database.

Acknowledgments

The authors are grateful to Ms. Zagorka Brodić, professor of French and Serbo-Croatian, for the helpful discussions about the Italian language, and to Ms. Janet Newell, a native-speaker professor of English, for her valuable editing support.

Author information

Authors and Affiliations

Technical Faculty in Bor, University of Belgrade, Vojske Jugoslavije 12, 19210, Bor, Serbia
Darko Brodić
DIMES University of Calabria, Via P. Bucci Cube 44, 87036, Rende (CS), Italy
Alessia Amelio
College of Applied Technical Sciences, Aleksandra Medvedeva 20, 18000, Niš, Serbia
Zoran N. Milivojević

Corresponding author

Correspondence to Darko Brodić.

Ethics declarations

Conflict of interests

Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest. Author Zoran N. Milivojević declares that he has no conflict of interest.

Funding

This study was partially funded by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.

Additional information

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Brodić, D., Amelio, A. & Milivojević, Z.N. Clustering documents in evolving languages by image texture analysis. Appl Intell 46, 916–933 (2017). https://doi.org/10.1007/s10489-016-0878-8

Published: 26 December 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10489-016-0878-8
