Clustering documents in evolving languages by image texture analysis

Applied Intelligence

Abstract

This paper introduces a new method for clustering documents written in a language that has evolved across different historical periods, using Italian as an example. In the first phase, the text is transformed into a string of numerical codes. Each letter is mapped to one of four codes derived from its energy profile, which captures the height of the letter and its position in the text line. Each code represents a gray level, so the text is encoded as a 1-D image. In the second phase, texture features are extracted from this image to create document feature vectors. A new clustering algorithm is then applied to the feature vectors to discriminate documents from different historical periods of the language. Experiments are performed on a database of Italian documents written in Italian Vulgar and modern Italian. The results show that the proposed method perfectly identifies the historical period of each document's language, outperforming well-known clustering algorithms commonly adopted for document categorization as well as state-of-the-art text-based language models.
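
The first two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The letter classes (`ASCENDERS`, `DESCENDERS`, `FULL_LETTERS`), the treatment of capitals, and the choice of adjacent-pair co-occurrence statistics as texture features are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter

# Hypothetical typographic classes for lowercase Latin letters (an assumption,
# not the paper's table). Ascenders rise above the x-height, descenders drop
# below the baseline, "full" letters do both, and the rest are base letters.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULL_LETTERS = set("j")

BASE, ASCENDER, DESCENDER, FULL = 0, 1, 2, 3  # four codes = four gray levels


def encode(text):
    """Map each letter to one of four codes; non-letters are skipped."""
    codes = []
    for ch in text:
        if not ch.isalpha():
            continue
        if ch.isupper() or ch in ASCENDERS:
            codes.append(ASCENDER)  # assumption: capitals count as tall letters
        elif ch in DESCENDERS:
            codes.append(DESCENDER)
        elif ch in FULL_LETTERS:
            codes.append(FULL)
        else:
            codes.append(BASE)
    return codes


def cooccurrence_features(codes):
    """Texture statistics from adjacent code pairs of the 1-D image."""
    pairs = Counter(zip(codes, codes[1:]))
    total = sum(pairs.values())
    features = {"energy": 0.0, "contrast": 0.0, "homogeneity": 0.0}
    if total == 0:
        return features
    for (i, j), count in pairs.items():
        p = count / total  # normalized co-occurrence probability
        features["energy"] += p * p
        features["contrast"] += (i - j) ** 2 * p
        features["homogeneity"] += p / (1 + abs(i - j))
    return features


if __name__ == "__main__":
    image = encode("Nel mezzo del cammin di nostra vita")
    print(image)
    print(cooccurrence_features(image))
```

Under these assumptions, each document yields a short feature vector that a clustering algorithm could then group by historical period. The clustering step itself is not sketched here because the abstract gives no details of the new algorithm.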

Notes

  1. The database is freely available at https://sites.google.com/site/documentanalysis2015/italian-italian-vulgar-database.

Acknowledgments

The authors are grateful to Ms. Zagorka Brodić, professor of French and Serbo-Croatian languages, for helpful discussions about the Italian language, and to Ms. Janet Newell, a native-speaking professor of English, for her valuable editing support.

Author information

Corresponding author

Correspondence to Darko Brodić.

Ethics declarations

Conflict of interest

Author Darko Brodić declares that he has no conflict of interest. Author Alessia Amelio declares that she has no conflict of interest. Author Zoran N. Milivojević declares that he has no conflict of interest.

Funding

This study was partially funded by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.

Additional information

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Brodić, D., Amelio, A. & Milivojević, Z.N. Clustering documents in evolving languages by image texture analysis. Appl Intell 46, 916–933 (2017). https://doi.org/10.1007/s10489-016-0878-8

  • DOI: https://doi.org/10.1007/s10489-016-0878-8
