Pattern Analysis and Applications, Volume 19, Issue 3, pp 647–663

Cross-document word matching for segmentation and retrieval of Ottoman divans

Theoretical Advances

Abstract

Motivated by the need to automatically index and analyze the huge number of documents in Ottoman divan poetry, and to discover new knowledge that helps preserve this heritage and keep it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Ottoman documents are difficult to segment into words without prior knowledge of the words. Our approach builds on two observations: divans exist in multiple copies (versions) produced by different writers in different writing styles, and word segmentation may be considerably easier in some versions than in others. Versions that are difficult, if not impossible, to segment with traditional techniques are therefore segmented using information carried over from a simpler version. One version of a document is used as the source dataset and another version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset in order to detect word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising for finding word boundaries and for retrieving words across documents.
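
The matching idea in the abstract (a query word from the source version is spotted in the target version by trying combinations of consecutive sub-words) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes, purely for illustration, that every sub-word is described by a short 1-D numeric profile and that candidates are compared with dynamic time warping; the names `dtw_distance`, `spot_word`, and `max_run`, and the toy profiles, are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences (absolute difference)."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def spot_word(query, target_subwords, max_run=4):
    """Find the run of consecutive target sub-words that best matches the query.

    query: 1-D profile of a word taken from the source version (assumed feature).
    target_subwords: list of 1-D profiles, one per sub-word in a target line.
    Returns (start, end, distance) with `end` exclusive; the run's two ends are
    the proposed word boundaries in the target line.
    """
    best = None
    for start in range(len(target_subwords)):
        last = min(start + max_run, len(target_subwords))
        for end in range(start + 1, last + 1):
            # Candidate word: concatenation of sub-words start .. end-1.
            candidate = [v for sw in target_subwords[start:end] for v in sw]
            # Normalise by length so longer runs are not penalised unfairly.
            dist = dtw_distance(query, candidate) / (len(query) + len(candidate))
            if best is None or dist < best[2]:
                best = (start, end, dist)
    return best


# Toy data: the query equals sub-words 1 and 2 of the target line joined together.
query = [1, 3, 5, 3, 1, 2, 4, 2]
target_subwords = [[5, 5, 5], [1, 3, 5, 3, 1], [2, 4, 2], [9, 9]]
print(spot_word(query, target_subwords))
```

On this toy line the best run covers sub-words 1 and 2, so the call prints `(1, 3, 0.0)`. In a real setting the profiles would come from image features, and the context of neighbouring words could be used to break ties between similar candidates.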

Keywords

Segmentation, Retrieval, Matching, Historical documents, Ottoman divans

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Pinar Duygulu (1)
  • Damla Arifoglu (2)
  • Mehmet Kalpakli (3)
  1. Department of Computer Engineering, Bilkent University, Ankara, Turkey
  2. Computer Engineering Department, Sabanci University, Istanbul, Turkey
  3. Department of History, Bilkent University, Ankara, Turkey
