Cross-document word matching for segmentation and retrieval of Ottoman divans
Abstract
Motivated by the need to automatically index and analyze the large number of documents in Ottoman divan poetry, and to discover new knowledge that helps preserve this heritage and keep it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Ottoman documents are difficult to segment into words without prior knowledge of the words. Divans exist in multiple copies (versions) produced by different writers in different writing styles, and word segmentation may be easier in some versions than in others. We exploit this by carrying information from a version that is simpler to segment over to versions that are difficult, if not impossible, to segment with traditional techniques. One version of a document serves as the source dataset and another version of the same document as the target dataset. Words are automatically extracted from the source dataset and used as queries to be spotted in the target dataset, which reveals the word boundaries there. We thus present cross-document word matching as an approach to the novel task of segmenting historical documents into words. The proposed matching scheme is based on possible combinations of sequences of sub-words, and we improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising both for finding word boundaries and for retrieving words across documents.
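The idea of grouping consecutive sub-words in the target so that each group matches a query word from the source can be illustrated with a small sketch. This is not the authors' implementation: the 1-D profile representation, the `distance` function, the `max_group` limit, and the dynamic-programming formulation are all assumptions made for illustration, and the abstract does not specify the actual features or matching procedure. The sketch assumes the source line has already been segmented into an ordered list of word profiles and the target line into an ordered list of sub-word profiles.

```python
from typing import List, Sequence, Tuple

Profile = List[float]  # hypothetical 1-D shape descriptor of a word or sub-word


def resample(profile: Sequence[float], n: int = 16) -> List[float]:
    """Nearest-neighbour resampling of a profile to n samples."""
    m = len(profile)
    return [profile[min(m - 1, (i * m) // n)] for i in range(n)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy dissimilarity: mean shape difference plus a relative length penalty."""
    ra, rb = resample(a), resample(b)
    shape = sum(abs(x - y) for x, y in zip(ra, rb)) / len(ra)
    length = abs(len(a) - len(b)) / max(len(a), len(b))
    return shape + length


def segment_line(
    queries: List[Profile], subwords: List[Profile], max_group: int = 4
) -> List[Tuple[int, int]]:
    """Split the target sub-words into len(queries) consecutive groups.

    Group k is chosen to match query k, so every query is scored in the
    context of its neighbours. Returns half-open (start, end) index ranges
    into `subwords`, one per query; the range ends are the word boundaries.
    """
    n_q, n_s = len(queries), len(subwords)
    inf = float("inf")
    # cost[k][j]: best total cost of covering the first j sub-words
    # with the first k query words.
    cost = [[inf] * (n_s + 1) for _ in range(n_q + 1)]
    back = [[0] * (n_s + 1) for _ in range(n_q + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_q + 1):
        for j in range(1, n_s + 1):
            # Try every group of 1..max_group sub-words ending at j.
            for g in range(1, min(max_group, j) + 1):
                i = j - g
                if cost[k - 1][i] == inf:
                    continue
                candidate = [v for sw in subwords[i:j] for v in sw]
                c = cost[k - 1][i] + distance(queries[k - 1], candidate)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i
    if cost[n_q][n_s] == inf:
        raise ValueError("no grouping of sub-words covers all query words")
    # Trace the chosen group boundaries back from the end of the line.
    bounds = []
    j = n_s
    for k in range(n_q, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Toy data: three source words, five target sub-words.
target_subwords = [[1, 1, 1], [5, 5], [2, 2, 2, 2], [9], [3, 3]]
source_words = [[1, 1, 1, 5, 5], [2, 2, 2, 2], [9, 3, 3]]
print(segment_line(source_words, target_subwords))  # -> [(0, 2), (2, 3), (3, 5)]
```

Because all queries of a line are matched jointly and in order, a poor match for one word can be offset by the constraint that its neighbours must also fit. This is one simple way to use context. Real profiles would come from image features, and the two versions would rarely align this cleanly.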
Keywords
Segmentation, Retrieval, Matching, Historical documents, Ottoman divans