WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models

Nikolaidou, Konstantina; Retsinas, George; Christlein, Vincent; Seuret, Mathias; Sfikas, Giorgos; Smith, Elisa Barney; Mokayed, Hamam; Liwicki, Marcus

doi:10.1007/978-3-031-41679-8_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14188)

Included in the following conference series: International Conference on Document Analysis and Recognition

Abstract

Text-to-Image synthesis is the task of generating an image according to a specific text description. Generative Adversarial Networks have been considered the standard method for image synthesis virtually since their introduction. Denoising Diffusion Probabilistic Models have recently set a new baseline, with remarkable results in Text-to-Image synthesis, among other fields. Aside from its usefulness per se, Text-to-Image synthesis can also be particularly relevant as a tool for data augmentation to aid the training of models for other document image processing tasks. In this work, we present a latent diffusion-based method for styled text-to-text-content-image generation at the word level. Our proposed method is able to generate realistic word image samples from different writer styles by using class index styles and text content prompts, without the need for adversarial training, writer recognition, or text recognition. We gauge system performance with the Fréchet Inception Distance, writer recognition accuracy, and writer retrieval. We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve a writer retrieval score similar to that of real data. Code is available at: https://github.com/koninik/WordStylist.
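The abstract names the Fréchet Inception Distance as one of the evaluation measures. As a rough illustration only, the following sketch computes the closed-form Fréchet distance between two Gaussians fitted to feature vectors. It is not the authors' evaluation code: the function names `feature_statistics` and `frechet_distance` are hypothetical, the feature extractor (an Inception network in the usual FID setup) is not shown, and the random arrays stand in for real and generated image features.

```python
import numpy as np


def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, n_dims) feature array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # For positive semi-definite covariances, Tr((sigma1 sigma2)^(1/2)) equals
    # the sum of the square roots of the eigenvalues of sigma1 @ sigma2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Clip small negative values caused by floating-point noise.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features; in practice these would come from a feature extractor.
    real_features = rng.normal(0.0, 1.0, size=(500, 8))
    fake_features = rng.normal(0.5, 1.0, size=(500, 8))
    mu_r, sigma_r = feature_statistics(real_features)
    mu_f, sigma_f = feature_statistics(fake_features)
    print(frechet_distance(mu_r, sigma_r, mu_f, sigma_f))
```

A lower value indicates that the two feature distributions are closer. The eigenvalue route is used here only to avoid a dependency beyond NumPy; a matrix square root routine would serve the same purpose.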

Notes

1. https://huggingface.co/CompVis/stable-diffusion

Author information

Authors and Affiliations

Luleå University of Technology, Luleå, Sweden
Konstantina Nikolaidou, Elisa Barney Smith, Hamam Mokayed & Marcus Liwicki
National Technical University of Athens, Athens, Greece
George Retsinas
Friedrich-Alexander-Universität, Erlangen, Germany
Vincent Christlein & Mathias Seuret
University of West Attica, Egaleo, Greece
Giorgos Sfikas
University of Ioannina, Ioannina, Greece
Giorgos Sfikas

Corresponding author

Correspondence to Konstantina Nikolaidou.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Nikolaidou, K. et al. (2023). WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_22

DOI: https://doi.org/10.1007/978-3-031-41679-8_22
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8
eBook Packages: Computer Science, Computer Science (R0)
