1 Introduction

Handwriting is a multi-sensory skill that plays a vital role in everyday life. It is sometimes called brain writing, since it reflects both the disposition (mood, nature, and personality) and the complex motor skills (related to age, gender, and nationality) of an individual [1, 2]. Each individual generally produces a unique handwriting pattern. This uniqueness has encouraged researchers to use handwriting as a behavioral biometric for several tasks, including forensics, security, and disease prediction. Handwriting analysis, or graphology, can help identify, assess, and interpret individuals’ behavior from their handwritten samples. It draws on characteristics related to personality and emotional state in applications such as writer verification/identification, handwriting recognition, signature verification, age and gender classification/detection, and disease prediction. Automating handwriting analysis can improve the performance of these applications. Thus, automated handwriting analysis has been an active research area over the last few decades, with the aim of understanding and identifying an individual’s personality automatically.

Identifying an individual from handwriting is a form of behavioral biometric identification that is well acknowledged by psychologists, neurologists, paleographers, forensic analysts, document analysts, and computer science researchers. Similarly, psychologists and neurologists have confirmed a relationship between handwriting and demographic attributes of writers, such as gender, handedness, and age [2,3,4,5,6]. For instance, it has been reported that female handwriting is generally neat, delicate, consistent, regular, homogeneous, attractive, and decorative, whereas male handwriting tends to be hurried, untidy, scruffy, and spiky [2, 6]. Even so, detecting the gender of the writer from a handwritten document has been a challenging problem for the document analysis community [2, 6].

From the handwriting analysis perspective, gender detection lies between handwriting recognition and writer identification. In handwriting recognition, the variability of handwriting should be suppressed so that features common to all writers emerge, enabling more accurate recognition of handwritten documents. In writer identification, by contrast, the variability should be emphasized to increase the differences among the handwriting of different writers and thereby achieve more accurate identification results. In gender detection, however, we look for features common to a particular group of writers (male or female) [7]. Gender detection is a binary classification problem, while age detection is generally treated as a multi-class classification problem. In general, to address the problems of gender and age detection, signal and image processing followed by pattern classification techniques are applied to handwriting signals or images. The methods in the literature are grouped into online and offline categories according to how the data are acquired. Handwriting samples written on paper with an ordinary pen or pencil and then converted into digital documents using a digital camera or scanner are categorized as offline samples. Online handwriting samples, in contrast, are captured digitally using a tablet (such as an iPad) and a digital pen [8, 9].

Different types of features may be extracted from handwriting depending on whether it is acquired online or offline. For instance, in online age and gender detection, features such as time-in-air, time-on-surface, velocity, acceleration, pen pressure, x–y coordinates of the pen position, writing trajectory, and stroke order carry important information about the age and gender of the writer. In the offline category, by contrast, this dynamic information is not stored and cannot be extracted from document images. Features extracted from online or offline handwriting samples can be divided into macro and micro features [10]. Macro features describe the overall pictorial characteristics of a handwriting sample (size, slant, shape, and spacing between words and characters), whereas micro features describe the attributes of individual characters/components, for example, the geometry or shape of individual characters. Macro features are used to compute a similarity or distance measure between two documents, whereas micro features are used to compute similarity and correlation measures between characters [11].
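As a rough illustration of how such dynamic features might be derived, the sketch below computes time-in-air, time-on-surface, mean speed, and mean absolute acceleration from a list of pen samples. The sample layout `(t_ms, x, y, pressure)`, the rule that zero pressure means the pen is lifted, and the function name are assumptions made for this example; they are not taken from any of the reviewed systems.

```python
from math import hypot


def kinematic_features(samples):
    """samples: list of (t_ms, x, y, pressure); pressure 0 is assumed to mean pen lifted."""
    time_in_air = 0
    time_on_surface = 0
    speeds = []  # (timestamp, speed) for each pair of consecutive samples
    for (t0, x0, y0, p0), (t1, x1, y1, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        # Assumption: an interval counts as on-surface if the pen is down at its start.
        if p0 > 0:
            time_on_surface += dt
        else:
            time_in_air += dt
        speeds.append((t1, hypot(x1 - x0, y1 - y0) / dt))
    # Acceleration as a finite difference of consecutive speed estimates.
    accelerations = [
        (v1 - v0) / (t1 - t0) for (t0, v0), (t1, v1) in zip(speeds, speeds[1:])
    ]
    mean_speed = sum(v for _, v in speeds) / len(speeds) if speeds else 0.0
    mean_abs_acc = (
        sum(abs(a) for a in accelerations) / len(accelerations)
        if accelerations
        else 0.0
    )
    return {
        "time_in_air": time_in_air,
        "time_on_surface": time_on_surface,
        "mean_speed": mean_speed,
        "mean_abs_acceleration": mean_abs_acc,
    }


# Toy trace: two strokes separated by one pen lift, timestamps in milliseconds.
samples = [(0, 0, 0, 1), (100, 3, 4, 1), (200, 3, 4, 0), (300, 6, 8, 1)]
features = kinematic_features(samples)
```

None of these quantities can be computed from a scanned page, because the timestamps and pressure values are never recorded in the offline setting.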

In addition, features used for characterizing handwriting can be categorized into two groups, conventional handcrafted and deep learning-based features, based on the types of methodologies used in the literature. Examples of handcrafted features are textural and structural features that can be extracted from the word, line, and whole documents. Deep learning features are those extracted from different layers of, for example, a convolutional neural network (CNN) based model. It is worth noting that a limited number of handwriting samples can be collected from writers, resulting in small benchmark databases, making extracting deep learning features challenging.

Considering the dependency of age and gender classification approaches on scripts (the language of writing), two categories of methods, script-dependent and independent, were proposed in the literature. Script-dependent refers to selecting training and testing samples from the same scripts. In contrast, when training and testing samples are selected from different scripts, for example, training samples are English manuscripts, and testing samples are Arabic manuscripts, the approach is called script-independent.

Various research has been conducted on age and gender detection using handwriting. A comprehensive survey for age and gender detection in recent years can provide readers with a summary of research work and the development in this research area. Thus, this article reviews recent papers about age and gender classification/detection in the literature. We believe this study would be helpful for novice researchers in the handwriting analysis field, in general, and in gender and age detection techniques, in particular, and provide researchers with insights and new research directions. Overall, this article contains the following contributions and tries to address the following points:

  • discussion of the trend of age and gender classification/detection systems in the last decade using a text mining technique;

  • comparison of the results obtained from traditional and deep learning techniques and providing readers with information about the merits and demerits of methods and identifying methods that obtained the highest accuracies for age and gender detection so far;

  • identification of the feature extraction and classification methods that performed better on the benchmark databases for gender and age detection tasks;

  • a detailed study of commonly used databases in the literature for age and gender detection, their scripts, the number of writers, and the number of samples.

The rest of this article is organized as follows. Section 2 provides the methodology of the research. An overview of a general age and gender detection framework is discussed in Sect. 3. The related works in this domain are discussed in Sect. 4. Metrics and databases used for experimentation and comparison analysis are presented in Sect. 5. Section 6 provides discussion and remarks. Finally, the conclusion and future directions are given in Sect. 7.

2 Methodology

The methodology applied to collect, filter, and finally choose relevant papers for review is depicted in Fig. 1. As shown in Fig. 1, we used different databases, including Google Scholar, Web of Science, ProQuest, and EBSCO, to download relevant peer-reviewed papers. The databases were selected for their strengths, such as credibility, reliability, availability of peer-reviewed content, and advanced search functions. Several keywords, such as “gender detection using handwriting”, “gender classification using handwriting”, “gender identification using handwriting”, “age classification using handwriting”, and “age identification”, were used to search the databases. As a result, 103 papers were downloaded.

Fig. 1 The methodology used for the collection and selection of the papers

Zotero, a reference management tool, was used to store, organize, and share the downloaded papers. After checking for duplicates, 35 papers were eliminated. Since our research focused on studies published from 2012 to 2023, 12 more papers were eliminated. In the subsequent filtering step, 14 more papers were eliminated as they were not published in high-ranked journals or conferences. The remaining 42 papers were considered for this study. Considering the publication dates, we noted that the number of publications on gender detection was more consistent than that on age classification. For both age and gender detection, most of the papers were published between 2014 and 2017. There was a gap in research on age classification using handwriting between 2016 and 2021, followed by a sudden rise in 2022. Although the number of publications on gender detection fluctuated from 2012 to 2023, there was no significant gap in this area. This could be because of the various applications of gender detection and the availability of databases and metadata in the literature.
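The selection counts reported above can be verified with a few lines of arithmetic. The snippet is only a sanity check on the stated numbers; the variable names are chosen for this example.

```python
downloaded = 103
removed = {
    "duplicates": 35,
    "published outside 2012-2023": 12,
    "not in high-ranked journals or conferences": 14,
}

remaining = downloaded
for reason, count in removed.items():
    remaining -= count
    print(f"after removing {count} papers ({reason}): {remaining} left")
```

The final value of `remaining` matches the 42 papers retained for the review.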

3 Overview of general age and gender detection

Developing an automated handwriting analysis system to detect a gender or age category from handwriting samples involves two stages: developing and training a model, and then testing the trained model. A block diagram of a general system for age and gender classification/detection is shown in Fig. 2. As shown in Fig. 2, similar pre-processing and feature extraction methods are employed in both the training and testing stages. Pre-processing techniques are applied as a preliminary step to improve handwriting quality. These techniques, such as binarization, segmentation, and normalization, may be applied in different sequences. For instance, the pre-processing step may segment a handwritten document into lines, words, characters, or patches.

Fig. 2 Block diagram of a general framework for age and gender classification/detection

Feature extraction comes right after the pre-processing step to compute a set of informative features and to support training in the classifier step. Features may be extracted from different levels of a sample, such as a pixel, patch, or whole sample. In addition, features may be extracted using handcrafted or deep learning methods. Various classifiers, such as Support Vector Machine (SVM), logistic regression, K-nearest neighbor (K-NN), decision tree (DT), random forest (RF), artificial neural network (ANN), and deep neural network, may be trained for gender/age detection or classification.

In the testing stage, similar to the training, the pre-processing and feature extraction techniques are applied to a new sample. The extracted features are passed to the trained classifier to carry out the age and gender classification/detection task.
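The two-stage framework can be sketched in a few lines. The example below is a deliberately small stand-in, not any published system: it assumes grayscale images stored as lists of rows, uses two toy features (ink density and the share of ink in the top half of the image), and uses a 1-nearest-neighbour classifier. All function names, the threshold of 128, and the class labels are hypothetical.

```python
def binarize(image, threshold=128):
    """Pre-processing: map grayscale pixels (0-255) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def extract_features(binary):
    """Feature extraction: overall ink density and share of ink in the top half."""
    total = sum(map(sum, binary))
    area = len(binary) * len(binary[0])
    top = sum(map(sum, binary[: len(binary) // 2]))
    return [total / area, top / total if total else 0.0]


def featurize(image):
    # The same pre-processing and feature extraction serve training and testing.
    return extract_features(binarize(image))


def train(images, labels):
    """Training a 1-NN classifier amounts to storing labelled feature vectors."""
    return [(featurize(img), label) for img, label in zip(images, labels)]


def predict(model, image):
    """Testing: featurize the new sample and return the label of its nearest neighbour."""
    feats = featurize(image)

    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(entry[0], feats))

    return min(model, key=distance)[1]


# Tiny 2x2 "documents" with hypothetical class labels.
dense = [[0, 0], [0, 255]]
sparse = [[255, 255], [255, 0]]
model = train([dense, sparse], ["class_a", "class_b"])
prediction = predict(model, [[0, 0], [255, 255]])
```

Routing both stages through the single `featurize` function mirrors the block diagram: any change to pre-processing or feature extraction automatically applies to training and testing alike.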

4 Related works

Demographic attributes of writers, such as age or gender, were considered in several papers in the literature for handwriting analysis [8, 12, 13]. Some papers further considered a combination of demographic attributes of writers for analysis, for example, gender and handedness of writers, to improve the performance of automated handwriting systems in the literature [14, 15]. Various approaches employed in the literature for age and gender classification/detection through handwriting are reviewed in this section. Figure 3 illustrates a categorization of different approaches in age and gender classification/detection research, including traditional and deep learning. Each category is further divided into offline and online methods based on data acquisition types. It is noted that there is no significant amount of research on age classification available in the literature compared to gender detection. The studies that characterized only the writer’s age are summarized in the age classification sub-section. The studies that characterized the writer’s gender and a combination of the writer’s gender and age as demographic attributes are discussed in the gender detection sub-section.

Fig. 3 Categorization of age and gender classification/detection in the literature

4.1 Age classification

As mentioned earlier, only a few research studies focused on the age classification problem [12, 16,17,18]. The age classification methods in the literature can be grouped into offline [12, 16], and online [17, 18], considering the input data used for analysis.

Traditional and deep learning methods were considered in the offline age classification approaches. Disconnectedness features extracted from Canny and Sobel edge images, together with k-means clustering, were used for a four-class age classification problem in [12]. The proposed method was applied to a database, and an accuracy of 66.25% was obtained. The IAM and KHATT databases were also used to evaluate the proposed method, and age classification accuracies of 63.6% and 64.4% were obtained, respectively. Considering the deep learning approach for age detection, ResNet and GoogleNet were used for transfer and feature learning, and an SVM classifier was then used for age detection [16]. The proposed method was tested on the FSHS database. With ResNet and GoogleNet features, correct age detection rates of 69.7% and 61.1% were obtained, respectively [16]. The same authors further considered different features with SVM and NN classifiers for age detection in two age categories, youth-adult and mature-adult, and accuracies of 71% and 63.5% were obtained, respectively [19]. However, the main issue with deep learning based methods, for example, ResNet, is that as a network becomes deeper, it becomes more challenging to train effectively due to issues such as vanishing or exploding gradients. In addition, the performance of deep learning based methods relies heavily on the quality and diversity of the training dataset, and collecting high-quality data for age detection is difficult due to privacy and human factors.

It is worth noting that the SVM classifier was originally developed for binary classification and regression. It was later extended to multi-class problems and has shown its superiority in various applications. The basic idea behind SVM is to find the best possible decision boundary, called a hyperplane, that separates different classes of data points. The SVM algorithm tries to maximize the margin, which is the distance between the hyperplane and the nearest data points of each class. The intuition is that a larger margin indicates a better generalization ability of the classifier.
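The notion of margin can be made concrete with a small calculation. The sketch below does not train an SVM; it only measures the geometric margin of two candidate hyperplanes on a toy, linearly separable dataset, to show which one a margin-maximizing classifier would prefer. The data points, the weight vectors, and the function name are made up for illustration.

```python
from math import sqrt


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points (labels are +1 or -1)."""
    norm = sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


points = [(0, 0), (0, 1), (2, 0), (2, 1), (3, 0)]
labels = [-1, -1, 1, 1, 1]

# Two hyperplanes that both separate the classes: x = 1 and x = 0.5.
margin_centered = geometric_margin((1, 0), -1.0, points, labels)
margin_shifted = geometric_margin((1, 0), -0.5, points, labels)
```

Both hyperplanes classify every point correctly, but the centered one leaves a margin of 1.0 against 0.5 for the shifted one, so it is the boundary the maximum-margin criterion selects.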

Age characterization from online handwriting was examined using supervised and unsupervised learning approaches in [18]. The authors used a two-level clustering approach to characterize online handwriting style. In the first level, raw spatial-dynamic information was extracted, whereas, in the second level, word-level style variability was extracted and converted into a Bag of Prototype Words (BPW). At the first level, Linear Discriminant Analysis (LDA) identified the features that distinguish different age groups. For instance, teenagers’ handwriting showed the highest stability compared to elders’ handwriting, although there were no significant differences in the middle-aged group. K-means clustering was considered for age detection [18]. However, the LDA-based approach is unimodal and may not be applicable in different scenarios. The Norm Discriminant Eigenspace Transform or Moments discriminant analysis may be utilized to address this issue. By comparing elders’ handwriting with middle-aged handwriting, the same authors further discussed several patterns, such as long time in air, slow velocity and acceleration, the highest number of very small strokes, and less fluent handwriting [8]. The sequential forward floating selection (SFFS) method was also considered for distinguishing children’s handwriting from adults’ handwriting [17]. Online handwriting text and patterns were collected from several children and adults. Different features, such as pen pressure, time of writing, and pen angle, were extracted from handwriting samples and then selected using the SFFS method. SVM and RF methods were applied to classify the samples, and age detection accuracies of 87.4% and 91.5% were obtained from the handwritten text, respectively.
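To give a feel for clustering-based style descriptors, the following sketch runs plain K-means (Lloyd's algorithm) on toy feature vectors and then describes a writer by the normalised histogram of nearest prototypes, which is the general idea behind bag-of-prototypes representations. It is a generic illustration and is not claimed to reproduce the two-level approach of [18]; the fixed initial centroids, the toy data, and all function names are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm with caller-supplied initial centroids (keeps the example deterministic)."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for old, cluster in zip(centroids, clusters)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids


def prototype_histogram(points, centroids):
    """Describe a set of vectors by how often each prototype is the nearest one."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    total = len(points)
    return [c / total for c in counts] if total else [0.0] * len(centroids)


# Toy 2-D word descriptors (for example, mean speed and mean pressure).
words = [(1.0, 1.0), (2.0, 1.0), (9.0, 8.0), (10.0, 8.0)]
prototypes = kmeans(words, [(0.0, 0.0), (5.0, 5.0)])
writer_profile = prototype_histogram([(1.0, 1.0), (9.0, 8.0), (10.0, 8.0)], prototypes)
```

The resulting fixed-length histogram can be passed to any classifier, regardless of how many words a writer produced.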

A summary of age detection methods in the literature is provided in Table 1. From Table 1, it can be noted that online methods performed better than offline methods in age classification. When looking at the results presented in Table 1, it is important to note that the direct comparison of the results obtained from various methods applied on different databases is not fair; however, the implication and significance of the features and signals used in different handwriting analysis methods can be considered as an important finding.

Table 1 Overview of the recent research on age classification using traditional and deep learning methods

4.2 Gender detection

The proposed gender detection methods in the literature are designed based on traditional and deep learning approaches to deal with offline and online handwriting samples. Details of traditional and deep learning approaches and the analysis of keywords used in the gender detection literature are discussed in the subsequent subsections.

4.2.1 Traditional methods

In most of the conducted research in the gender detection area, traditional methods were applied to offline document images [6, 13, 20]. Traditional methods are generally designed based on handcrafted features extracted from handwriting images, followed by the classification using one or a set of classifiers. Various feature extraction methods based on codebook, geometric, texture, or a combination of both geometric and texture were investigated in the gender detection literature [6, 13, 20]. The geometrical features were used to measure the orientations, curvatures, roundness, slants and strokes of the characters. Texture features, however, were used to extract textural characteristics from words, patches, or whole images. These characteristics were used to generate a histogram of features.

A summary of the traditional approaches in the literature that are ordered based on the databases used for evaluation is presented in Table 2. From Table 2, it can be noted that most of the available databases, metadata, and methods in the literature were developed for offline approaches [6, 13, 20, 21]. Each method presented in Table 2 is detailed in the following.

Table 2 Overview of the recent research on gender detection using traditional methods

The authors of [22] used geometrical and transformed features for gender classification, with kernel mutual information (mutual information estimated using a kernel function) as the feature selection technique. The classification was carried out using SVM on the Chinese Registration Document Form (RDF) database and the ICDAR 2013 database consisting of English and Arabic document images. Accuracies of 66.7% and 66.3% were obtained from their proposed method on the RDF and ICDAR 2013 databases, respectively. The same authors further discussed the minimal-Redundancy-Maximal-Relevance (mRMR) method, based on mutual information, for selecting features for the gender identification task [23]. Geometrical features, such as orientation, roundness, and slant, and transformed features, including Gabor and Fourier features, were also considered in this study. Different shape descriptors, such as the tangent angle function, curvature function, and Fourier descriptors, were also considered for gender detection on BHDH, a database consisting of 3766 Bosnian document images [24]. Notably, the mRMR captures linear relationships between variables and cannot capture complex nonlinear relationships between features. Unlike mRMR, Recursive Feature Elimination (RFE) concentrates on the relevance of features to the target variable. It assigns weights or ranks to features based on their importance in predicting the target variable and can be used for feature transformation.
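The idea of trading relevance against redundancy can be sketched as a greedy loop. The example below uses the widely described "relevance minus mean redundancy" form of mRMR on already discretised features; it is an illustrative variant and is not claimed to match the exact formulation or estimator used in [22, 23]. The feature names and toy values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def mrmr(features, target, k):
    """Greedy mRMR: features maps a name to a list of discrete values."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "slant": [0, 0, 0, 1, 1, 1, 1, 1],
    "slant_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # fully redundant with "slant"
    "roundness": [0, 1, 0, 1, 0, 1, 1, 1],
}
chosen = mrmr(features, target, 2)
```

In this toy setting the duplicated feature is as relevant as the original, yet it is skipped in the second round because it adds no new information, and the weaker but less redundant feature is chosen instead.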

In [25], curvatures, directions, tortuosity, and chain codes were further considered as a set of geometric features to characterize age, gender, and nationality in handwritten document images. Kernel discriminant and random forest classifiers were applied to perform age, gender, and nationality prediction. The proposed systems were evaluated using the QUWI database, and the chain code feature, with an accuracy of 74.05%, provided better gender prediction results than the other features. To address the problem of using only local features for document characterization, Siddiqi et al. [26] proposed local and global features, including curvature, slant, and legibility, for gender prediction. They then used SVM and ANN classifiers to detect gender from handwritten documents. They finally evaluated their proposed method using the QUWI and MSHD databases, and correct gender detection rates of 68.75% and 73.02% were obtained from the QUWI and MSHD databases, respectively [26]. Ibrahim et al. also used local and global features (gradient and wavelet domain local binary patterns (WD-LBP)) along with an SVM classifier for gender identification [27, 28]. The experiments were conducted using the ICDAR 2013 dataset, and the highest accuracy of 94.7% was obtained with local gradient features [27, 28]. It should, however, be noted that geometric and local features are often sensitive to noise, and the resolution of the scanned document images plays a significant role in the final prediction results. Moreover, SVM's performance can be affected by the choice of the kernel and its parameters. In addition, although SVM can handle high-dimensional feature spaces effectively, it can be computationally expensive, especially for large datasets.
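As an illustration of one of the geometric descriptors mentioned above, the sketch below converts a traced contour into an 8-direction Freeman chain code and summarises it as a direction histogram. The direction numbering (counter-clockwise from east, with y increasing upwards) and the function names are conventions chosen for this example, and the code is not taken from [25].

```python
# Freeman 8-connected directions: step (dx, dy) -> code, counter-clockwise from east.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(contour):
    """contour: list of (x, y) pixels where consecutive pixels are 8-connected neighbours."""
    codes = []
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(x0, y0)} and {(x1, y1)} are not neighbours")
        codes.append(DIRECTIONS[step])
    return codes


def chain_code_histogram(codes):
    """Normalised frequency of each of the 8 directions."""
    counts = [0] * 8
    for code in codes:
        counts[code] += 1
    total = len(codes)
    return [c / total for c in counts] if total else [0.0] * 8


contour = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]
codes = chain_code(contour)
histogram = chain_code_histogram(codes)
```

The histogram gives a fixed-length description of stroke directions that can be fed to a classifier regardless of contour length.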

In addition, open-end-point and curve-fitting based methods were considered for feature extraction in the gender detection literature [20]. Several classification techniques, such as SVM, K-NN, RF, hybridization, and multi-layered perceptron, were further used to detect genders from handwritten documents. The proposed method was evaluated using Gurumukhi (Punjabi) handwriting, and the highest accuracy of 90.57% was obtained with curve fitting-based features and a hybrid classifier [20].

The curvelet transformation and One-class SVMs were further proposed for gender detection [29]. The proposed system was evaluated using the IAM database, and the highest accuracy of 62.49% was obtained. Wavelet transforms using symbolic dynamic filtering were also considered to extract features from each level of document images using a probabilistic finite state automata (PFSA) [21]. The classification was conducted using ANN and SVM classifiers. Various experiments, such as text-dependent, text-independent, script-dependent, and script-independent, using QUWI and MSHD databases, were conducted to evaluate the proposed method. The best gender detection accuracies of 77.70% and 77.60% using script-dependent scenarios with an SVM classifier were obtained from QUWI and MSHD databases, respectively [21]. It is, however, important to note that although wavelets offer a multi-resolution analysis, wavelet transforms are inherently not scale-invariant, and system performances may be affected by changes in the scale or size of images if the original wavelets are used for handwritten document characterization.

Various features, such as histograms of oriented gradients (HOG), local binary patterns (LBP), and gray-level co-occurrence matrices (GLCM), and ensembles of classification techniques based on SVM, ANN, DT, and RFs were also examined to improve gender classification accuracy in the literature [30]. The authors conducted several experiments on the QUWI database and obtained better results, up to 85% accuracy, using ensemble classifiers. Similarly, in [14], features from handwriting images were extracted using HOG and gradient local binary patterns (GLBP). The SVM classifier was employed for detecting gender from document images. The proposed system was evaluated using the KHATT and IAM databases, and an accuracy of around 70% was reported while the databases were combined for language-independent tests. In [31], numerous local features, such as pixel distribution, pixel density, LBP, and HOG, were considered for the gender prediction task. The prediction was conducted using the SVM classifier. Two different training sets using the IAM database were employed to evaluate and compare the proposed system. The HOG feature extraction with 70% and 74% correct gender prediction performed better in both sets compared to the other methods. The same authors further examined gender, age, and handedness detection tasks using HOG and pixel density features [32]. Considering the IAM database, an accuracy of 73.63% for gender detection and 73.21% for age detection using pixel density was reported. The LBP method was recently extended to AND local binary pattern (ALBP) and OR local binary pattern (OLBP) to extract two-level neighboring pixels from handwriting images in order to improve gender prediction tasks [33]. The highest accuracy was obtained using the ALBP feature extraction method and the SVM classifier [33].

In [34], Adaptive Multi-Gradient (AMG) features were also extracted using adaptive multi-gradient of Sobel kernels for the gender identification task. The text lines in each image were extracted by considering the dominant pixels. The correlation coefficient between lines and their consistency and inconsistency were computed to identify the gender. Finally, by finding converging and diverging criteria using the error of correlation between the first and successive text lines, the gender of the writer was identified [34]. It is worth mentioning that the performance of AMG may be sensitive to various parameters, such as the geometry of the domain. This method also requires tuning parameters to obtain optimal convergence rates and accuracy.

In addition to other texture features, the Gabor based feature extraction method was used for gender detection in the literature [13]. The mean and standard deviation values of the filtered images were computed and then the Fourier transform was employed to obtain a feature set. The classification was carried out using ANN. The QUWI database was considered for evaluating the proposed system. By employing the same protocol as the ICDAR 2015, the highest accuracy of 70% was reported [13]. Gender detection using textural information obtained based on oriented basic image features (oBIFs) was proposed in [6]. The SVM was applied on three subsets of the QUWI database using experimental protocols of the ICDAR 2013, ICDAR 2015, and ICFHR 2016, and accuracies of 71%, 76% and 68% were obtained, respectively.

The literature on gender detection shows that using only texture feature extraction methods, such as LBP, GLCM, HOG, and GLBP, cannot provide high detection accuracies. The extracted features using these methods are generally converted into histograms which decrease individuality. Moreover, sensitivity to image rotation, noise, resolution, and grayscale are other weaknesses of texture features.
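The loss of spatial detail that comes with histogram-based texture features can be seen in a basic local binary pattern computation. The sketch below implements the plain 3x3 LBP operator over interior pixels and returns a 256-bin normalised histogram; the bit ordering (clockwise from the top-left neighbour) and the function name are choices made for this example, and the extended variants discussed above (GLBP, ALBP, OLBP) are not covered.

```python
def lbp_histogram(image):
    """Basic 3x3 LBP over interior pixels of a grayscale image (list of rows)."""
    # Neighbour offsets (row, column), clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
corner = [[9, 0, 0], [0, 5, 0], [0, 0, 0]]
flat_hist = lbp_histogram(flat)
corner_hist = lbp_histogram(corner)
```

Because the histogram only counts how often each pattern occurs, two pages with the same patterns in different positions yield the same descriptor, which is one way the individuality of a sample is reduced.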

In addition to texture features, the significance of Cloud of Line Distribution (COLD) and Hinge features, along with an SVM classifier, was investigated in the gender detection literature [7]. The proposed system was then evaluated using the QUWI database, and the highest accuracy of 64.40% was reported [7]. Graphological features, such as height, pressure, and margin, followed by a fuzzy rule-based classification, were also proposed for gender detection [35]. A small number of samples (75 digital samples) were considered to evaluate the proposed method, and an accuracy of 76% correct gender detection was reported [35].

In [36], likelihood ratios and binomial logistic regression were considered for gender detection from handwriting. The chi-square test was employed for feature extraction. Applying the proposed method to a dataset written by 150 individuals resulted in correct classification rates of 80% for females and 76.4% for males [36]. Decision trees and data mining techniques in conjunction with the J48 and ID3 algorithms were proposed for identifying gender from handwriting [37]. Employing the decision tree with the J48 and ID3 algorithms on the database, accuracies of 70.83% and 93.75% were obtained, respectively.
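For readers unfamiliar with how ID3 chooses a split, the sketch below computes the information gain of a categorical attribute, which is the criterion ID3 maximises at each node. The attribute values and labels are invented for illustration and do not come from [37]; J48, the Weka implementation of C4.5, normally relies on the related gain ratio rather than raw information gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a categorical attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


labels = ["F", "F", "M", "M"]
slant = ["right", "right", "left", "left"]    # hypothetical attribute, separates the classes
margin = ["wide", "narrow", "wide", "narrow"]  # hypothetical attribute, uninformative

gain_slant = information_gain(slant, labels)
gain_margin = information_gain(margin, labels)
```

On this toy data ID3 would split on the first attribute, since it removes all uncertainty about the label, while the second attribute removes none.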

Gender prediction based on static and dynamic features and classification methods, such as KNN, SVM, and Naïve Bayes, was also proposed in the literature [38]. The SVM classifier performed better compared to the other classifiers. Gender detection based on spatial pyramid matching was further proposed in [39]. The weighted histogram of the SIFT descriptor was considered for extracting features from sub-regions. For detecting the gender of a writer from a document image, SVM and ensemble classifiers were then employed. The proposed system was evaluated using QUWI and MSHD databases in script-dependent and script-independent scenarios. Better accuracies of 82% and 90% were obtained using an ensemble classifier in script-dependent scenarios considering QUWI and MSHD databases, respectively. Moreover, features such as pen pressure, margins, irregularity, and space between words and SVM and ANN classifiers, were considered for gender detection [40]. The proposed method was evaluated using the FSHS database, and the gender detection accuracies of 94.7% and 97.1% were obtained using SVM and ANN, respectively [40].

Traditional gender detection approaches based on online methods are comparatively fewer than those based on offline methods. An online gender prediction method based on an allographic approach was proposed in [41]. Uppercase characters in Spanish were collected using an online device, and both pen-down and pen-up strokes, as the structural features, were considered to characterize online handwriting. The highest accuracy of 74% was reported [41]. In addition, other features, such as time, space, ductus, and pressure, were considered for online gender detection in the literature [10]. A collection of 240 online handwriting and drawing samples, acquired using a digital pen and Wacom tablets, was created to support gender (male/female) detection in online handwriting [10]. Furthermore, Marzinotto et al. [15] proposed two feature sets, dynamic and spatial, for age and gender detection. The dynamic features consist of acceleration, speed, and jerk, while the spatial features consist of local pen trajectories. A two-level clustering approach using the K-means algorithm and the Bag of Prototype Words (BPW) was proposed for age and gender detection [15]. IRONOFF, an online database composed of English and French words, was considered to evaluate the proposed method [15].

4.2.2 Deep learning methods

Deep learning generally uses several hidden layers to create a deep neural network for learning features and patterns to make intelligent decisions. Deep learning methods have grown substantially as they provide more accurate results with large databases compared to traditional methods. Data augmentation techniques and transfer learning have further been employed to enable the application of deep learning, where a small number of samples are available for training [42].

Researchers from different disciplines, including the handwriting analysis community, have recently used and explored deep learning in different applications [43, 44]. Table 3 shows the recent research on gender detection using deep learning methods. The results are listed based on the databases used for experimental analysis.

Table 3 Overview of the recent research on gender detection using deep learning methods

As demonstrated in Table 3, deep learning was used for feature extraction and classification in the gender detection literature. For example, two convolutional neural network-based methods, ResNet and GoogleNet, were applied to Arabic handwritten samples to extract deep features [42]. An SVM was then considered for gender detection purposes. The ResNet obtained the best accuracy of 83.32% on the FSHS database [42]. AlexNet was further used at different levels of documents, including word, patch, and the whole page, to extract deep features for gender detection [45]. For experimental analysis, Linear Discriminant Analysis (LDA) was employed for detection/classification purposes. The QUWI database and different scenarios were considered to evaluate the proposed method, and the highest gender prediction accuracy of 70.08% was obtained [45]. In [43], a network architecture consisting of four convolutional layers, a single fully-connected layer, and a softmax output layer was proposed for the final gender classification. The majority vote and average softmax were used for classification purposes. Several experiments were conducted on the Hebrew database considering intra, inter, and mixed-language setups to evaluate the proposed system. The highest accuracy of 82.89% was reported using mixed language with the majority vote classification [43]. Recently, bilinear ResNet (B-ResNet), followed by a softmax layer, was proposed to extract fine-grained features from offline handwriting and perform age and gender detection [46]. To evaluate the proposed system, experiments were conducted on the KHATT and HHD (Hebrew) databases, and accuracies of 76.17% and 84% were obtained, respectively [46]. In addition, ATP-DenseNet, with two pathways, feature pyramid, and A-DenseNet, was developed for the problem of gender detection [44]. Page-level features were extracted using a feature pyramid, while the word-level features were extracted using A-DenseNet. IAM, KHATT, and ICDAR 2013 databases were considered to evaluate the proposed method, and accuracies of 77.6%, 74.1%, and 71.8% were obtained, respectively [44].
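The two aggregation rules mentioned above for [43], majority vote and average softmax, can be illustrated with the following minimal sketch. It assumes that a document is split into several patches and that a network has already produced one softmax probability vector per patch; the function names and the example probabilities are hypothetical and are not taken from [43].

```python
from collections import Counter


def majority_vote(patch_probs):
    """Pick the class predicted by the largest number of patches.

    patch_probs: list of per-patch softmax vectors (lists of floats).
    Ties are resolved in favor of the class that was voted for first.
    """
    votes = [max(range(len(p)), key=p.__getitem__) for p in patch_probs]
    return Counter(votes).most_common(1)[0][0]


def average_softmax(patch_probs):
    """Average the softmax vectors over patches, then pick the best class."""
    n_classes = len(patch_probs[0])
    means = [
        sum(p[c] for p in patch_probs) / len(patch_probs)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=means.__getitem__)


# Hypothetical document with three patches and two classes (0 and 1).
probs = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
by_vote = majority_vote(probs)       # two weak votes for class 0 win
by_average = average_softmax(probs)  # one confident patch shifts the mean
```

The example is chosen so that the two rules disagree: majority vote ignores how confident each patch is, whereas average softmax lets a single highly confident patch outweigh two uncertain ones.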

In some research work, handwriting samples were used for classifying more than one biometric characteristic of a writer, such as gender and handedness [47]. A CNN architecture was considered for the feature extraction and classification processes. The IAM and KHATT databases were used for experiments, and accuracies of 80.72% and 68.90% were reported, respectively [47]. Moreover, deep learning methods, including InceptionV3, DenseNet201, and Xception, were used for gender and handedness classification [48]. The IAM and KHATT databases were considered for experimental analysis, and DenseNet201 performed better than the other models. It is worth mentioning that the availability of enough training data is one of the key factors in further improving the accuracies of deep learning-based age/gender/handedness detection approaches. Therefore, using robots for generating human-like data may be one area of research focus in this niche research field.

4.3 Word frequency graph of the related works

We performed a quantitative analysis of the trend of handwriting analysis research between 2012 and 2023 to find the methods and databases frequently used in the literature. The KH Coder software (https://khcoder.net/en/) helps explore and analyze large amounts of unstructured and semi-structured text to distinguish concepts, topics, patterns, and other aspects of interest. The KH Coder can perform word frequency statistics and automatic clustering, and can analyze survey data, such as questionnaires and interviews, based on text mining analysis [49, 50].

This study used the KH Coder to analyze metadata collected through our research [49]. The general steps in data mining using the KH Coder are collecting and organizing the text, loading the text into the software, filtering the words using the chi-square value, and then generating the co-occurrence relationships. We used the KH Coder to conduct a textual analysis of the title, abstract, and keywords of 42 selected papers related to age and gender detection problems.

A co-occurrence network diagram is drawn based on retrieved words with similar appearance patterns, where nodes (circles) show the retrieved words and lines (edges) show the interconnectivity between words. In the co-occurrence network diagram, words with high degrees of co-occurrence are connected through several lines. The size of a circle shows the frequency of occurrence of the word, so words in larger circles appear more frequently than those in smaller circles. Figure 4 shows the resulting network, with 14 subgraphs or clusters describing relationships between topics. For example, the cluster with higher word frequency (i.e., colored in orange) is composed of related words such as “handwriting”, “gender”, “classification”, “feature”, and “age”. The interconnections between words, such as “age-category”, “feature-classification”, and “gender-feature-classification”, informed the derivation of themes. These visualizations demonstrate the results according to the keywords of the papers. It should also be noted that the frequency of the word “gender” is higher than that of the word “age”, which suggests that more research on gender detection was carried out in the last decade. Similarly, the commonly used databases are the QUWI and IAM databases. In addition, English and Arabic scripts are among the most frequently used scripts for age and gender detection in the literature.
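As a rough illustration of how such a network can be derived from text, the sketch below counts in how many documents each word appears (the node size) and connects two words when they tend to appear in the same documents (the edges). It is not the KH Coder implementation: the function name, the use of the Jaccard coefficient as the edge strength, and the threshold `min_jaccard` are assumptions made here for illustration.

```python
from itertools import combinations


def cooccurrence_network(documents, min_jaccard=0.5):
    """Build node frequencies and edges of a word co-occurrence network.

    documents: list of token lists (e.g., tokenized title + abstract).
    Returns (frequency, edges), where frequency maps each word to the
    number of documents containing it, and edges maps word pairs to
    their Jaccard coefficient when it is at least `min_jaccard`.
    """
    doc_ids = {}
    for idx, tokens in enumerate(documents):
        for word in set(tokens):
            doc_ids.setdefault(word, set()).add(idx)

    edges = {}
    for w1, w2 in combinations(sorted(doc_ids), 2):
        both = len(doc_ids[w1] & doc_ids[w2])
        either = len(doc_ids[w1] | doc_ids[w2])
        jaccard = both / either
        if jaccard >= min_jaccard:
            edges[(w1, w2)] = jaccard

    frequency = {word: len(ids) for word, ids in doc_ids.items()}
    return frequency, edges


# Hypothetical toy corpus of three tokenized records.
docs = [
    ["gender", "handwriting", "feature"],
    ["gender", "handwriting"],
    ["age", "handwriting"],
]
frequency, edges = cooccurrence_network(docs)
```

In this toy corpus, "handwriting" receives the largest node because it occurs in all three records, and the strongest edge links "gender" and "handwriting", which co-occur in two of the three records in which either appears.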

Fig. 4 The co-occurrence network diagram of words

Another co-occurrence network based on the year of publication is shown in Fig. 5. Figure 5 demonstrates that the words “handwriting,” “gender,” “classification,” and “feature” were frequently used between 2012 and 2023. In addition, offline methods were more often explored in 2012, whereas online methods were increasingly explored in 2020. Texture features and SVM-based classifiers were heavily investigated in 2017, whereas convolutional neural networks have received more attention since 2018. Age and handedness classification received some attention in 2015.

Fig. 5 The co-occurrence network diagram obtained based on the year of publications

5 Metrics and databases

Databases and evaluation metrics are pillars for fairly comparing the performance of state-of-the-art methods in any research domain. Accuracy and precision, as two evaluation metrics, have been commonly used in the literature for age and gender classification/detection tasks. Accuracy is the number of correctly classified/detected samples divided by the total number of samples. Precision is the fraction of relevant samples among the retrieved samples, i.e., the fraction of samples assigned to a class that truly belong to it.
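The two definitions above can be written out as the following minimal sketch. The function names and the example labels ("F" and "M") are hypothetical; returning 0.0 when no sample is assigned to the class of interest is a convention chosen here, not one prescribed by the literature reviewed.

```python
def accuracy(y_true, y_pred):
    """Correctly classified samples divided by the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive):
    """Fraction of samples predicted as `positive` that truly are `positive`."""
    retrieved = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not retrieved:
        return 0.0  # convention used here when nothing is retrieved
    return sum(t == positive for t in retrieved) / len(retrieved)


# Hypothetical ground truth and predictions for five writers.
y_true = ["F", "F", "M", "M", "M"]
y_pred = ["F", "M", "M", "M", "F"]

acc = accuracy(y_true, y_pred)           # 3 of 5 predictions are correct
prec_f = precision(y_true, y_pred, "F")  # 1 of 2 predicted "F" is correct
prec_m = precision(y_true, y_pred, "M")  # 2 of 3 predicted "M" are correct
```

Because precision is computed per class, a gender detection study may report it separately for each class or as an average over both, which should be kept in mind when comparing figures across papers.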

The development of standard handwritten databases began when research in handwriting/document analysis and recognition started in the early 1990s, and it has received significant attention over the last three decades [51]. The databases have been commonly created by requiring the subjects and writers to fill in a questionnaire with their personal information (metadata), such as gender, age category, handedness, nationality, education level, and profession. There are a few handwriting databases with different numbers of writers and samples, written in a single script or multiple scripts, for gender, age, and handedness detection in the literature [52,53,54,55].

Most publicly accessible databases are offline databases. In offline databases, writers copied one or more texts on separate sheets of paper using pens. Then, documents were scanned using scanners or cameras and converted into image formats, such as PNG, JPEG, GIF, or TIFF. A few available online databases were also collected using Wacom tablets. The lower availability of online databases compared to offline databases may be attributable to the cost of the devices required for collecting online handwriting. In online databases, in addition to handwriting, other features, including pen pressure, angle, speed, and time in the air, were collected to be used for further analysis. A list of databases used in the literature for age and gender detection is provided in Table 4. Table 4 reveals that a relatively larger number of samples were collected in offline databases than in online databases. Further analyses show that most of the available databases were written in English, Arabic, and French scripts.

Table 4 Handwriting databases used for age and gender detection in the literature

IAM, QUWI, KHATT, MSHD, and ICDAR are a few publicly accessible databases commonly used for offline age and gender detection in the literature [27, 29, 30]. The IAM handwriting database includes 1539 English handwritten document pages written by 657 writers [55]. Documents were scanned at a resolution of 300 dpi in PNG format. The lines and words on each page were also segmented and made available in the same format for detailed analysis [55]. The Qatar University Writer Identification (QUWI) database [53] contains Arabic and English handwritten documents. This database is a collection of 1,017 handwriting samples. Each writer contributed two Arabic and two English texts, including two arbitrary and two predefined samples. Documents were scanned at a resolution of 600 dpi in JPEG format [53]. The KHATT is an offline Arabic database [54] of handwritten texts written by 1000 writers in different age categories. Each writer completed a form of four pages, and documents were scanned at resolutions of 200, 300, and 600 dpi. The written samples include 2000 randomly chosen paragraphs from 46 sources covering all Arabic characters [54]. The multi-script handwritten database (MSHD) [52] contains 1300 Arabic and French handwriting samples written by 100 writers. Each writer copied six texts in Arabic, six in French, and one page of digits which can be used for age and gender analysis [52].

It is noted that MSHD and KHATT databases are suitable for text-dependent, text-independent, script-dependent, and script-independent experiments. In script-dependent experiments, training and testing samples are from the same scripts; however, in script-independent experiments, training and testing samples are from different scripts. For instance, a system can be trained with Arabic document images and tested with French/English document images and vice versa.
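The difference between script-dependent and script-independent experiments can be sketched as a simple data split. The snippet below is a minimal illustration under stated assumptions: the record schema (`writer`, `script`, `path`), the function name, and the decision to keep training and testing writers disjoint are choices made here and are not prescribed by the databases themselves.

```python
def split_by_script(samples, train_script, test_script, train_writers):
    """Split samples for a script-dependent or script-independent experiment.

    samples: list of dicts with 'writer', 'script', and 'path' keys
             (a hypothetical schema).
    train_writers: set of writer IDs reserved for training; all other
                   writers are used for testing, so no writer is shared.
    The experiment is script-dependent when train_script == test_script
    and script-independent otherwise.
    """
    train = [
        s for s in samples
        if s["script"] == train_script and s["writer"] in train_writers
    ]
    test = [
        s for s in samples
        if s["script"] == test_script and s["writer"] not in train_writers
    ]
    return train, test


# Hypothetical records: two writers, each with one Arabic and one French page.
samples = [
    {"writer": "w1", "script": "Arabic", "path": "w1_ar.png"},
    {"writer": "w1", "script": "French", "path": "w1_fr.png"},
    {"writer": "w2", "script": "Arabic", "path": "w2_ar.png"},
    {"writer": "w2", "script": "French", "path": "w2_fr.png"},
]

# Script-independent: train on Arabic pages, test on French pages.
train, test = split_by_script(samples, "Arabic", "French", {"w1"})
```

Calling the same function with identical training and testing scripts yields the script-dependent setting, which makes it straightforward to report both protocols on a multi-script database.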

In addition to the above datasets, several competitions were organized over the last decade in conjunction with ICDAR 2009, ICDAR 2011, ICDAR 2013, and ICDAR 2015 for writer, gender, and age detection. Most databases used in these competitions are publicly available to extend research in this field. However, it is worth mentioning that, given the current research trend toward deep learning methods, the number of samples in all benchmark databases is relatively small and insufficient for effectively training such models. Thus, creating databases with a large number of writers and samples would be beneficial for designing more complex deep learning-based methods to obtain more accurate results in this domain.

6 Discussion and remarks

In terms of benchmark and publicly available databases for gender/age detection, three commonly used benchmark databases (IAM, QUWI, and KHATT) are available in the literature. Apart from their limitation in sample size, these databases include only Arabic, English, and French handwritten document images, indicating a lack of databases for other scripts and the unavailability of large-size databases in the literature. Thus, further research on benchmarks and evaluation metrics is necessary in this domain.

For detection and classification purposes, artificial neural networks, convolutional deep neural networks, DT, RF, and SVMs were frequently used in the age and gender detection literature (Tables 2 and 3). Although a direct comparison of the results obtained from various methods on different databases is impossible, results reported in the literature have further brought to our attention a noticeable difference between the results of traditional and deep learning methods in the gender classification task. The highest accuracy of 97.10% was obtained among the traditional methods in the literature using pen pressure, margins, irregularity, and space between words features and an ANN classifier on the FSHS database [40]. Moreover, the authors applied their proposed method to ICDAR 2013, and an accuracy of 91.4% was reported with the SVM classifier [40]. The lowest accuracy is 66.3% using geometrical features and SVM for the classification, where the ICDAR 2013 database was used for evaluation [23]. Among the deep learning methods, the highest reported accuracy is 84% using the DenseNet201 method applied to the IAM database [48]. The lowest accuracy was reported as 68.90% using ReLU and softmax methods applied to the KHATT database [47]. In both studies, CNN was used for the classification. The literature review indicates that in most studies, traditional models, especially SVM-based methods, performed better than the other models, including deep learning-based approaches [40]. The lower accuracy obtained using deep learning compared to SVMs and other traditional models could be due to two reasons. First, the limited number of samples and/or the lack of diversity in the databases make it difficult to train high-performance deep learning models that learn meaningful patterns. Second, deep learning models are highly data-driven and can be sensitive to noisy or outlier data points; if a dataset contains significant amounts of noise, they may be unable to make reliable predictions.

From another perspective, we reviewed Tables 2 and 3 and compared the results obtained based on the databases. It is important to note that the highest accuracy on QUWI was 85% when texture features and ensemble classifiers were used for gender detection [30]. Meanwhile, the highest accuracy of 90% was obtained on the MSHD when SIFT descriptors, spatial pyramid matching, and ensemble classifiers were used for gender detection [39]. The highest accuracy of 94.79% on the ICDAR 2013 database was obtained when Gradient and WD-LBP features and SVMs were used for gender detection [27]. The highest gender detection accuracy of 97.1% was obtained on the FSHS using pen pressure, margins, and irregularity features [40]. Considering the IAM dataset, the highest accuracy of 84% was obtained based on InceptionV3, DenseNet201, and Xception feature extraction methods and CNN classifiers [48]. The highest accuracy of 76.17% was obtained on the KHATT database when B-ResNet was considered for feature extraction, and B-CNNs were employed for the classification [46]. It is worth mentioning that the traditional methods performed relatively better on QUWI, MSHD, FSHS, and ICDAR 2013 databases, whereas deep learning performed better on IAM and KHATT databases.

Overall, it is important to note that deep learning has achieved remarkable success in many research domains and applications. However, it is not a one-size-fits-all solution, and depending on the specific problem, dataset, resources, interpretability requirements, or domain expertise, traditional methods may still be preferred or even outperform deep learning approaches. Thus, there remain many opportunities for researchers in this field to expand the investigation of age and gender classification/detection problems using deep learning methods to obtain more accurate results.

7 Conclusion

This article presented a systematic and comprehensive review of age and gender classification/detection from handwritten documents. A general age and gender identification framework comprising pre-processing, feature extraction, and classification was provided and discussed in detail. The most commonly used databases, their number of samples, and the types of scripts were discussed. The methods in the literature were grouped into traditional and deep learning methods, and the results were compared accordingly. The KH Coder software was used for text mining purposes, and the trends of words in the literature were analyzed.

This study can bring the requirements of handwriting analysis to the attention of novice researchers and open a new direction for them. Although there has been considerable research in this area (age and gender detection from handwriting), it is still considered a challenging problem. Neither computerized analysis nor human experts have achieved highly accurate results for these tasks. In further research on these problems, researchers can investigate the suitability of different kinds of features to characterize gender and age from handwriting and also explore feature selection techniques to identify the most appropriate features for these problems. These types of features would be beneficial, as the annotated datasets in these areas are still small. Deep learning, especially transfer learning and continuous adaptive learning, is another direction of research to improve age/gender detection accuracies. Synthetic data generation, data augmentation, and a combination of these methods may also be investigated by researchers in the future to explore different results and build plausible systems.