1 Introduction

The availability of large and rich textual datasets, mainly created from web content, has fostered the development of data mining and machine learning algorithms and methods at an unprecedented pace. As a result, machine learning for text analysis has become a very active research area Aggarwal (2018). Nowadays, text mining Srivastava and Sahami (2009) and Natural Language Processing (NLP) Eisenstein (2019) constitute the basis for many studies and implementations.

The Wikidata project Vrandečić and Krötzsch (2014) is an example of a freely available and structured data source nurtured by information extracted from Wikipedia articles in different languages. Data gathered from Wikipedia articles undergo a semantic annotation process to expand the Wikidata knowledge base, following a graph-structured data model to integrate new applications.

One such application area is gender identification in text. The goal is to assign a gender label (e.g., “female” or “male”) to a text block based on linguistic properties and other descriptive features. In general, previous research has focused on identifying the gender and profile of the person who writes the text. Examples include gender identification of weblog authors Yan and Yan (2006) and users in microblogging platforms Mukherjee and Bala (2017) or in Customer Relationship Management (CRM) systems Amado et al. (2018).

Building new Wikidata entries relies on existing hints and structured data in Wikipedia articles to complete each topic. For instance, if we want to retrieve structured data about “Gabriel García Marquez”, we can search in Wikidata for the corresponding entry.Footnote 1 Then, we can explore several fields with descriptive features about this person, including the “sex or gender” property.Footnote 2 In this example, the attributed value for this property is “male”. At the time of this writing, the accuracy of this information relies on up to six references to different Wikipedia articles that support the attribution of the “male” value to this field based on structured data. However, as an open project for collaborative knowledge creation, Wikipedia has frequently suffered vandalism and content attacks Adler et al. (2011), Geiger and Ribes (2010). If malicious agents alter original structured data in Wikipedia articles, detecting errors without manual inspection would be a daunting task, even more so given the large size of the Wikidata topic database. Moreover, if structured information about gender is not present in a given Wikipedia article, it would not be possible to assign a value directly to the “sex or gender” property for its corresponding Wikidata entry. Instead, labelling the gender information based on an automated analysis of unstructured textual data in the body of Wikipedia articles could improve resilience against these issues. This example can be easily extrapolated to any other information system that must automatically tag the content of some text block with a gender label.

We focus on a specific case, namely, the problem of automatically determining the gender of the person described in a biographical text. In the remainder of this work, for the sake of space, we will refer to this problem as gender identification. The lack of studies that include languages other than English hinders the development of novel procedures in text mining to tackle this problem better. Instead of focusing on English exclusively, which provides fewer gender elements, we examine the use of Spanish text to capture morphemes that carry gender information and are more prevalent in that language. To achieve this, we bypass usual normalisation steps in traditional text mining, such as stemming and lemmatisation Eisenstein (2019), that discard gender information in linguistic terms. Following this approach, we show that it is possible to assign a gender value to textual content more accurately.

When processing text using machine learning techniques, it is common to use vector representations, characterised by high dimensionality and sparsity, since they contain numerous zero-valued components. In this context of high-dimensional sparse representations, Support Vector Machines (SVM) Moguerza and Muñoz (2006) are computationally adequate classifiers Joachims (2002). However, adapting traditional optimisation problem-solving techniques for SVM classification Joachims (1999) applied to gender identification does not guarantee rapid convergence when the iterations approach the optimal solution of the underlying optimisation problem.

In this paper, within a support vector framework, we take advantage of the geometric properties of the text gender identification task to modify the stopping criterion of the underlying optimisation algorithm, substituting the classical Karush–Kuhn–Tucker criterion Feng and Li (2018) with our proposal. The new criterion incorporates information (the error rate) that is not directly related, in mathematical terms, to the criterion that the classical SVM optimisation algorithm tries to fulfil, and it can potentially adapt the algorithm to the geometry of the problem at hand. In the following, we will refer to this modified SVM as Geometrical SVM (GSVM). To evaluate the performance of this new approach, we compare our proposal to other supervised learning methods Witten et al. (2017) for automated gender identification within biographical texts of literary authors extracted from Wikipedia. We assess whether avoiding the common step of stemming in text mining, which eliminates gender suffixes in relevant words for text analysis, improves performance in gender identification. We show that the GSVM approach accelerates the training phase of the algorithm without significantly affecting the performance of the procedure.

The rest of this paper is organised as follows. Section 2 reviews related research work and identifies limitations of existing approaches for gender identification in textual data. Section 3 briefly describes the foundations of SVM. Section 4 develops the proposed new stopping criterion within support vector optimisation algorithms. Then, Sect. 5 shows the problem statement and the experimental setup to validate the proposed method. Section 6 presents the numerical results. Section 7 discusses additional aspects and some limitations of our approach and lays out possible lines for future work. Finally, we summarise the main conclusions from this research in Sect. 8.

2 Related work on gender identification

Here, we present a summary of previous research work on gender identification, including several methodological aspects that exert a direct influence on this setting.

2.1 Machine learning and text mining

Identifying and extracting useful information and patterns from textual data is a very active research area in machine learning and artificial intelligence Aggarwal (2018). Information Retrieval Baeza-Yates and Ribeiro-Neto (2011), Natural Language Processing (NLP) Eisenstein (2019), Jurafsky and Martin (2009), and text mining Feldman and Sanger (2006), Srivastava and Sahami (2009), Berry and Kogan (2010) are related knowledge areas providing methods and tools for this task.

Although much of this previous work has considered English textual data, some studies have also focused on methodological aspects of text analysis in other languages, as we do here. In particular, Hedlund et al. (2001) study the use of Swedish for cross-language information retrieval. They highlight specific traits that could be exploited in certain applications, for example, morphological features, such as inflexion, derivation, and gender. In our approach, we attempt to leverage similar morphological elements in Spanish to improve gender identification in text.

Besides, our method introduces some changes in the conventional pipeline for text data preparation to retain these valuable morphological elements for gender identification. Previous research confirms the importance of selecting appropriate combinations of pre-processing tasks to improve the accuracy of machine learning algorithms for classification Uysal and Gunal (2014). That is even more relevant in the analysis of languages different from English.

2.2 Identifying the gender of text authors

Gender identification in textual data has attracted significant interest in previous research. One of the most frequent applications has been detecting the gender of the person who writes the text, including microblogging users Mukherjee and Bala (2017), Huang et al. (2014), authors of e-mail messages Corney et al. (2002), chat messages Kucukyilmaz et al. (2006) and weblog posts Yan and Yan (2006), contributors in peer-production online projects Vasilescu et al. (2014), Lin and Serebrenik (2016), Terrell et al. (2017), Das et al. (2019), or gender detection of feedback authors in CRM platforms Lau et al. (2005), Amado et al. (2018). Profiling text authors has also been addressed in multilingual settings Kocher and Savoy (2017), Fatima et al. (2017), López-Santillán et al. (2020), confirming the advantages of content-based methods for this task. Other methodological approaches in this context include the use of graph analysis Kretschmer and Aguillo (2005), Rangel and Rosso (2016) and Part Of Speech (POS) tagging Fourkioti et al. (2019).

In Krüger and Hermann (2019), the authors examine 215 previous research works published between 2017 and 2019 on the gender identification of text authors. According to this review, the best experiment in previous literature reports an accuracy of 93.4% on cases extracted from Facebook posts Markov et al. (2017). It is worth mentioning that all papers considered in this survey follow standard machine learning procedures to prepare textual input, including tokenisation and text normalisation Eisenstein (2019) (such as lemmatisation or stemming). As we explain in Sect. 5, our approach introduces changes in these standard procedures to improve the performance in gender identification, as suggested in previous research Uysal and Gunal (2014). In addition, a common limitation raised in this comparison is that all previous studies only consider a binary gender target (we discuss this matter again in Sect. 8).

Furthermore, according to Krüger and Hermann (2019), there is ample variability in gender identification accuracy across different languages. A closer inspection of results from featured works in this review reveals that experiments with languages providing richer gender information, such as Spanish and Portuguese, outperform the accuracy in English. In our method, we seek to confirm if it is possible to leverage specific features in some languages for gender identification purposes.

2.3 Text gender identification

Text gender identification, that is, deciding the gender of a person described in a text, has received comparatively less attention in previous studies. Nonetheless, there are applications for targeted advertising Jansen et al. (2013), evaluating gender differences in online labour markets Foong et al. (2018) and named entity recognition Cho et al. (2013). In contrast with prior approaches, predominantly based on database services Wais (2016), Santamaría and Mihaljević (2018), more recent work by Das and Paik (2021) shows the utility of applying machine learning algorithms that analyse contextual information. We propose to combine machine learning algorithms with the retention of morphological elements that carry gender information in some languages to further improve the accuracy of gender tagging a person described in text.

3 Background on SVM optimisation

SVM are algorithms whose performance is based on the use of kernels. A kernel K(x, y) is a real-valued function \(K:X \times X \rightarrow {\mathbb {R}}\) that acts as a dot product in a real vector space Z. To this aim, there exists a function

$$\begin{aligned} \Phi : X \rightarrow Z, \end{aligned}$$
(1)

such that \(K(x, y) = \Phi (x)^T \Phi (y)\). X and Z are, respectively, known as input and feature spaces.
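As a minimal illustration of this definition, the following sketch checks numerically that a kernel evaluated in the input space coincides with a dot product in a feature space. The homogeneous quadratic kernel, the explicit map `feature_map` and the two sample points are assumptions chosen for the example; they are not the kernel used in our experiments.

```python
import math


def quadratic_kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x^T y)^2 for 2-D inputs (illustrative choice)."""
    inner = x[0] * y[0] + x[1] * y[1]
    return inner ** 2


def feature_map(x):
    """Explicit map Phi: R^2 -> R^3 whose dot product reproduces the kernel above."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product between two equally sized tuples."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)                       # hypothetical sample points
kernel_value = quadratic_kernel(x, y)                # computed in the input space X
feature_value = dot(feature_map(x), feature_map(y))  # computed in the feature space Z
print(kernel_value, feature_value)
```

Both quantities agree up to floating-point rounding, which is the property that lets SVM work in Z without ever building \(\Phi (x)\) explicitly.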

SVM belong to the type of algorithms based on regularisation theory Moguerza and Muñoz (2006). These methods allow the construction of classification functions by solving an optimisation problem of the form (Tikhonov and Arsenin 1977)

$$\begin{aligned} \min _{f \in H_K} \frac{1}{n} \sum _{i=1}^n L(y_i, f(x_i)) + \mu \Vert f\Vert _K^2, \end{aligned}$$
(2)

where \(\mu > 0\); \(H_K\) denotes the reproducing kernel Hilbert space (RKHS) associated with the kernel K; \(\Vert f\Vert _K\) is the norm of f in the RKHS; \(x_i\) are the n sampled data points; \(y_i \in \{-1, +1\}\) indicates the two possible classes of \(x_i\); and, finally, \(L(y_i, f(x_i))\) is an error function. In this context, \(f(x) = 0\) is a decision surface in \(H_K\). The typical SVM approach uses the specific error function L, called hinge loss, defined as

$$\begin{aligned} L(y_i, f(x_i)) = (1 - y_i f(x_i))_+, \end{aligned}$$
(3)

with \((x)_+ = \max (x, 0)\).

In problem (2), \(\mu\) helps to establish a compromise between the fit of solution f to the data, quantified by L, and the complexity of function f, quantified by \(\Vert f\Vert _K\).

It is straightforward to show that \(\Vert f\Vert _K^2 = \Vert w\Vert ^2\), where \(w=\sum _{i=1}^n \alpha _i \Phi (x_i)\) and \(\Phi\) is the mapping defined in (1). Therefore, problem (2) can be reformulated as

$$\begin{aligned} \min _{w,b} \frac{1}{n} \sum _{i=1}^n (1 - y_i(w^T \Phi (x_i) + b))_+ + \mu \Vert w\Vert ^2. \end{aligned}$$
(4)
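The objective in (4) can be evaluated directly for a given pair (w, b). The sketch below does so for the linear case, that is, assuming \(\Phi\) is the identity map; the toy data, the weights and the value of \(\mu\) are hypothetical and serve only to show how the data-fit and complexity terms combine.

```python
def hinge(y, score):
    """Hinge loss (1 - y * score)_+ from Eq. (3)."""
    return max(0.0, 1.0 - y * score)


def regularised_objective(w, b, xs, ys, mu):
    """Value of objective (4) for a linear decision function (Phi = identity, an assumption)."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in xs]
    data_fit = sum(hinge(y, s) for y, s in zip(ys, scores)) / len(xs)
    complexity = sum(wj * wj for wj in w)  # squared norm of w
    return data_fit + mu * complexity


# Hypothetical toy data: the third point lies on the wrong side of the surface.
xs = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
ys = [1, -1, -1]
objective = regularised_objective(w=(1.0, 0.0), b=0.0, xs=xs, ys=ys, mu=0.1)
print(objective)
```

Only the misclassified third point contributes to the hinge term here; larger values of \(\mu\) would penalise the norm of w more heavily relative to that fit term.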

Problem (4) is equivalent to solving the following optimisation dual problem Moguerza and Muñoz (2006):

$$\begin{aligned} \begin{aligned} \min _{\varvec{\beta }}\quad & \frac{1}{2} \varvec{\beta }^T Q \varvec{\beta } - \varvec{e}^T \varvec{\beta } \\ \text {s.t.}\quad & \varvec{y}^T \varvec{\beta } = 0, \\ & \varvec{0} \le \varvec{\beta } \le C \varvec{e}, \end{aligned} \end{aligned}$$
(5)

where \(\varvec{\beta } = (\beta _1, \ldots , \beta _n)^T\), \(\varvec{y} = (y_1, \ldots , y_n)^T\), \(\varvec{0}\) is a vector of all zeros, \(\varvec{e}\) is a vector of all ones, Q is a positive definite \(n \times n\) symmetric matrix with \(Q_{ij} = y_i y_j K(x_i, x_j)\), and \(C = \frac{1}{2 \mu n}\) is a constant. This problem is convex and quadratic and, therefore, every local minimum is a global minimum.

For the sake of simplicity, let us define \(g(\varvec{\beta })\) as the gradient of the objective function of problem (5), that is, \(g(\varvec{\beta }) = Q \varvec{\beta } - \varvec{e}\).

A vector \(\varvec{\beta }\) is a stationary point of (5) if and only if there is a number d and two non-negative vectors \(\varvec{\lambda }\) and \(\varvec{\mu }\), such that

$$\begin{aligned} \begin{aligned} g(\varvec{\beta }) + d \varvec{y}&= \varvec{\lambda } - \varvec{\mu } , \\ \varvec{\lambda _i} \varvec{\beta _i}&= 0, i = 1 \ldots n,\\ \varvec{\mu _i}(C - \varvec{\beta _i})&= 0, i = 1 \ldots n,\\ \varvec{\lambda _i}, \varvec{\mu _i}&\ge 0, i = 1 \ldots n. \end{aligned} \end{aligned}$$
(6)

It can be shown, Chen et al. (2006), that a vector \(\varvec{\beta }\) such that \(0 \le \varvec{\beta _i} \le C\) satisfies conditions (6) if and only if

$$\begin{aligned} \begin{aligned} -y_i g(\varvec{\beta })_i \le d,{} & {} \quad \forall i \in I_{up}(\varvec{\beta }), \\ -y_i g(\varvec{\beta })_i \ge d,{} & {} \quad \forall i \in I_{low}(\varvec{\beta }), \end{aligned} \end{aligned}$$
(7)

where

$$\begin{aligned} \begin{aligned} I_{up}(\varvec{\beta })&\equiv \{t \mid \beta _t< C, y_t = 1 \text { or } \beta _t> 0, y_t = -1\}, \text {and}\\ I_{low}(\varvec{\beta })&\equiv \{t \mid \beta _t < C, y_t = -1 \text { or } \beta _t > 0, y_t = 1\}. \end{aligned} \end{aligned}$$
(8)

From (7), it holds that \(m(\varvec{\beta }) \le M(\varvec{\beta })\), where

$$\begin{aligned} \begin{aligned} m(\varvec{\beta })&\equiv \max _{i \in I_{up}(\varvec{\beta })} - y_i g(\varvec{\beta })_i, \\ M(\varvec{\beta })&\equiv \min _{i \in I_{low}(\varvec{\beta })} - y_i g(\varvec{\beta })_i. \end{aligned} \end{aligned}$$
(9)

In the SVM literature, problem (5) is solved using the so-called Sequential Minimal Optimisation (SMO) type algorithms (Joachims 1999; Platt 1998). These algorithms are essentially Newton-type quadratic methods that, to make the problem computationally tractable, consider only a subset of variables in each iteration, the so-called working set, instead of working with the entire matrix \(\varvec{Q}\).

Theorem 1

Let \(\{\varvec{\beta }^k\}\) be the infinite sequence generated by an SMO-type method for problem (5). Then, if Q is a positive definite matrix, the limit point of \(\{\varvec{\beta }^k\}\) is the unique and global minimum of problem (5).

Proof

See Chen et al. (2006). \(\square\)

As a consequence, the following corollary holds.

Corollary 3.1

If \(\{\varvec{\beta }^k\}\) is an infinite sequence, then the following two limits exist and are equal:

$$\begin{aligned} \lim _{k \rightarrow \infty } m(\varvec{\beta }^k) = \lim _{k \rightarrow \infty } M(\varvec{\beta }^k). \end{aligned}$$
(10)

Considering the definition of the sets \(I_{up}\) and \(I_{low}\) from Eq. (8), and in particular the \(\varvec{\beta }^k\) involved within each set (within bounds), for k large enough and a small tolerance \(\epsilon > 0\), the following condition holds:

$$\begin{aligned} m(\varvec{\beta }^k) - M(\varvec{\beta }^k) \le \epsilon , \end{aligned}$$
(11)

that is

$$\begin{aligned} \lim _{k \rightarrow \infty } \vert {m(\varvec{\beta }^k) - M(\varvec{\beta }^k)}\vert = 0. \end{aligned}$$

Based on these results, SMO-type algorithms implement the following stopping criterion:

$$\begin{aligned} \vert {m(\varvec{\beta }^k) - M(\varvec{\beta }^k)}\vert \le \epsilon . \end{aligned}$$
(12)
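To make the quantities behind criterion (12) concrete, the sketch below computes the gradient \(g(\varvec{\beta })\), the index sets in (8) and the values \(m(\varvec{\beta })\) and \(M(\varvec{\beta })\) in (9) for a two-point problem. The kernel matrix, the value of C and the tolerance are hypothetical, and the code assumes both classes are present so that neither index set is empty; it is not the implementation used by any particular SMO solver.

```python
def gradient(beta, y, K):
    """g(beta) = Q beta - e, with Q_ij = y_i y_j K(x_i, x_j)."""
    n = len(beta)
    return [sum(y[i] * y[j] * K[i][j] * beta[j] for j in range(n)) - 1.0
            for i in range(n)]


def m_and_M(beta, y, K, C):
    """Return (m(beta), M(beta)) as defined in Eqs. (8) and (9)."""
    g = gradient(beta, y, K)
    idx = range(len(beta))
    i_up = [t for t in idx
            if (beta[t] < C and y[t] == 1) or (beta[t] > 0 and y[t] == -1)]
    i_low = [t for t in idx
             if (beta[t] < C and y[t] == -1) or (beta[t] > 0 and y[t] == 1)]
    m = max(-y[t] * g[t] for t in i_up)
    M = min(-y[t] * g[t] for t in i_low)
    return m, M


def classical_stop(beta, y, K, C, eps):
    """Stopping criterion (12): |m(beta) - M(beta)| <= eps."""
    m, M = m_and_M(beta, y, K, C)
    return abs(m - M) <= eps


# Hypothetical problem: x_1 = 1 (class +1), x_2 = -1 (class -1), linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
at_start = classical_stop([0.0, 0.0], y, K, C=10.0, eps=1e-3)
at_optimum = classical_stop([0.5, 0.5], y, K, C=10.0, eps=1e-3)
print(at_start, at_optimum)
```

At the starting point the gap between m and M equals 2, so the criterion is not met; at \(\varvec{\beta } = (0.5, 0.5)^T\) the gradient vanishes and the gap is zero.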

4 Geometrical stopping criterion

The geometry of the SVM decision function linked to the training data is quantifiable within each iteration by measuring or estimating the error rate. In this sense, when such an error rate stabilises, the decision function fulfils the requirements for successfully classifying the data. Figure 1 shows an example with two 2-dimensional Gaussian clouds. Running an SVM up to iteration 2 produces the separating surface represented in Fig. 1a, whereas Fig. 1b exhibits the separating surface calculated up to iteration 18. The decision function in Fig. 1a provides the same empirical error as the decision function in Fig. 1b, which required more iterations and, therefore, more training time for its construction. In contrast, building the decision function in Fig. 1a needed fewer training iterations.

Fig. 1 Two separating surfaces calculated with SVM: (a) separating surface obtained after 2 iterations; (b) separating surface calculated after 18 iterations

Based on these foundations, let us define \(\varepsilon _k\) as the error rate for the training set at iteration k and \(\tau > 0\) as a real value acting as a tolerance. According to (3),

$$\begin{aligned} \varepsilon _k = \frac{1}{n} \sum _{i=1}^n (1 - y_i f(x_i))_+. \end{aligned}$$
(13)

Let \(\delta _k = \vert {\varepsilon _k - \varepsilon _{k-5}}\vert\). The method will stop when it holds that

$$\begin{aligned} (\delta _k \le \tau ) \wedge (\delta _{k-5} \le \tau ) \wedge (\delta _{k-15} \le \tau ) \wedge (|M(\varvec{\beta }^k) - m(\varvec{\beta }^k) |\le 100 \epsilon ), \end{aligned}$$
(14)

where the symbol “\(\wedge\)” denotes the logical operator “and”, that is, the four inequalities within the criterion must hold simultaneously. By measuring inequality \(\delta _k \le \tau\) every five iterations and relaxing the classical criterion, we can avoid undesired error peaks caused by randomness. The following theorem demonstrates that there is a direct relationship between the stopping criteria (12) and (14).

Theorem 2

Given \(\tau > 0\) and \(\epsilon > 0\), there exists an iteration number \(k_{\tau ,\epsilon }\), such that \(\delta _k \le \tau\) and \(\vert {m(\varvec{\beta }^k) - M(\varvec{\beta }^k)}\vert \le \epsilon , \forall k > k_{\tau , \epsilon }\).

Proof

By Corollary 3.1, it holds that

$$\begin{aligned} \lim _{k \rightarrow \infty } \vert {m(\varvec{\beta }^k) - M(\varvec{\beta }^k)}\vert = 0. \end{aligned}$$

Thus, there exists an iteration number \(k_{\epsilon }\), such that \(\forall k > k_{\epsilon }\), it holds that

$$\begin{aligned} \vert {m(\varvec{\beta }^k) - M(\varvec{\beta }^k)}\vert \le \epsilon . \end{aligned}$$

Let the vector \(\varvec{\beta }^*\) denote the solution to problem (5), where \(x_i\) such that \(\beta _i^* > 0\) are the so-called support vectors. It is well known that the function \(f^*\), which determines the decision surface \(f^*(x) = 0\), takes the form

$$\begin{aligned} \begin{aligned} f^*(x)&= \sum _{i=1}^n \beta _i^*y_i K(x_i,x) + b^*, \end{aligned} \end{aligned}$$
(15)

with

$$\begin{aligned} b^* = - \displaystyle \frac{\sum _{i=1}^n \beta _i^* y_i K(x_i, x^+)}{2} + \frac{\sum _{i=1}^n \beta _i^* y_i K(x_i, x^-)}{2}, \end{aligned}$$

where \(x^+\) and \(x^-\) are two support vectors in classes +1 and -1, respectively, such that their associated Lagrange multipliers \(\beta ^+\) and \(\beta ^-\) hold that \(0< \beta ^+ < C\) and \(0< \beta ^- < C\).

Let us consider the decision function determined by problem (5), at iteration k

$$\begin{aligned} f^k(x) = \sum _{i=1}^n \beta _i^ky_i K(x_i,x) + b^k. \end{aligned}$$

It is straightforward to show that, \(\forall \gamma > 0\), there exists an iteration number \(k_{\gamma }\), such that \(\forall k > k_{\gamma }\)

$$\begin{aligned} \vert {f^k(x) - f^*(x)}\vert \le \gamma . \end{aligned}$$
(16)

This is due to Theorem 1, which guarantees that

$$\begin{aligned} \lim _{k \rightarrow \infty } \varvec{\beta }^k = \varvec{\beta }^*, \end{aligned}$$

and, hence, that \(\forall \nu > 0\), there exists an iteration number \(k_{\nu }\) such that, \(\forall k > k_{\nu }\), it holds that

$$\begin{aligned} \vert {\varvec{\beta }^k - \varvec{\beta }^*}\vert \le \nu . \end{aligned}$$

Taking \(\nu\) small enough, (16) holds for \(k_{\gamma } = k_{\nu }\). Given \(\tau > 0\), taking \(\gamma\) small enough in (16), \(\forall k > k_{\gamma }\), considering definition (13), it holds that

$$\begin{aligned} \delta _k = \vert {\varepsilon _k - \varepsilon _{k-5}}\vert \le \tau , \end{aligned}$$

and the theorem holds for \(k_{\tau , \epsilon } = \max \{k_{\epsilon }, k_{\nu }\}\). \(\square\)

Since criterion (14) is based on the error rate, it is intuitive to fix a value for \(\tau\). We can set it to, for example, \(\tau = 0.0005\), since an error rate of up to \(0.05 \%\) can be considered significantly low. This intuition does not exist for criterion (12), because the magnitude of \(g(\varvec{\beta })\) cannot be estimated in advance. In Sect. 5, we demonstrate that, for the particular case of text gender identification, we can find suitable generalising decision surfaces that fulfil criterion (14) in significantly fewer iterations than required to meet the commonly used theoretical criterion (12). The key point is the high-dimensional setting in which the text is represented, which allows reaching a good empirical generalisation quickly. Therefore, from an empirical point of view, it is expected that GSVM inherits the generalisation properties of SVM.
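Criterion (14) can be sketched as a small predicate over the history of training error rates. In the code below, the function name, the list-based bookkeeping and the default value of `eps` are assumptions made for illustration; only \(\tau = 0.0005\), the five-iteration lag and the factor 100 come from the text above.

```python
def gsvm_should_stop(errors, m_k, M_k, tau=0.0005, eps=0.001):
    """Evaluate criterion (14) at the latest iteration.

    errors[j] is the training error rate at iteration j (0-indexed).
    m_k and M_k are m(beta^k) and M(beta^k) from Eq. (9).
    eps is a hypothetical default for the classical tolerance.
    """
    k = len(errors) - 1
    if k < 20:
        # delta_{k-15} compares iterations k-15 and k-20, so 21 values are needed.
        return False

    def delta(j):
        return abs(errors[j] - errors[j - 5])

    error_is_stable = all(delta(j) <= tau for j in (k, k - 5, k - 15))
    relaxed_classical = abs(M_k - m_k) <= 100 * eps
    return error_is_stable and relaxed_classical


# Hypothetical error trace: the error drops early and then stays flat.
trace = [0.5] * 5 + [0.1] * 21
stop_now = gsvm_should_stop(trace, m_k=0.05, M_k=0.0)
print(stop_now)
```

With a flat error trace and a gap of 0.05, the relaxed bound \(100 \epsilon = 0.1\) is satisfied and the predicate returns True, whereas the classical criterion (12) with the same `eps` would still require further iterations.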

5 Experiments

In this section, we describe in detail the gender identification problem and the experimental setup.

5.1 Problem statement

We set out the problem of gender identification in textual data as a supervised learning classification task. Given a text block whose content can be tagged with a gender label associated with the person described in the text, the algorithm must automatically infer such a label. In this case, we consider two possible output labels, \(L = \{\text {``female''}, \text {``male''}\}\), since our experimental dataset only contains instances of these two gender classes, as described in Sect. 5.3. Hence, we restrict our choice of algorithms to those geared towards binary classification. The same approach could be extended to consider additional output labels, reformulating the problem as a one-versus-all or a multi-label classification task.

Our framework differs from previous approaches in two main aspects:

  • While prior works follow conventional text-mining procedures, such as stemming, we intentionally avoid this step to retain the gender suffix in terms and explore its impact on gender identification for text in strongly inflected languages like Spanish.

  • Instead of tagging a sequence of named entities in a text, we tackle the problem of assigning a gender label to a block of text or a complete document that describes a person. That is a relevant problem for many applications, including automated annotation of entries in semantic databases.

Many languages are inflected, meaning that words can change their form to reflect grammatical information, such as number, tense, or gender Hedlund et al. (2001). In Spanish, words ending in “-o” or a consonant denote masculine gender, whereas those ending in “-a” are primarily feminine. Stemming attempts to remove the differences between inflected forms of a word to reduce each word to its root form. This pre-processing has been the standard approach in text mining and Information Retrieval to improve the precision in identifying key terms in a given text.

Nevertheless, as mentioned above, in Spanish, the suffix of many words can be a handy and direct indicator of gender. For this reason, it seems reasonable to assume that if we do not apply stemming and retain gender suffixes instead, we can improve accuracy in gender identification. Thus, we propose the following approach for gender identification in textual data:

  1. Start by applying standard procedures in text mining to prepare the raw input text, including tokenisation and removal of stopwords (additional implementation details are provided in Sect. 5).

  2. Before creating the vectorised representation for each document, conventional text-mining procedures suggest performing some text normalisation procedure Sproat et al. (2001), such as stemming, to eliminate inflectional affixes, provide a typical representation of similar words and reduce the vocabulary size. On the contrary, we propose suppressing this text normalisation step, retaining instead inflectional suffixes that carry extra information about gender in specific languages.

  3. Then, we resume the standard pipeline for document preparation, generating a vectorized representation of each document and building a term-document matrix (again, further details are provided in Sect. 5).
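The effect of skipping the normalisation step can be sketched on two short Spanish sentences. The stopword list, the tokeniser and, in particular, `toy_stem` are hypothetical stand-ins written for this example: `toy_stem` merely strips a final "-a", "-o", "-as" or "-os" and is not the stemmer shipped with any text-mining package.

```python
import re

# Tiny illustrative stopword list; a real run would use a full Spanish list.
STOPWORDS = {"la", "el", "es", "una", "un", "de", "y"}


def tokenise(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def toy_stem(token):
    """Hypothetical stand-in for a stemmer: drop a final -a/-o/-as/-os."""
    return re.sub(r"(as|os|a|o)$", "", token) if len(token) > 4 else token


def prepare(text, stem=False):
    """Steps 1 and 2 of the list above; stem=False is the proposed variant."""
    tokens = [t for t in tokenise(text) if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens] if stem else tokens


female = "La escritora española es una novelista"
male = "El escritor español es un novelista"

print(prepare(female), prepare(male))                        # suffixes retained
print(prepare(female, stem=True), prepare(male, stem=True))  # suffixes removed
```

Without stemming, the two token lists differ in "escritora"/"escritor" and "española"/"español"; after the toy stemming step they become identical, so the gender cue is no longer available to the classifier.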

This strategy applies to preparing textual data that will train a machine learning algorithm for gender identification. To show the applicability of this procedure in a real setting, we have designed a series of validation experiments, described in the next section.

5.2 Experimental setup

The main objective is to evaluate the capacity of our proposed framework for automated identification of gender associated with real-world text documents. For this purpose, we retrieve biographical articles describing literary authors from the English and Spanish versions of Wikipedia, whose data feed other semantic knowledge-based systems like Wikidata. Therefore, two different datasets were created, one for each language. We have designed our dataset’s construction to ensure equal representation of the two labels in biographical entries chosen for our experiments in English and Spanish.

Implementing our procedure on input text in English and Spanish, we aim to confirm if avoiding stemming has a positive impact on the performance of our gender identification classifier. Our curated dataset consisted of biographies of female and male authors. This choice allows us to frame the experiment as a binary classification problem for which several standard machine learning algorithms exist. Figure 2 shows an overview of the experimental procedure to assess each supervised learning method.

Fig. 2 Description of the proposed framework to test alternative supervised learning methods

Each dataset of biographies in English and Spanish is divided into training and testing subsets. Specifically, 70% of the biographies in each language are used for training, whereas the remaining 30% are held out as a testing set to assess the performance of supervised learning algorithms. The following section provides additional details about retrieving and preparing both datasets.
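A 70/30 partition of this kind can be sketched as follows. The helper name, the use of integer identifiers instead of page titles and the choice of a plain (unstratified) shuffle are assumptions; the text does not state whether the split preserves the class balance within each subset. The ten repetitions correspond to the ten random trials reported in Sect. 6.

```python
import random


def split_70_30(items, train_fraction=0.7, seed=None):
    """Shuffle the items and cut them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 1000 English biographies.
biography_ids = list(range(1000))

# One split per trial, each with its own (hypothetical) seed.
trials = [split_70_30(biography_ids, seed=s) for s in range(10)]
train, test = trials[0]
print(len(train), len(test))
```

Fixing a seed per trial keeps each partition reproducible while still varying the split across trials.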

5.3 Datasets

To begin with, we retrieve biographies of literary authors from Wikipedia, crawling the web interface directly using the WikipediR package Keyes and Tilbert (2017), available for the R statistical environment R Core Team (2022). This package can obtain a list of pages, subcategories, page content, and other information about a specified category.

The English dataset Gomez et al. (2021a), publicly available,Footnote 3 consists of 1000 biographies about writers created in the English Wikipedia. These articles are equally partitioned into a female and a male set. Female biographies have been extracted from the category “19th-century_women_writers”, whereas the first 500 pages of the “19th-century_male_writers” category have been obtained for male biographies.

The Spanish dataset Gomez et al. (2021b), also publicly available,Footnote 4 comprises 832 biographies from the Spanish Wikipedia. Female biographies have been extracted from the “Escritoras de España” category. In contrast, the first 416 pages of category “Escritores de España del siglo XX” have been retrieved for male biographies. Hence, these biographies are also divided into a female and a male set.

Table 1 summarises some general statistics from a preliminary exploratory analysis of the English and Spanish datasets described above.

Table 1 Total and average number describing the English and Spanish datasets

We pre-process each biography using the tm text-mining R package Feinerer et al. (2008). Stop words (common words that usually do not add helpful information for the analysis) Eisenstein (2019), numbers, punctuation marks, and white spaces are removed from the biographies. Then, the text in every biography is converted to lowercase, and the corresponding term-biography matrix is created, as presented in Table 2. Thus, each biographical set can be represented as an \(m \times n\) matrix, where m is the number of unique terms in the dictionary and n is the number of biographies in the training set. Each element \(w_{ij}\) of the term-biography matrix represents the importance or weight of the term i in the biography j. To obtain the value of \(w_{ij}\), we use the TF-IDF measure Aizawa (2003), calculated as

$$\begin{aligned} w_{ij} = tf_{ij} \times \log \left( \frac{n}{df_i} \right) , \end{aligned}$$
(17)

where \(tf_{ij}\) denotes the number of occurrences of the term i in the biography j; n is the total number of biographies in the training set; and, finally, \(df_i\) represents the number of biographies in which the term i appears.
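Equation (17) can be sketched in a few lines. The function name and the three toy token lists below are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not fixed by the formula; this is not the weighting routine of the tm package.

```python
import math
from collections import Counter


def tfidf_matrix(train_docs):
    """Return (vocabulary, weights) with weights[i][j] = w_ij as in Eq. (17).

    train_docs is a list of token lists, one per training biography.
    """
    n = len(train_docs)
    counts = [Counter(doc) for doc in train_docs]
    vocabulary = sorted(set().union(*train_docs))
    # df_i: number of biographies in which term i appears
    df = {term: sum(1 for c in counts if term in c) for term in vocabulary}
    weights = [[c[term] * math.log(n / df[term]) for c in counts]
               for term in vocabulary]
    return vocabulary, weights


# Hypothetical, already pre-processed biographies.
docs = [
    ["escritora", "novelista"],
    ["escritor", "novelista"],
    ["escritora", "poeta", "poeta"],
]
vocabulary, weights = tfidf_matrix(docs)
for term, row in zip(vocabulary, weights):
    print(term, [round(w, 3) for w in row])
```

Rows correspond to terms and columns to biographies, matching the \(m \times n\) layout described above. A term that occurs in every training biography receives weight zero, because \(\log (n / df_i) = 0\) in that case.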

Table 2 Example of term-biography matrix, where \(w_{ij}\) represents the importance of term i in biographic entry j

We must remark that only training biographies contribute to the construction of the dictionary of terms that represents all biographies. That is, when the classifier checks new biographies in the testing set, any terms not found in biographies from the training set are simply ignored when calculating the distance. Likewise, synonyms, abbreviations, or alternative forms for a given term have not been considered in our study.
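The handling of unseen terms can be sketched as a projection of each test biography onto the training dictionary. The vocabulary, the inverse-document-frequency values and the test tokens below are illustrative placeholders, not values from our datasets.

```python
from collections import Counter


def vectorise(tokens, vocabulary, idf):
    """Represent a biography over the training vocabulary only.

    Tokens absent from the vocabulary are silently dropped.
    """
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocabulary]


# Hypothetical dictionary and weights learned from the training set.
vocabulary = ["escritor", "escritora", "novelista"]
idf = {"escritor": 0.69, "escritora": 0.69, "novelista": 0.0}

# "dramaturga" never appeared in training, so it does not affect the vector.
test_tokens = ["escritora", "dramaturga", "escritora"]
vector = vectorise(test_tokens, vocabulary, idf)
print(vector)
```

The resulting vector always has one component per training term, so training and testing biographies live in the same space regardless of the words that appear only at test time.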

6 Results

This section presents the results of experiments comparing different strategies for text gender identification. Experiments for each classification algorithm involve ten trials of randomly selected train-test splits. To support the results, Wilcoxon statistical hypothesis tests Hollander et al. (2013) were performed, considering statistical significance for p values lower than 0.05.

6.1 GSVM versus standard SVM

The following sections summarise results from the evaluation experiments comparing GSVM with standard SVM Joachims (1998) in the context of text gender identification.

6.1.1 Iteration count comparison: GSVM versus standard SVM

Table 3 compares the iteration count for GSVM and a typical SVM using the classical stopping criterion. On average, for the Spanish dataset, GSVM converges in approximately \(18\%\) (with stemming) and \(25\%\) (without stemming) fewer iterations than the standard SVM approach. In turn, for the English dataset, GSVM reduces the number of iterations by approximately \(42\%\) (with stemming) and \(44\%\) (without stemming). The large standard deviation values for GSVM arise because, in some experiments, both approaches reach the maximum number of iterations, that is, the number of iterations of the standard SVM approach.

The Wilcoxon test for the English dataset shows that the improvement of GSVM over SVM is statistically significant, both when stemming is applied (p value = 0.00195) and when it is not (p value = 0.00589). In the case of biographies in Spanish, the improvement is statistically significant when stemming is not included (p value = 0.00195), but not significant when stemming is used (p value = 0.1).

Table 3 Average (avg.) and standard deviation (s.d.) of iteration count in training stages

Figures 3a, 3b, 4a and 4b show, for each experiment, the iteration count at which GSVM and standard SVM stop training. In all cases, the GSVM stopping criterion detects the point at which the error rate stabilises (dotted line). In contrast, the classical SVM stopping criterion requires more iterations to stop without providing any improvement in the error rate.

Fig. 3 No stemming. Comparison of the stopping criteria when the stemming pre-processing is not used

Fig. 4 Stemming. Comparison of the stopping criteria when the stemming pre-processing is applied

At this point, it is worth noting that the performance of the GSVM stopping criterion for biographies in English is similar regardless of whether stemming pre-processing is applied (see Figs. 3b and 4b). Another noticeable finding is that the error rate for biographies in Spanish exhibits an oscillating pattern when stemming is applied (Fig. 4a), which is not observed without stemming (Fig. 3a). This behaviour may be due to a significant information loss introduced by the stemming operation in this language.

As a final comment, it is worth mentioning that the evaluation metrics are similar, in all cases, for the “female” and “male” output labels in both languages.

6.1.2 Accuracy comparison: GSVM versus standard SVM

To compare the performance of standard SVM and GSVM, we use the well-known metrics of precision, recall and \(F_1\) score Baeza-Yates and Ribeiro-Neto (2011), Sokolova and Lapalme (2009), Olson and Delen (2008). Table 4 shows a comparison of SVM and GSVM for biographies in English, with and without stemming. For each method and metric, this table shows the average (avg.) and standard deviation (s.d.). In addition, the p value of the Wilcoxon test is shown for each comparison. The Wilcoxon test results indicate that there are no statistically significant differences for any of the metrics considered, as shown in the p value column, where all values are greater than 0.05. Results in Tables 4 and 5 empirically show that, for text gender identification problems, GSVM inherits the properties of SVM, providing similar accuracy results.
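For reference, the per-label metrics can be computed as in the sketch below. The function name `per_label_metrics` and the toy label sequences are illustrative assumptions; they do not reproduce the evaluation code or data of the study.

```python
def per_label_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one output label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical predictions for five biographies
y_true = ["female", "female", "male", "male", "female"]
y_pred = ["female", "male", "male", "male", "female"]
for label in ("female", "male"):
    print(label, per_label_metrics(y_true, y_pred, label))
```

Computing the metrics separately for the “female” and “male” labels mirrors the per-gender reporting used in the tables of this section.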

Table 4 English. SVM and GSVM performance metrics for the classification of biographies in English, with and without stemming

Table 5 shows a comparison of SVM and GSVM for biographies in Spanish, with and without stemming. Again, for each method and metric, this table shows the average (avg.), the standard deviation (s.d.), and the corresponding p values of the Wilcoxon test. As with biographies in English, the Wilcoxon test results indicate that there are no statistically significant differences for any of the metrics (p values are greater than 0.05). Hence, the reduction in the number of iterations achieved by GSVM does not affect the classification accuracy.

Table 5 Spanish. SVM and GSVM performance metrics for the classification of biographies in Spanish, with and without stemming

Table 6 compares, for GSVM, stemming versus no stemming, along with the corresponding p values. For biographies in Spanish, results show significant degradation when stemming is included, with p values of 0.0097, 0.00586 and 0.00195 for precision, recall, and \(F_1\) score, respectively, for the “female” label, and p values of 0.0137, 0.00195 and 0.00195 for precision, recall, and \(F_1\) score, respectively, for the “male” label. In contrast, for biographies in English, p values for all metrics exceed the 0.05 significance level, indicating the absence of a significant effect.

Table 6 Comparison of performance metrics with and without stemming when the GSVM method is applied

6.2 GSVM versus other algorithms

Next, we compare the performance and training time of GSVM against those of other well-known machine learning techniques, namely Random Forests (RF) Breiman (2001) and Boosting Schapire (1990), implemented in standard software libraries. In these tests, the algorithms' hyperparameters are fixed to their default values. For every algorithm and language, we compare two alternative text-mining workflows for each metric: i) including a stemming step, as in most conventional text-mining applications; and ii) avoiding stemming, retaining any affixes that provide extra gender information in some languages. We use time as a comparison metric because iteration counts cannot be compared fairly, as the implementation of each method follows a different iteration approach.

6.2.1 Training time comparison: GSVM versus other algorithms

Table 7 summarises the execution time (in seconds) for all versions of the algorithms considered in this study. Using stemming for the Spanish language, GSVM is, on average, up to 10.05 times faster than the Boosting method and 42.87 times faster than RF, whereas, for the English language, GSVM is, on average, up to 15.79 times faster than Boosting and 59.15 times faster than RF. When stemming is not applied, for the Spanish language, GSVM is, on average, up to 49.63 times faster than Boosting and 81.00 times faster than RF. As for English, GSVM is, on average, up to 22.91 times faster than Boosting and 77.70 times faster than RF. In summary, based on the conducted Wilcoxon test, the p-value column indicates a statistically significant difference between the methods, showing that the GSVM training time is consistently lower than in the other methods.

Table 7 Average (avg.) and standard deviation (s.d.) for training time

6.2.2 Accuracy comparison: GSVM versus other algorithms

Since stemming does not appear to be significant when processing biographies in English, for the sake of space, we will focus the remainder of our analysis on biographies in Spanish. Table 8 shows a comparison of stemming versus no stemming, and the corresponding p values, for GSVM, Boosting, and Random Forest. GSVM obtains very promising results, especially when no stemming is used, with precision, recall, and \(F_1\) score over \(90\%\) in most cases, although Boosting and Random Forest mostly improve on the GSVM results. This is because both techniques belong to the so-called ensemble methods (Sagi and Rokach 2018). By training multiple models and integrating their predictions, these methods enhance the predictive performance of a single model. We must remark that, since GSVM behaves like a standard SVM in terms of accuracy, these results should be considered from the point of view of the importance of stemming versus no stemming in languages such as Spanish. This pre-processing technique also affects the results that would be obtained using other types of classifiers.

In addition, the p values in Table 8 show a significant degradation in performance for all algorithms when stemming is applied. These results consistently show that, for the Spanish language, precision, recall, and \(F_1\) score are over 90% in most cases without stemming, whereas applying stemming causes a degradation of around 15%.

Table 8 Accuracy metrics for methods on biographies in Spanish

7 Discussion

In this section, we consider some explanations for detected differences in gender identification performance and ponder over possible limitations in our work.

7.1 Classification algorithms’ comparison

Results suggest that, once we control for whether stemming is applied during textual data preparation, performance metrics are quite similar, regardless of the machine learning algorithm selected for classification. Nevertheless, there are differences depending on the language used.

As a result, we can conclude that machine learning classification algorithms for gender identification can be directly affected by suffixes present in the datasets used to train the algorithms. Our results confirm that stemming pre-processing, a method that apparently simplifies the classification problem, may introduce gender confusion when solving the problem at hand.
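A toy example may help to illustrate how stemming can erase gender cues in Spanish. The suffix rule below is a deliberately simplified, hypothetical stand-in for a real stemmer (it is not the stemming algorithm used in the study), and the tokens are invented for the example.

```python
# Hypothetical toy rule set: strips common Spanish gender/number endings.
# Real stemmers apply far richer rules; this only mimics their effect.
GENDER_SUFFIXES = ("as", "os", "a", "o")

def toy_stem(token):
    """Remove one gender-bearing ending, keeping at least three characters."""
    for suffix in GENDER_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

female_bio = ["escritora", "nacida", "madrid"]
male_bio = ["escritor", "nacido", "madrid"]

# Without stemming, the two biographies differ in gender-marked terms.
print(set(female_bio) ^ set(male_bio))

# After toy stemming, those terms collapse into the same stems.
stemmed_female = {toy_stem(t) for t in female_bio}
stemmed_male = {toy_stem(t) for t in male_bio}
print(stemmed_female ^ stemmed_male)
```

In this sketch, the unstemmed biographies differ in four gender-marked terms, whereas the stemmed versions become identical, leaving a classifier with no lexical evidence to separate them.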

7.2 Impact of stemming elimination

One of the primary goals of the proposed framework is to explore whether retaining morphological elements that carry gender information in some languages, like Spanish, can provide superior classification performance for gender identification in text. According to the results presented in Tables 6 and 8, there is a noticeable increase in classification performance in Spanish for the algorithms considered in our experiments when the stemming step is removed from the data preparation pipeline.

In contrast, results for the English language indicate minimal variation in the \(F_1\) score between applying stemming or not in the data preparation process, for any classification algorithm. Precision and recall show similar outcomes for this language.

Overall, these results support our initial hypothesis that avoiding stemming can positively impact gender identification in texts when content is available in languages with additional morphological elements that carry gender information, like Spanish. Moreover, they confirm indications from prior research about the importance of carefully selecting the most appropriate combination of data preparation tasks, especially when working with languages other than English Uysal and Gunal (2014).

7.3 Limitations

The novel GSVM method, which introduces an early stopping criterion for the training phase, works well for the specific problem of text gender identification. However, further experiments should be conducted to assess the performance of GSVM in other classification tasks before recommending GSVM as a general SVM methodological innovation in machine learning.

As for the proposed framework for text gender identification, an important limitation of our experimental setting is that we circumscribe the target gender variable to a binary choice. Other authors have already raised this issue Hamidi et al. (2018), Keyes (2018). In fact, Keyes (2018) found that 55 out of 58 studies for automated gender identification with machine learning assumed a binary gender output. In this regard, one possible path for future research could be replacing our binary classification algorithms with alternative machine learning models that support continuous and/or multivariate targets. In such contexts, it would be interesting to evaluate whether gender suffixes still provide some advantages for gender identification.

Another limitation of our approach is that input text may not be available in a strongly inflected language other than English. However, in that case, the text could be automatically translated from English to Spanish or other languages with similar properties through automated services. This translated version of the input text could provide extra gender information to improve classification performance. In the same way, further research is needed to confirm that these initial results comparing English and Spanish are replicable with input data in other languages.

8 Conclusions

In this article, we present a new stopping criterion for support vector optimisation algorithms based on the geometrical properties of vector representation of text content to solve the problem of automated gender identification. For this particular problem, we show that the proposed algorithm requires less time for training compared to standard SVM algorithms. Regarding the framework, rather than following conventional normalisation procedures in text mining that eliminate gender affixes, such as stemming, we retain those morphological elements found in strongly inflected languages to improve the performance of gender identification methods. We assess the effectiveness of this new approach in terms of training times by comparing different machine learning algorithms for classification, using a dataset of biographical entries from Wikipedia in English and Spanish.

Our results suggest that avoiding stemming positively influences gender tagging in text for languages that include additional elements incorporating gender information, like Spanish. This procedural change does not impact more gender-neutral languages like English.