1 Introduction

In information retrieval, the similarity between a given user query and each document in a document-term matrix is calculated, and documents with high similarity are returned (Kolda & O'Leary, 1998; Zhang et al., 2011; Al-Qahtani et al., 2015; Guo et al., 2022). Latent semantic analysis (LSA) has been used as a common baseline for information retrieval (Parali et al., 2019; Duan et al., 2021; Chang et al., 2021). Compared to Word2Vec (the Skip-Gram model), LSA showed better performance in extracting relevant semantic patterns in dream reports (Altszyler et al., 2016). LSA also outperformed neural network methods (such as ELMo word embeddings) in text classification tasks for educational data (Phillips et al., 2021).

New methods that rely on LSA have been proposed (Azmi et al., 2019; Gupta & Patel, 2021; Hassani et al., 2021; Suleman & Korkontzelos, 2021; Horasan, 2022; Patil, 2022). For example, Gupta and Patel (2021) proposed an algorithm for text summarization that uses LSA, a TF-IDF keyword extractor, and a BERT encoder model; the algorithm performed better than latent Dirichlet allocation. Horasan (2022) proposed a collaborative filtering-based recommendation system using LSA and achieved good performance. Patil (2022) developed a promising new procedure for information retrieval using LSA and TF-IDF.

Weighting the elements of the raw document-term matrix is a common and effective method to improve the performance of LSA (Dumais, 1991; Horasan et al., 2019; Bacciu et al., 2019). LSA usually involves the singular value decomposition (SVD) of a raw or pre-processed document-term matrix. In addition, Caron (2001) proposed changing the weighting exponent of the singular values in LSA to improve information retrieval, and his results showed that adjusting this exponent indeed improves retrieval performance. Since Caron (2001), singular value weighting exponents have been studied and applied in word embeddings generated from word-context matrices (Bullinaria & Levy, 2012; Österlund et al., 2015; Drozd et al., 2016; Yin & Shen, 2018). Other variants that change the singular value weighting exponent have been studied in word embeddings created by Word2Vec and GloVe (Mu & Viswanath, 2018; Liu et al., 2019).

The larger the weighting exponent of the singular values, the greater the emphasis given to the initial dimensions. According to the experimental results of Caron (2001), giving more emphasis to the initial dimensions can often improve the performance of information retrieval on standard test datasets, whereas it can decrease the performance on question/answer matching. Studies on word embeddings tend to reduce the contribution of the initial dimensions to improve performance (Bullinaria & Levy, 2012; Österlund et al., 2015; Drozd et al., 2016; Yin & Shen, 2018; Mu & Viswanath, 2018; Liu et al., 2019), although the optimal value of the singular value weighting exponent is task dependent (Österlund et al., 2015). Bullinaria and Levy (2012) reported that assigning less weight to the initial dimensions leads to improved performance for TOEFL, distance comparison, semantic categorization, and clustering purity tasks on a word-context matrix created from the ukWaC corpus (Baroni et al., 2009). They argued that the general pattern appears to be that the initial dimensions tend not to contribute the most useful information about semantics and have a large “noise” component that is best removed or reduced.

Capturing associations between documents and terms appears necessary for the success of LSA in computing science; however, the LSA solution is a mix of the associations between documents and terms and marginal effects arising from the lengths of documents and the marginal frequencies of terms (Qi et al., 2023). Hu et al. (2003) and Qi et al. (2023) showed that these margins play an important role in the first dimensions extracted by LSA.

Correspondence analysis (CA) is another information retrieval technique that uses the SVD (Greenacre, 1984; Morin, 2004; Greenacre, 2017; Beh & Lombardo, 2021). In computing science, CA has not been explored as much as LSA; it is usually used to make two-dimensional graphical displays (Hou et al., 2020; Arenas-Márquez et al., 2021; Van Dam et al., 2021). For example, Arenas-Márquez et al. (2021) used a CA biplot to show that the document encoding of a convolutional neural encoder can emphasize the dissimilarity between documents belonging to different classes. Unlike LSA, CA removes the information on marginal frequency differences between documents and between terms from the solution by preprocessing the data, and it focuses only on the relationships between documents and terms (Qi et al., 2023). Thus, CA seems more suitable for information retrieval.

Séguéla and Saporta (2011) and Qi et al. (2023) experimentally compared LSA and CA for text clustering and text categorization, respectively, and found that CA performed better than LSA. Although LSA was originally proposed for information retrieval, an empirical comparison between LSA and CA is still lacking in this field. In this paper, therefore, three English datasets and one Dutch dataset are used to compare the performance of LSA and CA in information retrieval.

Whereas LSA owes part of its popularity to its applicability to differently weighted matrices, in CA it is unusual to weight the elements of the raw document-term matrix: processing the raw document-term matrix is an integral part of CA (Greenacre, 1984, 2017; Beh & Lombardo, 2021), which is based on the SVD of the matrix of standardized residuals. Here, however, we study the CA of document-term matrices whose entries are weighted to see whether this has an impact on the performance of CA. In addition, based on the success of adjusting the weighting exponent of the singular values in LSA, we explore whether this adjustment is also successful in CA.

In summary, this work makes three contributions. First, we compare LSA and CA in information retrieval. Second, we explore whether weightings, both the weighting of the elements of the raw document-term matrix and the adjustment of the singular value weighting exponent, can improve the performance of CA. Third, we study what the initial dimensions of LSA correspond to and whether CA is effective in ignoring the useless information in the raw or pre-processed document-term matrix that contributes a large part of the initial dimensions extracted by LSA. We extensively compare the performances of LSA and CA applied to four datasets using Euclidean distance, dot similarity, and cosine similarity.

The paper is organized as follows. In Section 2, LSA and CA are described in brief. Section 3 presents the methodology used in this paper. The results for Euclidean distance are presented in Section 4, and the results for dot similarity and cosine similarity are presented in Section 5. Finally, Section 6 concludes and discusses the results.

2 LSA and CA

In this section, we briefly describe LSA and CA. We refer the readers to Qi et al. (2023) for a more detailed presentation of the methods.

2.1 LSA

Consider a raw document-term matrix \(\varvec{F} = [f_{ij}]\) with m rows \((i = 1, ...,m)\) and n columns \((j = 1, ..., n)\), where the rows represent documents and the columns represent terms. Weighting might be used to prevent the differential lengths of documents from considerably affecting the representation, or to impose certain preconceptions about which terms are more important (Deerwester et al., 1990). The weighted element \(a_{ij}\) for term j in document i is

$$\begin{aligned} a_{ij} = L(i,j)\times G(j) \times N(i), \end{aligned}$$
(1)

where the local weighting term L(i, j) is the weight of term j in document i, G(j) is the global weight of term j in the entire set of documents, and N(i) is the weighting component for document i. The popular TF-IDF can be written in the form \(L(i,j)=f_{ij}, G(j)=1+\log _2(n\text {docs}/df_j) , N(i)=1\), where ndocs is the number of documents in the set and \(df_j\) is the number of documents where term j appears (Dumais, 1991). The SVD of \(\varvec{A}=[a_{ij}]\) is

$$\begin{aligned} \varvec{A} =\varvec{U}\varvec{\Sigma }\varvec{V}^T \end{aligned}$$
(2)

where \(\varvec{U}^T\varvec{U} = \textbf{I}\), \(\varvec{V}^T\varvec{V} = \textbf{I}\), and \(\varvec{\Sigma }\) is a diagonal matrix with the singular values on the diagonal in descending order. We denote the matrices that contain the first k columns of \(\varvec{U}\), the first k columns of \(\varvec{V}\), and the k largest singular values of \(\varvec{\Sigma }\) by \(\varvec{U}_k\), \(\varvec{V}_k\), and \(\varvec{\Sigma }_k\), respectively. Then, \(\varvec{U}_k\varvec{\Sigma }_k(\varvec{V}_k)^T\) provides the optimal rank-k approximation of \(\varvec{A}\) in a least-squares sense, which shows that the SVD can be used for data reduction. In LSA, the rows of \(\varvec{U}_k\varvec{\Sigma }_k\) and \(\varvec{V}_k\varvec{\Sigma }_k\) provide the coordinates of the row and column points, respectively. Euclidean distances between the rows of \(\varvec{U}_k\varvec{\Sigma }_k\) (\(\varvec{V}_k\varvec{\Sigma }_k\)) approximate those between the rows (columns) of \(\varvec{A}\).
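To make these steps concrete, the following Python sketch applies the TF-IDF weighting of (1), with the L, G, and N components given above, and computes the rank-k LSA coordinates via a truncated SVD. It is a minimal sketch under stated assumptions: the matrix F below is a small hypothetical count matrix, not the data used in this paper.

```python
import numpy as np

def tfidf_weight(F):
    """Weight a raw document-term matrix F (documents x terms) as in Eq. (1)
    with L(i,j) = f_ij, G(j) = 1 + log2(ndocs / df_j), and N(i) = 1."""
    F = np.asarray(F, dtype=float)
    ndocs = F.shape[0]
    df = (F > 0).sum(axis=0)           # document frequency of each term
    G = 1.0 + np.log2(ndocs / df)      # global term weights
    return F * G                       # multiplies column j of F by G(j)

def lsa_coordinates(A, k):
    """Rank-k LSA of a (weighted) matrix A: returns the row coordinates
    U_k Sigma_k, the column coordinates V_k Sigma_k, and the factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T
    return Uk * sk, Vk * sk, Uk, sk, Vk

# Hypothetical 6 x 6 document-term matrix of counts (not Table 1).
F = np.array([[2, 3, 1, 1, 0, 0],
              [3, 1, 2, 2, 0, 0],
              [1, 2, 3, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 3, 2],
              [0, 0, 0, 2, 2, 3]])
doc_coords, term_coords, Uk, sk, Vk = lsa_coordinates(tfidf_weight(F), k=2)
print(doc_coords.shape, term_coords.shape)   # (6, 2) (6, 2)
```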

Representing out-of-sample documents or queries in the k-dimensional subspace of LSA is important for many applications including information retrieval. Suppose that the new weighted document is a row vector \(\varvec{d}\). Since \(\varvec{V}^T\varvec{V} = \textbf{I}\) and \(\varvec{U}^T\varvec{U} = \textbf{I}\), we have

$$\begin{aligned} \begin{aligned} \varvec{A}\varvec{V}_k=\varvec{U}_k\varvec{\Sigma }_k \end{aligned} \end{aligned}$$
(3)

and

$$\begin{aligned} \begin{aligned} \varvec{A}^T\varvec{U}_k=\varvec{V}_k\varvec{\Sigma }_k \end{aligned} \end{aligned}$$
(4)

Therefore, using (3), the coordinates of the out-of-sample document \(\varvec{d}\) in the k-dimensional subspace of LSA are given by \(\varvec{d}\varvec{V}_k\). Similarly, using (4), the coordinates of the out-of-sample term \(\varvec{t}\) (represented as a row vector) in the k-dimensional subspace of LSA are given by \(\varvec{t}\varvec{U}_k\).
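A minimal sketch of this folding-in step is shown below, assuming \(\varvec{U}_k\) and \(\varvec{V}_k\) come from a truncated SVD as above; the matrix A and the query d are hypothetical.

```python
import numpy as np

def fold_in_document(d, Vk):
    """Coordinates of an out-of-sample (weighted) document row vector d,
    Eq. (3): d V_k."""
    return np.asarray(d, dtype=float) @ Vk

def fold_in_term(t, Uk):
    """Coordinates of an out-of-sample term, given as a row vector t of its
    counts over the m documents, Eq. (4): t U_k."""
    return np.asarray(t, dtype=float) @ Uk

# Hypothetical weighted matrix A (documents x terms) and a new query d.
A = np.array([[2., 3., 1., 0.],
              [1., 0., 2., 3.],
              [0., 1., 1., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
d = np.array([1., 2., 0., 1.])
print(fold_in_document(d, Vt[:k, :].T))   # coordinates in the 2-dimensional LSA space
```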

As in Qi et al. (2023), we first use a small dataset to illustrate LSA. This small dataset is introduced in Aggarwal (2018) (see Table 1), and it contains 6 documents. For each document, we are interested in the frequency of occurrence of six terms. The first three documents primarily refer to cats, the last two primarily to cars, and the fourth to both. The fourth term, jaguar, is polysemous because it can refer to either a cat or a car.

Table 1 A document-term matrix \(\varvec{F}\): size 6\(\times \)6

In the LSA of the raw document-term matrix (LSA-RAW), the rows and columns of \(\varvec{F}\) are not weighted, and therefore, we can replace \(\varvec{A}\) in (2) by \(\varvec{F}\). The coordinates of the documents and of the terms for LSA-RAW in the first two dimensions are \(\varvec{U}_2\varvec{\Sigma }_2\) and \(\varvec{V}_2\varvec{\Sigma }_2\), respectively. Figure 1a shows the two-dimensional plot of the documents and terms. The cat terms (lion, cheetah, and tiger) are close together; the car terms (porsche and ferrari) are close together; the car documents (5 and 6) are close together. However, the cat documents (1, 2, and 3) are not close together, document 4 does not lie between the cat documents and the car documents, and jaguar does not lie between the cat terms and the car terms. This can be attributed to the fact that LSA displays both the relationships between documents and terms and the sizes of the documents and terms: regarding the latter, jaguar, for example, is used most often in the documents and is therefore furthest away from the origin.

Fig. 1
figure 1

A two-dimensional plot of documents and terms for (a) LSA-RAW, (b) CA (Qi et al., 2023)

2.2 CA

In CA, an SVD is applied to the matrix of standardized residuals given by Greenacre (2017)

$$\begin{aligned} {\textbf {S }} = {\textbf {D }}_r^{-\frac{1}{2}}({\textbf {P }}-{\textbf {E }}){\textbf {D }}_{c}^{-\frac{1}{2}} \end{aligned}$$
(5)

where \({\textbf {P }} = [p_{ij}]\) is the matrix of joint observed proportions with \(p_{ij} = f_{ij}/\sum _{i}\sum _{j}f_{ij}\), \({\textbf {D }}_r\) is a diagonal matrix with \(r_i = \sum _jp_{ij}\) \((i = 1, 2, \cdots , m)\) on the diagonal, \({\textbf {D }}_c\) is a diagonal matrix with \(c_j = \sum _ip_{ij}\) \((j = 1, 2, \cdots , n)\) on the diagonal, and \({\textbf {E }} = [r_ic_j]\) is the matrix of expected proportions under the statistical independence of the documents and the terms. The elements of \({\textbf {D }}_r^{-\frac{1}{2}}({\textbf {P }}-{\textbf {E }}){\textbf {D }}_{c}^{-\frac{1}{2}}\) are standardized residuals under the statistical independence model. The sum of squares of these elements yields the total inertia, i.e., the Pearson \(\chi ^2\) statistic divided by sample size \(\sum _{i}\sum _{j}f_{ij}\). By taking the SVD of the matrix of standardized residuals, we get

$$\begin{aligned} {\textbf {D }}_r^{-\frac{1}{2}}({\textbf {P }}-{\textbf {E }}){\textbf {D }}_{c}^{-\frac{1}{2}} = {\textbf {U }} \varvec{\Sigma } {\textbf {V }}^T \end{aligned}$$
(6)

In CA, the rows of \(\varvec{\Phi }_k\varvec{\Sigma }_k\) and \(\varvec{\Gamma }_k\varvec{\Sigma }_k\) provide the coordinates of row and column points, respectively, where \(\varvec{\Phi }_k = {\textbf {D }}_r^{-\frac{1}{2}}{} {\textbf {U }}_k\) and \(\varvec{\Gamma }_k = {\textbf {D }}_c^{-\frac{1}{2}}{} {\textbf {V }}_k\). The weighted sum of the coordinates is 0: \(\sum _ir_i\phi _{ik} = 0 = \sum _jc_j\gamma _{jk}\). Euclidean distances between the rows of \(\varvec{\Phi }_k\varvec{\Sigma }_k\) (\(\varvec{\Gamma }_k\varvec{\Sigma }_k\)) approximate \(\chi ^2\)-distances between the rows (columns) of \({\textbf {F }}\), where the squared \(\chi ^2\)-distance between rows k and l is

$$\begin{aligned} \delta _{kl}^2 = \sum _j{\frac{\left( p_{kj}/r_k - p_{lj}/r_l\right) ^2}{c_j}} \end{aligned}$$
(7)

In (7), each row is transformed into a vector of conditional proportions that add up to 1 (for the kth row: \(p_{kj}/r_k\), \(j = 1, 2, \cdots , n\)), and the difference between the transformed rows in column j is corrected for \(c_j\), which represents the size of column j.
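The following Python sketch implements the CA steps of (5)-(7): standardized residuals, their SVD, and the principal coordinates of rows and columns. It assumes a nonnegative count matrix F with no zero row or column; the matrix shown is hypothetical, not Table 1.

```python
import numpy as np

def correspondence_analysis(F, k):
    """CA of a document-term matrix F: principal coordinates of rows/columns."""
    F = np.asarray(F, dtype=float)
    P = F / F.sum()                              # joint proportions p_ij
    r = P.sum(axis=1)                            # row masses r_i
    c = P.sum(axis=0)                            # column masses c_j
    E = np.outer(r, c)                           # expected proportions r_i c_j
    S = (P - E) / np.sqrt(np.outer(r, c))        # standardized residuals, Eq. (5)
    total_inertia = (S ** 2).sum()               # chi-square statistic / sample size
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    Phi_k = U[:, :k] / np.sqrt(r)[:, None]       # Phi_k = D_r^{-1/2} U_k
    Gamma_k = Vt[:k, :].T / np.sqrt(c)[:, None]  # Gamma_k = D_c^{-1/2} V_k
    sk = s[:k]
    return Phi_k * sk, Gamma_k * sk, Phi_k, Gamma_k, sk, total_inertia

# Hypothetical 6 x 6 count matrix; Euclidean distances between the rows of
# row_coords approximate the chi-square distances of Eq. (7).
F = np.array([[2, 3, 1, 1, 0, 0],
              [3, 1, 2, 2, 0, 0],
              [1, 2, 3, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 3, 2],
              [0, 0, 0, 2, 2, 3]])
row_coords, col_coords, Phi_k, Gamma_k, sk, inertia = correspondence_analysis(F, k=2)
```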

The transition formulas are

$$\begin{aligned} {\textbf {D }}_r^{-1}{} {\textbf {P }}\varvec{\Gamma }_k = \varvec{\Phi }_k\varvec{\Sigma }_k \end{aligned}$$
(8)

and

$$\begin{aligned} {\textbf {D }}_c^{-1}{} {\textbf {P }}^T\varvec{\Phi }_k = \varvec{\Gamma }_k\varvec{\Sigma }_k \end{aligned}$$
(9)

Equation (8) shows that the row points lie at weighted averages of the column points, with the rows of \({\textbf {D }}_r^{-1}{\textbf {P }}\) used as weights, and (9) shows that, simultaneously, the column points lie at weighted averages of the row points.

According to (8), a new document \({\textbf {d }}\), represented by a row vector, can be projected onto the k-dimensional subspace by placing it at the weighted average of the column points using \(({\textbf {d }}/\sum _{j=1}^nd_j)\varvec{\Gamma }_k\). This can be done similarly for a new term \({\textbf {t }}\).
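A short sketch of this projection, assuming \(\varvec{\Gamma }_k\) and \(\varvec{\Phi }_k\) come from the CA sketch above; d and t are raw (unweighted) count vectors.

```python
import numpy as np

def ca_project_document(d, Gamma_k):
    """Coordinates of a new document d (row vector of raw term counts):
    its profile d / sum(d) is placed at the weighted average of the
    column points, as in Eq. (8)."""
    d = np.asarray(d, dtype=float)
    return (d / d.sum()) @ Gamma_k

def ca_project_term(t, Phi_k):
    """Analogous projection of a new term t (its raw counts over the m
    documents), following Eq. (9)."""
    t = np.asarray(t, dtype=float)
    return (t / t.sum()) @ Phi_k
```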

For the CA of Table 1, the coordinates of the documents and terms in the first two dimensions are \(\varvec{\Phi }_2\varvec{\Sigma }_2\) and \(\varvec{\Gamma }_2\varvec{\Sigma }_2\), respectively. Figure 1b shows a two-dimensional plot of the documents and terms. The cat terms (lion, cheetah, and tiger) are close together; the car terms (porsche and ferrari) are close together; jaguar lies between the cat and car terms; the car documents (5 and 6) are close together; the cat documents (1, 2, and 3) are close together; and document 4 lies between the cat and car documents. All of these data properties are visible in Fig. 1b. A comparison of Fig. 1b with Fig. 1a suggests that CA provides a clearer visualization of the important aspects of the data than LSA. This is because the coordinates of each dimension are orthogonal to the margins, due to \(\sum _ir_i\phi _{ik} = 0 = \sum _jc_j\gamma _{jk}\), and CA focuses only on the relationship between the documents and the terms.

3 Methodology

In this section, we introduce the CA of a document-term matrix whose entries are weighted. We also discuss how the influence of the initial dimensions can be studied. Subsequently, we describe the study design, datasets, and evaluation methods used.

3.1 CA of a document-term matrix of weighted frequencies

Weighting the entries of the raw document-term matrix is an effective method for improving the performance of LSA, and this motivates us to study the weighting of the elements of the input matrix of CA. So, we try to improve the performance of CA by using the same weighting methods as in LSA.

The processing of the raw data matrix by \({\textbf {D }}_r^{-\frac{1}{2}}({\textbf {P }}-{\textbf {E }}){\textbf {D }}_{c}^{-\frac{1}{2}}\) (see (5)) is considered an integral part of CA. This processing step effectively eliminates the margins, which allows CA to focus on the relationships between documents and terms. The weighting of the entries of the raw document-term matrix in (1), such as by TF-IDF, can be used to assign higher values to terms that are more indicative of the meaning of documents. Thus, weighting the entries of the raw document-term matrix may also be an effective method for improving the performance of CA.

To perform the CA of a document-term matrix of weighted frequencies, we first use (1) to obtain a document-term matrix \({\textbf {A }}\) of weighted frequencies, and then, we perform CA on this matrix \({\textbf {A }}\) instead of \({\textbf {F }}\).

3.2 Changing the contributions of the initial dimensions in SVD

Caron (2001) proposed adjusting the relative strengths of vector components in LSA using \(\varvec{U}_k\varvec{\Sigma }_k^\alpha \) or \(\varvec{V}_k\varvec{\Sigma }_k^\alpha \) as coordinates instead of \(\varvec{U}_k\varvec{\Sigma }_k\) or \(\varvec{V}_k\varvec{\Sigma }_k\), where \(\alpha \) is the singular value weighting exponent that adjusts the importance of the dimensions. The weighting exponent \(\alpha \) determines how components are weighted relative to the standard \(\alpha = 1\) case described in Section 2.1. In comparison to \(\alpha = 1\), \(\alpha < 1\) gives less emphasis to initial dimensions, and \(\alpha > 1\), more emphasis.

Bullinaria and Levy (2012) used both a weighting exponent \(\alpha <1\) and the exclusion of initial dimensions, which led to performance improvements of a similar degree. They argued that the general pattern appears to be that the dimensions with the highest singular values tend not to contribute the most useful information about semantics and have a large “noise” component that is best removed or reduced. However, it is unclear what the initial dimensions actually correspond to. Given this context, we change the contributions of the initial dimensions extracted by both LSA and CA and compare their performances. We explore whether the performance of CA can be improved by adjusting the singular value weighting exponent, using \(\varvec{\Phi }_k\varvec{\Sigma }_k^\alpha \) or \(\varvec{\Gamma }_k\varvec{\Sigma }_k^\alpha \) as coordinates instead of \(\varvec{\Phi }_k\varvec{\Sigma }_k\) or \(\varvec{\Gamma }_k\varvec{\Sigma }_k\). That is, we try to improve the performance of CA by using the method (adjusting the singular value weighting exponent) used in LSA.
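A minimal sketch of the \(\alpha \)-weighted coordinates used below, applicable to both LSA (with \(\varvec{U}_k\) or \(\varvec{V}_k\)) and CA (with \(\varvec{\Phi }_k\) or \(\varvec{\Gamma }_k\)); the factor matrices are assumed to come from the sketches in Section 2.

```python
import numpy as np

def alpha_coordinates(B_k, s_k, alpha):
    """Rescale the k retained dimensions by sigma^alpha instead of sigma.
    B_k is U_k (or V_k) for LSA and Phi_k (or Gamma_k) for CA; s_k holds the
    k largest singular values. alpha = 1 gives the standard coordinates,
    alpha < 1 gives less and alpha > 1 more emphasis to the initial dimensions."""
    return np.asarray(B_k, dtype=float) * np.asarray(s_k, dtype=float) ** alpha

# Example: alpha_coordinates(Uk, sk, 0.5) or alpha_coordinates(Phi_k, sk, 1.5).
```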

Table 2 The \(\sigma ^\alpha \), \(\sigma ^{2\alpha }\), and the proportion of explained \(\alpha \)-inertia \(\sigma ^{2\alpha }/\sum _\sigma \sigma ^{2\alpha }\) for each dimension of LSA-RAW

We use Table 1 to illustrate the impact of \(\alpha \) on singular values and coordinates. We use \(\alpha = 0.5\), \(\alpha = 1\), and \(\alpha = 1.5\). In the literature, we regularly encounter \(\alpha = 0.5\) because it relates to

$$\begin{aligned} \varvec{F} = \varvec{U}\varvec{\Sigma }\varvec{V}^T = \left( \varvec{U}\varvec{\Sigma }^{1/2}\right) \left( \varvec{\Sigma }^{1/2}\varvec{V}^T\right) \end{aligned}$$
(10)

which can then be used for making biplots (Gabriel, 1971) using coordinate pairs \(\varvec{U}_2\varvec{\Sigma }^{1/2}_2\) and \(\varvec{V}_2\varvec{\Sigma }^{1/2}_2\). In practice, one often sees the use of the coordinate pair \(\varvec{U}_2\varvec{\Sigma }_2\) and \(\varvec{V}_2\varvec{\Sigma }_2\); however, this is not a biplot representation as \(\varvec{\Sigma }_2\) is used twice. In a biplot, if the row points are \(\varvec{U}_2\varvec{\Sigma }_2^{a}\), then the column points are \(\varvec{V}_2\varvec{\Sigma }_2^{1-a}\), i.e., any entry of the matrix is approximated by the inner product of the corresponding row and column vectors. Hereafter, we do not make a biplot; instead, we make a symmetric plot where documents and terms have the same value of \(\alpha \) because symmetric coordinates are usually used in experiments (Dumais et al., 1988; Deerwester et al., 1990; Berry et al., 1995; Levy et al., 2015).

Table 2 lists the singular values to the power \(\alpha \): \(\sigma ^\alpha \), the squared singular values to the power \(\alpha \): \(\sigma ^{2\alpha }\), and the proportions \(\sigma ^{2\alpha }/\sum _\sigma \sigma ^{2\alpha }\), where we refer to the total sum of squared singular values to the power \(\alpha \), \(\sum _\sigma \sigma ^{2\alpha }\), as \(\alpha \)-inertia. These proportions show how the sum of the squared Euclidean distances of all points to the origin is distributed over the dimensions. The greater \(\alpha \) is, the more emphasis is given to the initial dimensions and the less to the later ones. The first dimension accounts for 0.623, 0.855, and 0.943 of the \(\alpha \)-inertia, while the fifth dimension accounts for 0.020, 0.001, and 0.000, with \(\alpha \) being 0.5, 1, and 1.5, respectively. The standard LSA solution has \(\alpha = 1\).
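The proportions of explained \(\alpha \)-inertia in Table 2 depend only on the singular values and can be computed as in this small sketch; the singular values below are placeholders, not those of Table 1.

```python
import numpy as np

def alpha_inertia_proportions(s, alpha):
    """Proportion sigma^(2*alpha) / sum(sigma^(2*alpha)) for each dimension."""
    s2a = np.asarray(s, dtype=float) ** (2 * alpha)
    return s2a / s2a.sum()

s = np.array([5.0, 2.5, 1.5, 1.0, 0.5])     # hypothetical singular values
for a in (0.5, 1.0, 1.5):
    print(a, np.round(alpha_inertia_proportions(s, a), 3))
```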

Figure 2 shows the two-dimensional plots of documents and terms for LSA-RAW with \(\alpha = 0.5\) and \(\alpha = 1.5\). The standard coordinates with \(\alpha = 1\) were shown in Fig. 1a. As \(\alpha \) increases, the Euclidean distances between row points (column points) on the first dimension increase relative to those on the second dimension.

Fig. 2
figure 2

A two-dimensional plot of documents and terms for LSA-RAW with (a) \(\alpha = 0.5\) and (b) \(\alpha = 1.5\)

3.3 Design

We compare the performances of LSA and CA for information retrieval, where two kinds of weightings are studied in LSA: the elements of the raw document-term matrix are weighted and the weighting exponent \(\alpha \) is varied. We also explore the impact of these weightings in CA. We vary the number of dimensions k over 1, 2, \(\cdots \), 20, 22, \(\cdots \), 50, 60, \(\cdots \), 100 and the value of \(\alpha \) over -6, -5.5, \(\cdots \), -2, -1.8, \(\cdots \), 4, 4.5, \(\cdots \), 8; we explore all \(40\times 47 = 1,880\) combinations of parameter values.
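For concreteness, the grid of parameter combinations described above can be generated as in the following sketch; the grids themselves are taken from the text.

```python
import numpy as np

ks = list(range(1, 21)) + list(range(22, 51, 2)) + list(range(60, 101, 10))  # 40 values of k
alphas = np.concatenate([np.arange(-6.0, -1.5, 0.5),    # -6, -5.5, ..., -2
                         np.linspace(-1.8, 4.0, 30),    # -1.8, -1.6, ..., 4
                         np.arange(4.5, 8.5, 0.5)])     # 4.5, 5, ..., 8
assert len(ks) == 40 and len(alphas) == 47              # 40 x 47 = 1,880 combinations
grid = [(k, a) for k in ks for a in alphas]
```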

In the study of weighting the elements of the raw document-term matrix, we perform the LSA and CA of

  • raw matrix \({\textbf {F }}\), denoted by RAW,

  • L1 row-normalized matrix \({\textbf {F }}^{L1}\) with \(L(i,j)=f_{ij}\), \(G(j)=1\), and \(N(i)=1/\sum _{j=1}^n{f_{ij}}\), NROWL1,

  • L2 row-normalized matrix \({\textbf {F }}^{L2}\) with \(L(i,j)=f_{ij}\), \(G(j)=1\), and \(N(i)=1/\sqrt{\sum _{j=1}^n{f_{ij}^2}}\), NROWL2, and

  • TF-IDF matrix \({\textbf {F }}^{\text {TF-IDF}}\) described in Section 2.1, TFIDF.

We refer to the combination of CA and the TF-IDF matrix as CA-TFIDF. Similarly, we obtain LSA-RAW, LSA-NROWL1, LSA-NROWL2, LSA-TFIDF, CA-RAW, CA-NROWL1, and CA-NROWL2. For performance comparison, RAW is used for term matching without dimensionality reduction.
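The two row normalizations can be computed as in the following sketch (TF-IDF as in Section 2.1); each of the resulting matrices, as well as the raw matrix, is then decomposed by LSA and by CA.

```python
import numpy as np

def nrow_l1(F):
    """NROWL1: L(i,j) = f_ij, G(j) = 1, N(i) = 1 / sum_j f_ij (row L1 norm)."""
    F = np.asarray(F, dtype=float)
    return F / F.sum(axis=1, keepdims=True)

def nrow_l2(F):
    """NROWL2: L(i,j) = f_ij, G(j) = 1, N(i) = 1 / sqrt(sum_j f_ij^2) (row L2 norm)."""
    F = np.asarray(F, dtype=float)
    return F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))
```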

3.4 Datasets

LSA and CA are compared using three English datasets and one Dutch dataset. The three English datasets are the BBCSport (Greene & Cunningham, 2017), BBCNews (Greene & Cunningham, 2017), and 20 Newsgroups (20-news-18846, bydate version) (Rennie, 2005) datasets. The Dutch dataset is the Wilhelmus dataset (Kestemont et al., 2017). The three English datasets have recently been used in information retrieval studies (Bounabi et al., 2019; Bianco et al., 2023). The Wilhelmus dataset was produced for studying the authorship attribution of the song Wilhelmus, the national anthem of the Netherlands, whose author is unknown.

Some statistics of the four datasets used are presented in Table 3. The BBCNews dataset includes 2,225 documents that fall into one of five categories. The BBCSport dataset includes 731 documents that fall into one of five categories. The 20 Newsgroups dataset includes 18,846 documents that fall into one of 20 categories. This dataset is sorted into a training (60%) and a test (40%) set. We use a subset of this dataset to evaluate information retrieval. We randomly choose 600 documents from the training set of four categories (comp.graphics, rec.sport.hockey, sci.crypt, and talk.politics.guns) and 400 documents from the test set of these four categories. The Wilhelmus dataset includes 186 documents divided into six categories.

Table 3 Characteristics of datasets

To pre-process the three English datasets, we change all characters to lower case; remove punctuation marks, numbers, and stop words; and apply lemmatization. Subsequently, terms with frequencies lower than 10 are ignored. In addition, we remove unwanted parts of the 20 Newsgroups dataset, such as the header (including fields like “From:” and “Reply-To:” followed by an email address), because these are largely irrelevant for information retrieval. The Dutch Wilhelmus dataset is already pre-processed into tag-lemma pairs. Following Kestemont et al. (2017) and Qi et al. (2023), we use the 300 most frequent tag-lemma pairs in the Wilhelmus dataset.
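One possible implementation of such a preprocessing pipeline is sketched below. It is an illustration only, not the authors' exact pipeline: the stop-word list is a placeholder, and lemmatization is delegated to NLTK's WordNetLemmatizer (which requires the WordNet data to be available).

```python
import re
from collections import Counter

import numpy as np
from nltk.stem import WordNetLemmatizer   # assumes nltk and its WordNet data are installed

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}   # placeholder stop-word list

def preprocess(doc, lemmatizer):
    """Lower-case, keep alphabetic tokens only (drops punctuation and numbers),
    remove stop words, and lemmatize."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOP_WORDS]

def build_document_term_matrix(docs, min_total_freq=10):
    """Raw document-term matrix keeping terms whose total corpus frequency
    is at least min_total_freq."""
    lemmatizer = WordNetLemmatizer()
    token_lists = [preprocess(d, lemmatizer) for d in docs]
    totals = Counter(t for tokens in token_lists for t in tokens)
    vocab = sorted(t for t, n in totals.items() if n >= min_total_freq)
    index = {t: j for j, t in enumerate(vocab)}
    F = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, tokens in enumerate(token_lists):
        for t in tokens:
            if t in index:
                F[i, index[t]] += 1
    return F, vocab
```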

Since the Wilhelmus and BBCSport datasets have a relatively low number of documents, we use leave-one-out cross-validation (LOOCV) for the Wilhelmus dataset and five-fold cross-validation for the BBCSport dataset to evaluate LSA and CA (Gareth et al., 2021). The BBCNews dataset is randomly divided into training (80%) and validation (20%) sets.

In the information retrieval part of the study, each document in the validation set is used as a query, where the category of the document is known. The documents in the training set that fall in the same category as the query are the relevant documents for this query.

3.5 Evaluation

We compare the mean average precision (MAP) of each of the four versions of LSA and of CA to explore the performance of these methods in information retrieval under changes in the contributions of the initial dimensions (Kolda & O'Leary, 1998). The MAP is calculated as follows (a code sketch of this procedure is given after the list):

  • The similarity is assessed between a query vector and each document vector of a document collection. We use three similarity metrics: Euclidean distance, dot similarity, and cosine similarity. As Euclidean distance is central to the interpretation of CA, we report full results for Euclidean distance; for dot and cosine similarity, we report only partial results in the main paper and the remaining results in the supplementary materials.

  • For Euclidean distance, the documents are ranked in increasing order of their distance to the query vector (for dot and cosine similarity, in decreasing order of similarity); therefore, the first document is the most similar to the query.

  • Precision-recall points are derived from the ordered list of documents. For a given query, Table 4 defines four types of documents in the ordered list based on whether a document is relevant and whether it is retrieved. Here, \({\textbf {C}}\) is the set of relevant documents in the ordered list, i.e., the documents that fall in the same category as the query, and \({\textbf {D}}\) is the set of retrieved documents from the ordered list, i.e., when 10 documents are returned, the set of retrieved documents consists of the first 10 documents in the ordered list.

    Table 4 Retrieved and relevant documents

    Let \(|.|\) denote the number of documents in a set. Then, precision and recall are defined as

    $$\begin{aligned} \text {precision} = \frac{|{\textbf {C}} \cap {\textbf {D}}|}{|{\textbf {D}}|} \end{aligned}$$
    (11)

    and

    $$\begin{aligned} \text {recall} = \frac{|{\textbf {C}} \cap {\textbf {D}}|}{|{\textbf {C}}|}. \end{aligned}$$
    (12)

    Thus, precision is defined as the ratio of the number of relevant documents retrieved over the total number of retrieved documents, and recall is defined as the ratio of the number of relevant documents retrieved over the total number of relevant documents. For a given query, the set C is fixed. The set D is not fixed; if we return the first i documents, then D consists of the first i documents in the ordered list. Thus, for a given i, we can obtain a precision (see (11)) and recall (see (12)) pair. We run values of i from 1 to l (the number of documents in the ordered list), and obtain l precision-recall pairs.

  • Then, 11 pseudo-precisions are calculated under 11 recalls (0, 0.1, \(\cdots \), 1.0), where a pseudo-precision at recall x is the maximum precision from recall x to recall 1. For example, pseudo-precision at recall 0.2 is the maximum precision from recall 0.2 to recall 1.

  • The average precision for the query is obtained by averaging the 11 pseudo-precisions.

  • The MAP is the mean across all queries.

Greater MAP values indicate a better performance.
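A sketch of this MAP computation is given below. It assumes that, for each query, the documents have already been ordered by similarity to the query (increasing Euclidean distance or decreasing dot/cosine similarity) and that a boolean vector marks which of them are relevant.

```python
import numpy as np

def average_precision_11pt(ranked_relevance):
    """11-point interpolated average precision for one query.
    ranked_relevance[i] is True if the (i+1)-th returned document is relevant."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    n_rel = rel.sum()
    if n_rel == 0:
        return 0.0
    hits = np.cumsum(rel)
    precision = hits / np.arange(1, len(rel) + 1)   # precision when i documents are returned
    recall = hits / n_rel                           # recall when i documents are returned
    # pseudo-precision at recall level x = maximum precision at any recall >= x
    levels = np.linspace(0.0, 1.0, 11)
    pseudo = [precision[recall >= x].max() if (recall >= x).any() else 0.0 for x in levels]
    return float(np.mean(pseudo))

def mean_average_precision(all_ranked_relevance):
    """MAP: the mean of the per-query 11-point average precisions."""
    return float(np.mean([average_precision_11pt(r) for r in all_ranked_relevance]))

# Example: for one query, the 1st, 3rd, and 4th returned documents are relevant.
print(average_precision_11pt([True, False, True, True, False]))
```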

4 Results for Euclidean distance

4.1 Comparing LSA and CA for information retrieval

4.1.1 MAP as a function of the number of dimensions for the four versions of LSA with the standard weighting exponent \(\alpha = 1\) and for CA

We first investigate the performance of LSA and CA in terms of MAP in their standard use, i.e., with the standard weighting exponent \(\alpha = 1\). Term matching without the preliminary use of LSA or CA, i.e., directly on the document-term matrix, is denoted by RAW. We expect that, in line with Qi et al. (2023), the performance of LSA and CA will be better than that of RAW, and the performance of CA will be better than that of the four versions of LSA.

Figure 3 shows MAP as a function of the number of dimensions k for different weighting schemes of LSA, and for CA. We display only the first 20 dimensions, as all lines usually decrease after dimension 20. Figures with dimensionality up to 100 can be found in the supplementary materials. For the four versions of LSA, and for CA, Table 5 presents the dimension number for which the optimal MAP is reached, as well as the MAP values, in each of the four datasets. We conclude the following from Fig. 3 and Table 5:

  • Both LSA and CA result in a better MAP than RAW, which appears as a flat line because the full-dimensional matrix is used and its MAP therefore does not depend on k.

  • For both LSA and CA, performance is a function of the number of dimensions k. Overall, MAP rises as a function of k to reach a peak, and then it goes down. For CA, the peak is reached at \(k = 4\): in the first four dimensions, the information used to calculate MAP increases relative to the noise, whereas in the dimensions \(k \ge 5\), the noise dominates the useful information, which causes the MAP to go down from this point.

  • CA results in a considerably better MAP than the four versions of LSA (LSA-RAW, LSA-NROWL1, LSA-NROWL2, and LSA-TFIDF), which is in line with Qi et al. (2023), who showed that the performance of CA is better than that of LSA for document-term matrices. This is because of the differential treatment of the margins in LSA and CA. The margins provide irrelevant information for making queries. In CA, the margins are removed, and therefore, the relative amount of information in comparison to the noise, which we informally refer to as the information-noise ratio, is considerably larger in CA than in LSA. This explains the better MAP of CA.

  • The peaks for the four versions of LSA are usually found at a higher dimensionality k than the peak for CA. This is because the margins are noise for queries when we fix \(\alpha = 1\), and in LSA this noise plays an important role in the first few dimensions. Hence, the earlier peak of CA is also explained by its better information-noise ratio.

  • The four LSA methods are not equally effective. In all four datasets, the performance of LSA can be improved considerably using weighting schemes, but the improvements over LSA-RAW are data dependent. On average, across the four datasets, LSA-NROWL2 is the best, but for the Wilhelmus dataset, LSA-NROWL1 and LSA-NROWL2 result in a somewhat worse MAP than LSA-RAW.

Fig. 3
figure 3

MAP as a function of the number of dimensions k under standard coordinates

Table 5 MAP with the optimal number of dimensions k

4.1.2 MAP as a function of the weighting exponent \(\alpha \) for LSA compared with MAP for CA under varying numbers of dimensions

In Section 4.1.1, we found that CA outperforms the four versions of LSA in terms of MAP, where LSA had the usual weighting exponent \(\alpha = 1\). In this section, we study whether the performance of LSA-RAW improves when we vary \(\alpha \).

Figure 4 shows MAP as a function of \(\alpha \) for LSA-RAW with the number of dimensions \(k = 4, 6, 9, 12, \text {and }24\). For comparison, we also report the MAP values for CA found in Section 4.1.1 under these dimensions. We choose these values of k because these dimensions are optimal for LSA-RAW and CA in Table 5. Table 6 shows the optimal \(\alpha \) and corresponding MAP, which is a condensed version of Fig. 4. We conclude the following from Fig. 4 and Table 6:

  • Although the performance of LSA-RAW improves by varying \(\alpha \), CA still outperforms LSA-RAW.

  • For LSA-RAW, the overall MAP first increases and then decreases as a function of \(\alpha \). This means that varying \(\alpha \) can potentially improve the performance of LSA-RAW.

  • The increase in MAP is minor. Consider, for example, the BBCNews dataset. In Section 4.1.1, we found that the MAP was optimal with a value of 0.652 for \(\alpha = 1\), when \(k = 6\). Table 6 shows that for \(\alpha = 0.2\), the MAP increases to 0.658. Apparently, for 6 dimensions, the information-noise ratio is optimal in terms of MAP when \(\alpha = 0.2\). For \(\alpha = 0.2\), the distances on the later dimensions (of the 6 dimensions) are increased and those on the initial dimensions are reduced. This means that, with \(\alpha = 0.2\), the impact of the initial dimensions, which are affected most by the margins, is reduced. This is consistent with the results of Bullinaria and Levy (2012), which indicate that reducing the emphasis on the initial dimensions improves performance.

  • Moreover, the optimal \(\alpha \) for LSA-RAW is data dependent and generally increases with k, which replicates the results of Caron (2001). As the number of dimensions varies, the change in the optimal \(\alpha \) results from the information-noise ratio for the specific number of dimensions studied. For example, for the BBCNews dataset, the optimal number of dimensions is 6; for larger numbers of dimensions, the optimal \(\alpha \) increases. An increasing \(\alpha \) indicates that distances on the earlier dimensions are more important for information retrieval, and therefore, the role of the later dimensions is played down.

Fig. 4
figure 4

MAP as a function of \(\alpha \) for LSA-RAW and MAP for CA under varying k

Table 6 MAP with the optimal weighting exponent \(\alpha \) for LSA-RAW and MAP for CA under \(k = 4, 6, 9, 12, \text {and }24\)

4.2 Adjusting CA using weighting

4.2.1 Weighting the elements of the raw document-term matrix for CA

Weighting the elements of the raw document-term matrix is an effective way to improve the performance of LSA for information retrieval. Here, we explore whether this holds for CA. Similar to Fig. 3, Fig. 5 shows MAP as a function of k for different weighting schemes of CA. CA in Fig. 3 is referred to as CA-RAW in Fig. 5; for CA/CA-RAW, the results in these two figures are identical. For the four versions of CA, Table 7 shows the dimensionality for which the optimal MAP is reached, as well as the MAP value. We conclude the following from Fig. 5 and Table 7:

  • Overall, the weighting of the elements of the raw matrix sometimes improves the performance of CA, but these improvements over CA-RAW are small and data dependent.

  • Comparing Table 5 with Table 7, the performance of CA-NROWL1 is better than that of LSA-NROWL1, the performance of CA-NROWL2 is better than that of LSA-NROWL2, and the performance of CA-TFIDF is better than that of LSA-TFIDF.

Relative to LSA, it is harder to improve the performance of CA in information retrieval by weighting the elements of the raw matrix because (1) the MAP of CA-RAW is already relatively high, and (2) CA-RAW already weights the elements of the raw document-term matrix, as this weighting is an integral part of the technique (see (5)).

Fig. 5
figure 5

MAP as a function of the number of dimensions k for the four versions of CA under standard coordinates

Table 7 MAP with the optimal number of dimensions k for the four versions of CA
Fig. 6
figure 6

MAP as a function of \(\alpha \) for CA-RAW under various values of k

4.2.2 MAP as a function of the weighting exponent \(\alpha \) for CA

In this section, we introduce CA with weighting exponent \(\alpha \). Similar to Fig. 4, Fig. 6 shows MAP as a function of \(\alpha \) in CA-RAW for the number of dimensions \(k = 4, 6, 9, 12, \text {and }24\). Table 8 shows the optimal \(\alpha \) and the corresponding MAP, which is a condensed version of Fig. 6. We conclude the following from Fig. 6 and Table 8:

  • For CA, the overall MAP first increases and then decreases as a function of \(\alpha \). This means that varying \(\alpha \) can potentially improve the performance of CA.

  • The increase in MAP by adjusting \(\alpha \) is data and dimension dependent.

  • If we compare the maxima in Table 6 with those in Table 8, there is hardly a noticeable increase.

Now, we examine the optimal \(\alpha \) as Bullinaria and Levy (2012) did. Comparing Table 8 with the LSA-RAW part of Table 6, the optimal \(\alpha \) for CA-RAW is almost always larger than that for LSA-RAW and is almost always larger than 1. That is, CA-RAW needs a larger \(\alpha \) than LSA-RAW to obtain its maximum MAP. Thus, compared to LSA, CA improves by placing more emphasis on its initial dimensions. The important difference between LSA and CA is that LSA involves the margins, whereas CA does not. Therefore, we infer that the margins in LSA contribute considerably to the initial dimensions; however, they are irrelevant (“noise”) for information retrieval. CA, on the other hand, effectively eliminates this irrelevant information.

We study MAP as a function of \(\alpha \) under the optimal number of dimensions. The details including tables and figures are in the supplementary materials. Again, CA performs better than LSA. Adjusting \(\alpha \) can potentially improve the performance of LSA and CA. Although the optimal \(\alpha \) under the optimal number of dimensions is data dependent, the optimal \(\alpha \) of CA is usually considerably larger than that of LSA.

Table 8 MAP with the optimal \(\alpha \) for CA-RAW under \(k = 4, 6, 9, 12, \text {and }24\)

5 Results for dot similarity and cosine similarity

In Section 4, we presented the results where Euclidean distance was used as a measure of similarity. Here, for comparison, we provide results for dot similarity and cosine similarity. Tables and figures for dot similarity and cosine similarity are presented in the supplementary materials.

The results for both dot similarity and cosine similarity lead to conclusions that match those for Euclidean distance. However, cosine similarity leads to a better performance in terms of MAP than Euclidean distance and dot similarity. We presented the results for Euclidean distance in Section 4 because (1) it is more easily interpretable in the context of adjusting the weighting exponent \(\alpha \): as \(\alpha \) increases, Euclidean distances between row points (column points) on the initial dimensions increase relative to those on the later dimensions; and (2) in the literature, Euclidean distance is the preferred way to interpret CA (in fact, we have never seen an interpretation of CA in terms of cosine or dot similarity).

6 Conclusions and discussions

Both LSA and CA make use of the SVD. The main difference between LSA and CA is the matrix that is decomposed by the SVD. In LSA, the decomposed matrix is the weighted matrix A. In CA, the decomposed matrix is the matrix S of standardized residuals, in which the part (P - E) eliminates the marginal effects (Qi et al., 2023), and whose rank is one less than the rank of A. That is why the CA solution only displays the dependence between documents and terms. In LSA, on the other hand, the decomposed matrix also includes the marginal effects, which are usually not relevant for information retrieval.

CA is related to the statistical independence model (Greenacre, 1984). The elements of S display the departure from the marginal products, i.e., the departure from the statistical independence model. The sum of squared elements of S equals the Pearson chi-square statistic divided by the sum of the elements of F. CA decomposes the departure from statistical independence into a number of dimensions using the SVD. LSA, on the other hand, has no connection with the statistical independence model.

In this paper, we compared four versions of LSA (LSA-RAW, LSA-NROWL1, LSA-NROWL2, and LSA-TFIDF) with CA and found that CA always performs better than LSA in terms of MAP. Then, we compared LSA-RAW as a function of the weighting exponent \(\alpha \) with CA under a range of numbers of dimensions. Even though LSA is improved by choosing an appropriate value of \(\alpha \), CA always performed better than LSA.

Next, we applied different weightings of the elements of the raw document-term matrix to CA. We found that weighting the elements of the raw matrix sometimes improves the performance of CA, but the improvements over CA-RAW are small and data dependent. The performance of CA-NROWL1 is better than that of LSA-NROWL1, the performance of CA-NROWL2 is better than that of LSA-NROWL2, and the performance of CA-TFIDF is better than that of LSA-TFIDF. Then, we adjusted the weighting exponent \(\alpha \) in CA. For CA, as a function of \(\alpha \), MAP first increases and then decreases. Adjusting the weighting exponent \(\alpha \) can potentially improve the performance of CA. However, the performance increase obtained by adjusting \(\alpha \) is data and dimension dependent.

Using the standard coordinates of \(\alpha = 1\), for LSA, the Euclidean distances between the rows of coordinates approximate the Euclidean distances between the rows of the decomposed matrix. For CA, the Euclidean distances between the rows of coordinates approximate the \(\chi ^2-\)distances between the rows of the decomposed matrix. \(\alpha < 1\) gives less emphasis to the initial dimensions relative to the standard coordinates. Conversely, \(\alpha > 1\) gives more emphasis to the initial dimensions relative to the standard coordinates. The optimal \(\alpha \) for CA is almost always larger than that for LSA and is almost always larger than 1.

Bullinaria and Levy (2012) argued that the initial dimensions in LSA tend not to contribute the most useful information about semantics and tend to be contaminated by “noise”. The above-mentioned results indicate that CA places more emphasis on the initial dimensions than LSA. The major difference between LSA and CA is that LSA involves the margins but CA does not (Qi et al., 2023). Thus, we infer that the margins contribute considerably to the initial dimensions in LSA. These margins are irrelevant for information retrieval, and CA effectively eliminates this irrelevant information.

In this paper, we focused on the performances of CA and LSA using Euclidean distances. We also performed identical experiments for dot similarity and cosine similarity; both yield results nearly identical to those for Euclidean distance, although cosine similarity performs better than Euclidean distance and dot similarity. We focus on Euclidean distance in the paper because (1) it is more easily interpretable in the context of adjusting \(\alpha \): as \(\alpha \) increases, the Euclidean distances between row points (column points) on the initial dimensions increase relative to those on the later dimensions; and (2) for CA, dot similarity and cosine similarity have never been used before, and therefore, by focusing on Euclidean distances, the results fit better into the existing literature.

Based on theoretical considerations and experimental results, we have the following three suggestions for practical guidance:

  1. 1.

    Use CA instead of LSA under the four kinds of feature extraction: RAW, NROWL1, NROWL2, and TF-IDF; use CA for visualizing data.

  2. 2.

    If information retrieval is the key issue, use cosine similarity instead of Euclidean distance and dot similarity for calculating MAP.

  3. 3.

    If optimal performance in terms of MAP is not of key importance, there is no need to weight the elements of the raw document-term matrix for CA or to optimize the performance over \(\alpha \) for CA, which saves time. Otherwise, these two weightings may be considered potential approaches for improving the performance of CA.

Our finding that CA performs better than LSA for information retrieval is very important for creating next-generation intelligent information systems. Among many other tasks, LSA has been widely used for information retrieval. We expect that the performance of these tasks can be improved by replacing LSA with CA.

In conclusion, CA and LSA are both tools for information retrieval, but the performance of CA is better. In this paper, we tried to further improve CA by weighting the input matrix and by weighting the dimensions; this did not lead to large or consistent improvements in the performance of CA.

Further studies on combining LSA and CA would also be interesting, for example, creating an ensemble voting system that uses the coordinates from both LSA and CA when returning documents for a query. This paper, however, focuses on the comparison of LSA and CA for information retrieval, and other explorations are left for future studies.