Improving information retrieval through correspondence analysis instead of latent semantic analysis

Both latent semantic analysis (LSA) and correspondence analysis (CA) are dimensionality reduction techniques that use singular value decomposition (SVD) for information retrieval. Theoretically, the results of LSA display both the association between documents and terms, and marginal effects; in comparison, CA only focuses on the associations between documents and terms. Marginal effects are usually not relevant for information retrieval, and therefore, from a theoretical perspective CA is more suitable for information retrieval. In this paper, we empirically compare LSA and CA. The elements of the raw document-term matrix are weighted, and the weighting exponent of singular values is adjusted to improve the performance of LSA. We explore whether these two weightings also improve the performance of CA. In addition, we compare the optimal singular value weighting exponents for LSA and CA to identify what the initial dimensions in LSA correspond to. The results for four empirical datasets show that CA always performs better than LSA. Weighting the elements of the raw data matrix can improve CA; however, it is data dependent and the improvement is small. Adjusting the singular value weighting exponent usually improves the performance of CA; however, the extent of the improved performance depends on the dataset and number of dimensions. In general, CA needs a larger singular value weighting exponent than LSA to obtain the optimal performance. This indicates that CA emphasizes initial dimensions more than LSA, and thus, margins play an important role in the initial dimensions in LSA.


Introduction
In information retrieval, a given user query is matched with relevant documents (Al-Qahtani, Amira, & Ramzan, 2015; Guo et al., 2022; Kolda & O'Leary, 1998; Zhang, Yoshida, & Tang, 2011), and a collection of documents is usually represented as a document-term matrix. The similarity between the given user query and each document in the document-term matrix is calculated, and the documents with the highest similarity are returned. Unfortunately, the returned documents may not be relevant, because similarity is calculated based on word matching, and different words may have the same meaning while the same word may have different meanings.
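As a minimal sketch of this word-matching setup, the following ranks documents by cosine similarity to a query over a small hypothetical document-term matrix (the counts are illustrative only, in the spirit of the toy example used later in the paper):

```python
import numpy as np

# Hypothetical toy document-term matrix: rows are documents, columns are
# term counts for ["lion", "tiger", "cheetah", "jaguar", "porsche",
# "ferrari"].  The counts are illustrative only.
F = np.array([
    [2, 2, 1, 2, 0, 0],   # cat document
    [2, 3, 3, 3, 0, 0],   # cat document
    [1, 1, 1, 1, 0, 0],   # cat document
    [2, 2, 2, 3, 1, 1],   # mixed document
    [0, 0, 0, 1, 1, 1],   # car document
    [0, 0, 0, 2, 1, 2],   # car document
], dtype=float)

def cosine_rank(query, docs):
    """Rank documents by cosine similarity to the query, best match first."""
    sims = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)

# A query mentioning "porsche" and "ferrari" once each.
query = np.array([0, 0, 0, 0, 1, 1], dtype=float)
ranking = cosine_rank(query, F)
print(ranking[:3])  # the car documents (indices 4 and 5) rank highest
```

Note that with pure word matching, the three cat documents share no terms with this query and receive similarity 0, which illustrates the vocabulary-mismatch problem described above.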
Latent semantic analysis (LSA) is a variant of the vector space model for information retrieval (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Dumais, 1991; Dumais, Furnas, Landauer, Deerwester, & Harshman, 1988). LSA replaces the original document-term matrix with latent semantic vectors using singular value decomposition (SVD). Previous studies experimentally compared LSA and CA for text clustering and text categorization, respectively, and found that CA performed better than LSA. Although LSA was originally proposed for information retrieval, to the best of our knowledge, an empirical comparison between LSA and CA is still lacking in this field. In this paper, we use four datasets, three English datasets and a Dutch dataset, to compare the performances of LSA and CA in information retrieval.
Weighting the elements of the raw document-term matrix is a common and effective method to improve the performance of LSA (Bacciu et al., 2019; Dumais, 1991; Horasan, Erbay, Varçın, & Deniz, 2019). LSA usually involves the SVD of a raw or pre-processed document-term matrix, and LSA owes part of its popularity to this applicability to different matrices. In CA, it is unusual to weight the elements of the raw document-term matrix, because processing the raw document-term matrix is an integral part of CA (Beh & Lombardo, 2021; Greenacre, 1984, 2017): CA is based on the SVD of the matrix of standardized residuals. Here, we study the CA of document-term matrices whose entries are weighted to see whether this has an impact on the performance of CA.
In addition to weighting the elements of the document-term matrix to improve information retrieval, Caron (2001) proposed changing the weighting exponent of the singular values in LSA. His results showed that adjusting the weighting exponent of the singular values improves the performance of information retrieval. Since Caron (2001), singular value weighting exponents have been widely studied and applied in word embeddings generated from word-context matrices (Bullinaria & Levy, 2012; Drozd, Gladkova, & Matsuoka, 2016; Österlund, Ödling, & Sahlgren, 2015; Yin & Shen, 2018). Further, other variants that change the singular value weighting exponent have been studied in word embeddings created by Word2Vec and GloVe (Liu, Ungar, & Sedoc, 2019; Mu & Viswanath, 2018). To the best of our knowledge, there is no study of CA in which the weighting exponent of the singular values is varied. Based on the success of adjusting the weighting exponent of the singular values in LSA, we explore whether this is also successful in CA.
The larger the weighting exponent of the singular values, the more emphasis is given to the initial dimensions. The standard weighting exponent is 1.
According to the experimental results of Caron (2001), giving more emphasis to the initial dimensions can often improve the performance of information retrieval on standard test datasets, whereas it can decrease the performance on question/answer matching. Papers on word embeddings tend to reduce the contribution of the initial dimensions to improve performance (Bullinaria & Levy, 2012; Drozd et al., 2016; Liu et al., 2019; Mu & Viswanath, 2018; Österlund et al., 2015; Yin & Shen, 2018), although the optimal value of the singular value weighting exponent is task dependent (Österlund et al., 2015). Bullinaria and Levy (2012) reported that assigning less weight to the initial dimensions leads to improved performance for TOEFL, distance comparison, semantic categorization, and clustering purity tasks on a word-context matrix created from the ukWaC corpus (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009). They argued that the general pattern appears to be that the initial dimensions tend not to contribute the most useful information about semantics and have a large "noise" component that is best removed or reduced. However, it remains unclear what the initial dimensions correspond to. If the initial dimensions correspond to marginal differences between documents and between terms, then CA may be the optimal procedure because it ignores this irrelevant information.
In summary, this work makes three contributions. First, we compare LSA and CA in information retrieval. The results show that CA yields better performance than LSA; thus, CA may be used as a common baseline and be widely applied in information retrieval, text mining, and natural language processing. Second, we explore whether weightings, including the weighting of the elements of the raw document-term matrix and the adjustment of the singular value weighting exponent, can improve the performance of CA. Third, we study what the initial dimensions of LSA correspond to and whether CA is effective in ignoring the useless information in the raw or pre-processed document-term matrix that contributes a large part of the initial dimensions extracted by LSA. We extensively compare the performances of LSA and CA applied to four datasets using Euclidean distance, dot similarity, and cosine similarity.
The remainder of this paper is organized as follows. In Section 2, LSA and CA are described in brief. Section 3 presents the methodology used in this paper. The results for Euclidean distance are presented in Section 4, and the results for dot similarity and cosine similarity are presented in Section 5. Finally, Section 6 concludes and discusses the results.

LSA and CA
In this section, we briefly describe LSA and CA. We refer the reader to Qi et al. (2023, in press, Natural Language Engineering) for a more detailed presentation of the methods.

LSA
Consider a raw document-term matrix F = [f_ij] with m rows (i = 1, ..., m) and n columns (j = 1, ..., n), where the rows represent documents and the columns represent terms.
Weighting might be used to prevent the differential lengths of documents from considerably affecting the representation, or to impose certain preconceptions about which terms are more important (Deerwester et al., 1990). The weighted element a_ij for term j in document i is

a_ij = L(i, j) × G(j) / N(i), (1)

where the local weighting term L(i, j) is the weight of term j in document i, G(j) is the global weight of term j in the entire set of documents, and N(i) is the weighting component for document i. The choices L(i, j) = f_ij, G(j) = 1, and N(i) = 1 result in the raw document-term matrix F. The popular TF-IDF can be written in this form with L(i, j) = f_ij, G(j) = log(ndocs/df_j), and N(i) = 1, where ndocs is the number of documents in the set and df_j is the number of documents in which term j appears (Dumais, 1991).
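A minimal sketch of the weighting in Equation (1), instantiated as one common TF-IDF form (the exact variant used in Dumais (1991) may differ); the matrix `F` is a hypothetical toy example:

```python
import numpy as np

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

def tfidf(F):
    """One common TF-IDF form: a_ij = f_ij * log(ndocs / df_j),
    i.e. L(i, j) = f_ij, G(j) = log(ndocs / df_j), N(i) = 1."""
    ndocs = F.shape[0]
    df = np.count_nonzero(F, axis=0)   # number of documents containing term j
    return F * np.log(ndocs / df)

A = tfidf(F)
# A term that occurs in every document (here the 4th term) gets weight 0.
```

On this tiny example the fourth term appears in all six documents, so its entire column is zeroed out, which illustrates why the choice of weighting interacts with the data at hand.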
The SVD of A is

A = U Σ V^T, (2)

where U^T U = I, V^T V = I, and Σ is a diagonal matrix with the singular values on the diagonal in descending order. We denote the matrices that contain the first k columns of U, the first k columns of V, and the k largest singular values of Σ by U_k, V_k, and Σ_k, respectively.
Then, U_kΣ_k(V_k)^T provides the optimal rank-k approximation of A in a least-squares sense, which shows that SVD can be used for data reduction. In LSA, the rows of U_kΣ_k and V_kΣ_k provide the coordinates of the row and column points, respectively. Euclidean distances between the rows of U_kΣ_k (V_kΣ_k) approximate those between the rows (columns) of A (Deerwester et al., 1990; Parhizkar, 2013).
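The truncated SVD and the resulting LSA coordinates can be sketched in numpy as follows (the toy matrix is hypothetical, with illustrative counts):

```python
import numpy as np

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(F, full_matrices=False)   # F = U diag(s) V^T
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T

A_k = Uk @ Sk @ Vk.T        # optimal rank-k approximation in least squares
doc_coords = Uk @ Sk        # coordinates of the document (row) points
term_coords = Vk @ Sk       # coordinates of the term (column) points
```

By the Eckart–Young theorem, the Frobenius-norm error of `A_k` equals the square root of the sum of the discarded squared singular values.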
Representing out-of-sample documents or queries in the k-dimensional subspace of LSA is important for many applications, including information retrieval. Suppose that the new weighted document is a row vector d. Since V^T V = I and U^T U = I, we have

A V_k = U_k Σ_k (3)

and

A^T U_k = V_k Σ_k. (4)

Therefore, using Equation (3), the coordinates of the out-of-sample document d in the k-dimensional subspace of LSA are d V_k. Similarly, using Equation (4), the coordinates of the out-of-sample term t (represented as a row vector) in the k-dimensional subspace of LSA are t U_k.
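This folding-in step can be sketched as follows; the sanity check relies on the identity F V_k = U_k Σ_k, so a training document folds in to its own coordinates (toy matrix as before, hypothetical query vector):

```python
import numpy as np

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T

# Fold a new (weighted) document row vector d into the k-dimensional space.
d = np.array([1, 1, 0, 1, 0, 0], dtype=float)   # hypothetical cat-like query
d_coords = d @ Vk

# Sanity check: folding in all training documents reproduces U_k Sigma_k.
folded = F @ Vk
```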
As in Qi et al. (2023, in press, Natural Language Engineering), we first use a small dataset to illustrate LSA. This small dataset is introduced in Aggarwal (2018) (see Table 1) and contains 6 documents. For each document, we are interested in the frequency of occurrence of six terms. The first three documents primarily refer to cats, the last two primarily to cars, and the fourth to both. The fourth term, jaguar, is polysemous because it can refer to either a cat or a car.
In the LSA of the raw document-term matrix (LSA-RAW), the rows and columns of F are not weighted, and therefore, we can replace A in Equation (2) by F. The coordinates of the documents and of the terms for LSA-RAW in the first two dimensions are U_2Σ_2 and V_2Σ_2, respectively. Figure 1a shows the two-dimensional plot of the documents and terms. Cat terms (lion and tiger) are close together; car terms (jaguar, porsche, and ferrari) are close together; car documents (5 and 6) are close together. However, the cat documents (1, 2, and 3) are not close together, document 4 does not lie in between the cat and car documents, and jaguar does not lie in between the cat and car terms. This can be attributed to the fact that LSA displays both the relationships between documents and terms and the sizes of the documents and terms: for the latter, jaguar, for example, is used most often in the documents and is furthest away from the origin.

CA
In CA, an SVD is applied to the matrix of standardized residuals (Greenacre, 2017)

S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}, (5)

where P = [p_ij] is the matrix of joint observed proportions, with p_ij = f_ij divided by the grand total of F, r = P1 and c = P^T 1 are the vectors of row and column margins, and D_r and D_c are diagonal matrices with r and c on their diagonals. In CA, the rows of Φ_kΣ_k and Γ_kΣ_k provide the coordinates of the row and column points, respectively, where

Φ_k = D_r^{-1/2} U_k and Γ_k = D_c^{-1/2} V_k. (6)

The weighted sum of the coordinates is 0: Σ_i r_i φ_ik = 0 = Σ_j c_j γ_jk. Euclidean distances between the rows of Φ_kΣ_k (Γ_kΣ_k) approximate χ²-distances between the rows (columns) of F, where the squared χ²-distance between rows k and l is

d²(k, l) = Σ_j (p_kj/r_k - p_lj/r_l)² / c_j. (7)

In Equation (7), the rows are transformed into vectors of conditional proportions adding up to 1 for each row, such as the kth row: p_kj/r_k, j = 1, 2, ..., n, and the differences between the column elements for column j in the transformed rows are corrected for c_j, which represents the size of column j.
The transition formulas are

Φ_kΣ_k = D_r^{-1} P Γ_k (8)

and

Γ_kΣ_k = D_c^{-1} P^T Φ_k. (9)

Equation (8) shows that the row points are weighted averages of the column points, with the rows of D_r^{-1}P as weights, and Equation (9) shows that, simultaneously, the column points are weighted averages of the row points, with the rows of D_c^{-1}P^T as weights.
According to Equation (8), a new document d, represented by a row vector, can be projected onto the k-dimensional subspace by placing it in the weighted average of the column points using (d / Σ_{j=1}^{n} d_j) Γ_k. This can be done similarly for a new term t. For the CA of Table 1, the coordinates of the documents and terms in the first two dimensions are Φ_2Σ_2 and Γ_2Σ_2, respectively. Figure 1b shows a two-dimensional plot of the documents and terms. Cat terms (lion and tiger) are close together; car terms (jaguar, porsche, and ferrari) are close together; jaguar is in between the cat and car terms; car documents (5 and 6) are close together; cat documents (1, 2, and 3) are close together; and document 4 is in between the cat and car documents. All data properties are found in Figure 1b. A comparison of Figures 1b and 1a suggests that CA provides a clearer visualization of the important aspects of the data than LSA. This is because the coordinates of each dimension are orthogonal to the margins, due to Σ_i r_i φ_ik = 0 = Σ_j c_j γ_jk, and CA focuses only on the relationship between the documents and the terms.
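The CA computations above, from standardized residuals to principal coordinates and the projection of a new document, can be sketched in numpy (hypothetical toy matrix with illustrative counts):

```python
import numpy as np

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

P = F / F.sum()                          # joint observed proportions
r = P.sum(axis=1)                        # row margins
c = P.sum(axis=0)                        # column margins
Dr_isq = np.diag(1 / np.sqrt(r))
Dc_isq = np.diag(1 / np.sqrt(c))

S = Dr_isq @ (P - np.outer(r, c)) @ Dc_isq   # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

Phi = Dr_isq @ U                         # standard row coordinates
Gam = Dc_isq @ Vt.T                      # standard column coordinates
k = 2
row_coords = (Phi * s)[:, :k]            # principal row coordinates Phi_k Sigma_k
col_coords = (Gam * s)[:, :k]            # principal column coordinates Gam_k Sigma_k

# Project a new document as a weighted average of the column points.
d = np.array([0, 0, 0, 1, 1, 1], dtype=float)    # hypothetical car-like query
d_coords = (d / d.sum()) @ Gam[:, :k]
```

The weighted sums of the coordinates are zero by construction, and projecting the row profiles of F through the transition formula recovers the principal row coordinates.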

Methodology
In this section, we introduce the CA of a document-term matrix whose entries are weighted.
We also discuss how the influence of the initial dimensions can be studied. Subsequently, we describe the study design, datasets, and evaluation methods used.

CA of a document-term matrix of weighted frequencies
Weighting the entries of the raw document-term matrix is an effective method for improving the performance of LSA, and this motivates us to study the weighting of the elements of the input matrix of CA.
The processing of the raw data matrix by its margins (see Equation (5)) is considered an integral part of CA. This processing step effectively eliminates the margins, which allows CA to focus on the relationships between documents and terms. The weighting of the entries of the raw document-term matrix in Equation (1), such as by TF-IDF, can be used to assign higher values to terms that are more indicative of the meaning of documents.
Thus, the weighting of the entries of the raw document-term matrix may also be an effective method for improving the performance of CA.
To perform the CA of a document-term matrix of weighted frequencies, we first use Equation (1) to obtain a document-term matrix A of weighted frequencies, and then we perform CA on this matrix A instead of F.
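This weighted-CA variant can be sketched by reusing one CA routine on either F or a weighted matrix A; here we use L2 row normalization as the weighting (denoted NROWL2 in the experiments below) on a hypothetical toy matrix. Other weightings from Equation (1) drop in the same way, provided no row or column of A becomes entirely zero:

```python
import numpy as np

def ca_coords(M, k=2):
    """Principal row and column coordinates from the CA of a nonnegative matrix M."""
    P = M / M.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    Dr_isq, Dc_isq = np.diag(1 / np.sqrt(r)), np.diag(1 / np.sqrt(c))
    U, s, Vt = np.linalg.svd(Dr_isq @ (P - np.outer(r, c)) @ Dc_isq,
                             full_matrices=False)
    return (Dr_isq @ U * s)[:, :k], (Dc_isq @ Vt.T * s)[:, :k]

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

# Weight the entries first (L2 row normalization), then run the same CA.
A = F / np.linalg.norm(F, axis=1, keepdims=True)
rows_w, cols_w = ca_coords(A)
rows_raw, cols_raw = ca_coords(F)   # CA of the raw matrix, for comparison
```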

Changing the contributions of the initial dimensions in SVD
Caron (2001) proposed adjusting the relative strengths of the vector components in LSA by raising the singular values to a power: the row and column coordinates become U_kΣ_k^α and V_kΣ_k^α, where α is the singular value weighting exponent that adjusts the importance of the dimensions. The weighting exponent α determines how components are weighted relative to the standard case α = 1 described in Section 2.1. In comparison to α = 1, α < 1 gives less emphasis to the initial dimensions, and α > 1 more emphasis. Bullinaria and Levy (2012) used both a weighting exponent α < 1 and the exclusion of initial dimensions, which led to performance improvements of a similar degree. They argued that the general pattern appears to be that the dimensions with the highest singular values tend not to contribute the most useful information about semantics and have a large "noise" component that is best removed or reduced. However, it is unclear what the initial dimensions actually correspond to. Given this context, we change the contributions of the initial dimensions extracted by both LSA and CA and compare their performances.
We use Table 1 to illustrate the impact of α on the singular values and coordinates, with α = -0.5, 0, 0.5, 1, and 1.5. In the literature, we regularly encounter α = 0.5 because it yields the coordinate pair U_2Σ_2^{0.5} and V_2Σ_2^{0.5}, which can be used for making biplots (Gabriel, 1971). Biplots represent the rows and columns of a matrix in the same graphic representation, and the other often-used coordinate pairs are U_2Σ_2 and V_2, and U_2 and V_2Σ_2. In the pair U_2Σ_2 and V_2, the Euclidean distances between the rows of U_2Σ_2 approximate the Euclidean distances between the rows of F, and in the pair U_2 and V_2Σ_2, the Euclidean distances between the rows of V_2Σ_2 approximate the Euclidean distances between the columns of F. The pair U_2Σ_2^{0.5} and V_2Σ_2^{0.5} provides a compromise between these two solutions while still representing matrix F in a graphic representation.
In practice, one often sees the use of the coordinate pair U_2Σ_2 and V_2Σ_2; however, this is not a biplot representation, as Σ_2 is used twice. In a biplot, if the row points are U_2Σ_2^a, then the column points are V_2Σ_2^{1-a}, i.e., any entry of the matrix is approximated by the inner product of the corresponding row and column vectors. Hereafter, we do not make a biplot; instead, we make a symmetric plot where the documents and terms have the same value of α, because symmetric coordinates are usually used in experiments (Berry, Dumais, & O'Brien, 1995; Deerwester et al., 1990; Dumais et al., 1988; Levy et al., 2015).
Table 2 lists the singular values to the power α (σ^α), the squared singular values to the power α (σ^{2α}), and the proportions σ^{2α} / Σ σ^{2α}, where we refer to the total sum of the squared singular values to the power α, Σ σ^{2α}, as the α-inertia. These proportions show how the sum of the Euclidean distances of all components to the origin is distributed over the components. The greater α is, the more emphasis is given to the initial components. When α is negative, the initial components have less emphasis than the later components. For example, for σ^{-0.5}, the first dimension accounts for 0.017 of the α-inertia, whereas the fifth dimension accounts for 0.536. When α is 0, all components of the coordinates have the same emphasis, and because the rank of F is 5, each dimension accounts for 0.2 of the α-inertia. As α increases, there is more emphasis on the initial components of the coordinates and less emphasis on the later ones. The first dimension accounts for 0.623, 0.855, and 0.943 of the α-inertia, while the fifth dimension accounts for 0.020, 0.001, and 0.000, for α equal to 0.5, 1, and 1.5, respectively. The standard LSA solution has α = 1.
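The α-inertia proportions and the α-weighted symmetric coordinates can be computed generically as follows (hypothetical toy matrix; the printed proportions are computed from it, not copied from Table 2):

```python
import numpy as np

# Hypothetical toy document-term matrix (counts are illustrative only).
F = np.array([
    [2, 2, 1, 2, 0, 0],
    [2, 3, 3, 3, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [2, 2, 2, 3, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 1, 2],
], dtype=float)

s_all = np.linalg.svd(F, compute_uv=False)
s = s_all[s_all > 1e-10]         # keep the nonzero singular values (rank of F)

def alpha_inertia(s, alpha):
    """Proportion of alpha-inertia per dimension: sigma^(2*alpha) / their sum."""
    sa2 = s ** (2 * alpha)
    return sa2 / sa2.sum()

for a in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(a, np.round(alpha_inertia(s, a), 3))

# Symmetric alpha-weighted coordinates in k dimensions.
U, sv, Vt = np.linalg.svd(F, full_matrices=False)
alpha, k = 0.5, 2
doc_coords = U[:, :k] * sv[:k] ** alpha
term_coords = Vt[:k].T * sv[:k] ** alpha
```

For α = 0 the proportions are uniform over the nonzero dimensions, and the share of the first dimension grows monotonically with α, matching the pattern described above.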

Design
We compare the performances of LSA and CA for information retrieval, where two kinds of weightings are studied in LSA: the elements of the raw document-term matrix are weighted, and the weighting exponent α is varied. We also explore the impact of these weightings in CA. Finally, we study the mean average precision (MAP) as a function of α under an optimal number of dimensions for LSA and CA. We vary the number of dimensions k. In the study of weighting the elements of the raw document-term matrix, we perform the LSA and CA of the
• raw matrix F, denoted by RAW,
• L1 row-normalized matrix, denoted by NROWL1,
• L2 row-normalized matrix, denoted by NROWL2,
• TF-IDF matrix F_TF-IDF described in Section 2.1, denoted by TFIDF.

Datasets
LSA and CA are compared using three English datasets and one Dutch dataset. The three English datasets are the BBCSport (Greene & Cunningham, 2006), BBCNews (Greene & Cunningham, 2006), and 20 Newsgroups (20-news-18846, bydate version) (Rennie, 2005) datasets. The Dutch dataset is the Wilhelmus dataset (Kestemont, Stronks, De Bruin, & Winkel, 2017). Some statistics of the four datasets are presented in Table 3. The BBCNews dataset includes 2,225 documents that fall into one of five categories. The BBCSport dataset includes 731 documents that fall into one of five categories. The 20 Newsgroups dataset includes 18,846 documents that fall into one of 20 categories; this dataset is sorted into a training (60%) and a test (40%) set. We use a subset of this dataset to evaluate information retrieval: we randomly choose 600 documents from the training set of four categories (comp.graphics, rec.sport.hockey, sci.crypt, and talk.politics.guns) and 400 documents from the test set of these four categories. The Wilhelmus dataset includes 186 documents divided into six categories.
To pre-process the three English datasets, we change all characters to lower case; remove punctuation marks, numbers, and stop words; and apply lemmatization. Subsequently, terms with frequencies lower than 10 are ignored. In addition, we remove unwanted parts of the 20 Newsgroups dataset, such as the header (including fields like "From:" and "Reply-To:" followed by an email address), because these are almost irrelevant for information retrieval. The Dutch Wilhelmus dataset is already pre-processed into tag-lemma pairs.
Since the Wilhelmus and BBCSport datasets have a relatively low number of documents, we use leave-one-out cross-validation (LOOCV) for the Wilhelmus dataset and five-fold cross-validation for the BBCSport dataset to evaluate LSA and CA (Gareth, Daniela, Trevor, & Robert, 2021).The BBCNews dataset is randomly divided into training (80%) and validation (20%) sets.
In the information retrieval part of the study, each document in the validation set is used as a query, where the category of the document is known.The documents in the training set that fall in the same category as the query are the relevant documents for this query.

Evaluation
We compare the MAP of each of the four versions of LSA and CA to explore the performance of these methods in information retrieval under changes in the contributions of the initial dimensions (Kolda & O'Leary, 1998). The MAP is calculated as follows:
• The similarity is assessed between a query vector and each document vector of a document collection. We use three similarity metrics: Euclidean distance, dot similarity, and cosine similarity. As Euclidean distance is a key motivation for CA, we report the results for Euclidean distance, and only report partial results for dot and cosine similarity in the main paper; the other results appear in the supplementary materials.
• For Euclidean distance, the documents are ranked in increasing order of their distance to the query vector (for dot and cosine similarity, the ranking is in decreasing order of similarity); therefore, the first document has the highest similarity.
• Precision-recall points are derived from the ordered list of documents. For a given query, Table 4 defines four types of documents in the ordered list based on whether a document is relevant and retrieved: C = the set of relevant documents from the ordered list, i.e., documents that fall in the same category as the query; D = the set of retrieved documents from the ordered list, i.e., when 10 documents are returned, the set of retrieved documents consists of the first 10 documents in the ordered list.
Thus, precision is defined as the ratio of the number of relevant documents retrieved to the total number of retrieved documents, and recall is defined as the ratio of the number of relevant documents retrieved to the total number of relevant documents. For a given query, the set C is fixed. The set D is not fixed; if we return the first i documents, then D consists of the first i documents in the ordered list. Thus, for a given i, we can obtain a precision (see Equation (11)) and recall (see Equation (12)) pair. We run values of i from 1 to l (the number of documents in the ordered list) and obtain l precision-recall pairs.
• Then, 11 pseudo-precisions are calculated at the 11 recall levels (0, 0.1, ..., 1.0), where the pseudo-precision at recall x is the maximum precision from recall x to recall 1. For example, the pseudo-precision at recall 0.2 is the maximum precision from recall 0.2 to recall 1.
• The average precision for the query is obtained by averaging the 11 pseudo-precisions.
• The MAP is the mean across all queries.
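The steps above can be sketched as follows (the relevance flags in the usage example are hypothetical):

```python
import numpy as np

def average_precision_11pt(relevant_flags, n_relevant):
    """11-point interpolated average precision for a single query.

    relevant_flags: relevance (True/False) of each document in the ordered
    list, best-ranked first.  n_relevant: total number of relevant documents.
    """
    rel = np.asarray(relevant_flags, dtype=bool)
    hits = np.cumsum(rel)
    precision = hits / np.arange(1, len(rel) + 1)
    recall = hits / n_relevant
    pseudo = []
    for level in np.linspace(0.0, 1.0, 11):
        mask = recall >= level
        # Pseudo-precision at this recall level: the maximum precision
        # attained at any recall >= level.
        pseudo.append(precision[mask].max() if mask.any() else 0.0)
    return float(np.mean(pseudo))

# One hypothetical query: 3 relevant documents, ranked list of 5.
ap = average_precision_11pt([True, False, True, True, False], n_relevant=3)

# MAP is then the mean of this average precision over all queries, e.g.
# map_score = np.mean([average_precision_11pt(f, n) for f, n in queries])
```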
Greater MAP values indicate better performance. We expect the performance of LSA and CA to be better than that of RAW, and the performance of CA to be better than that of the four versions of LSA.

Results for Euclidean distance
Figure 3 shows MAP as a function of the number of dimensions k for the different weighting schemes of LSA, and for CA. We display only the first 20 dimensions, as all lines usually decrease after dimension 20. Figures with dimensionality up to 100 can be found in the supplementary materials. For the four versions of LSA, and for CA, Table 5 presents the dimension number for which the optimal MAP is reached, as well as the MAP values, in each of the four datasets. We conclude the following from Figure 3 and Table 5:
• Both LSA and CA result in a better MAP than RAW, which results in a straight line because the full-dimensional matrix is used.
• For both LSA and CA, performance is a function of the number of dimensions k.
Overall, MAP rises as a function of k to reach a peak, and then it goes down. For CA, the peak is reached at k = 4. In CA, the information used to calculate MAP increases over the first four dimensions relative to the noise. For dimensions k ≥ 5, the noise dominates the useful information, which causes the MAP to go down from this point.
• CA results in a considerably better MAP than the four versions of LSA, which is in line with Qi et al. (2023, in press, Natural Language Engineering), who showed that the performance of CA is better than that of LSA for document-term matrices. This is because of the differential treatment of margins in LSA and CA. The margins provide irrelevant information for making queries. In CA, the margins are removed, and therefore, the relative amount of information in comparison to the noise, which we informally refer to as the information-to-noise ratio, is considerably larger in CA than in LSA. This explains the better MAP of CA.
• The peaks for the four versions of LSA are usually found at a higher dimensionality k than the peak for CA. This is because the margins are noise for queries when we fix α = 1; in LSA, this noise plays an important role in the first few dimensions. Hence, the earlier peak in CA is also explained by its better information-to-noise ratio.
• The four LSA methods are not equally effective. In all four datasets, the performance of LSA can be significantly improved using weighting schemes. The improvements over LSA-RAW are data dependent. On average, across the four datasets, LSA-NROWL2 is the best, but for the Wilhelmus dataset, LSA-NROWL1 and LSA-NROWL2 result in a somewhat worse MAP than LSA-RAW.

CA under varying numbers of dimensions
In Section 4.1.1, we found that CA outperforms the four versions of LSA in terms of MAP, where LSA had the usual weighting exponent α = 1. In this section, we study whether the performance of LSA-RAW improves when we vary α.
Figure 4 shows MAP as a function of α for LSA-RAW with the number of dimensions k = 4, 6, 9, 12, and 24. For comparison, we also report the MAP values for CA found in Section 4.1.1 under these dimensions. We choose these values of k because these dimensions are optimal for LSA-RAW and CA in Table 5. Table 6 shows the optimal α and the corresponding MAP, and is a condensed version of Figure 4. We conclude the following from Figure 4 and Table 6:
• Although the performance of LSA-RAW improves by varying α, CA still outperforms LSA-RAW.
• For LSA-RAW, the overall MAP first increases and then decreases as a function of α.
This means that varying α can potentially improve the performance of LSA-RAW.
• The increase in MAP is minor. Consider, for example, the BBCNews dataset. In Section 4.1.1, we found that the MAP was optimal with a value of 0.652 for α = 1, when k = 6. Table 6 shows that for α = 0.2, the MAP increases to 0.658. Apparently, for 6 dimensions, when α = 0.2, the information-to-noise ratio is optimal in terms of MAP. For α = 0.2, the distances on the later dimensions (of the 6 dimensions) are increased and those on the initial dimensions are reduced. This means that, with α = 0.2, the impact of the initial dimensions, which are affected most by the margins, is reduced. This is consistent with the results of Bullinaria and Levy (2012), which indicate that reducing the contribution of the initial dimensions improves performance.
• Moreover, the optimal α for LSA-RAW is data dependent and generally increases with k. This replicates the results of Caron (2001). As the number of dimensions varies, the change in the optimal α is the result of the information-to-noise ratio for the specific number of dimensions studied. For example, for the BBCNews dataset, the optimal number of dimensions is 6; for larger numbers of dimensions, the optimal α increases. An increasing α indicates that distances on the earlier dimensions are more important for information retrieval, and therefore, the role of the later dimensions is played down.

Weighting the elements of the raw document-term matrix is an effective way to improve the performance of LSA for information retrieval. Here, we explore whether this holds for CA. Similar to Figure 3, Figure 5 shows MAP as a function of k for the different weighting schemes of CA. CA in Figure 3 is referred to as CA-RAW in Figure 5; for CA/CA-RAW, the results in these two figures are identical. For the four versions of CA, Table 7 shows the dimensionality for which the optimal MAP is reached, as well as the MAP value. We conclude the following from Figure 5 and Table 7:
• Overall, the weighting of the elements of the raw matrix sometimes improves the performance of CA, but these improvements over CA-RAW are small and data dependent.
Relative to LSA, it is harder to improve the performance of CA in information retrieval by weighting the elements of the raw matrix because (1) the MAP of CA-RAW is already relatively high, and (2) CA-RAW already weights the elements of the raw document-term matrix, as this is an integral part of the technique (Equation (5)).

Next, we adjust the weighting exponent α in CA. Table 8 shows the optimal α and the corresponding MAP, and is a condensed version of Figure 6. We conclude the following from Figure 6 and Table 8:
• For CA, the overall MAP first increases and then decreases as a function of α. This means that varying α can potentially improve the performance of CA.
• The increase in MAP by adjusting α is data and dimension dependent.
• If we compare the maxima in Table 6 with those in Table 8, there is hardly a noticeable increase. Comparing Table 8 with the LSA-RAW part of Table 6, the optimal α for CA-RAW is almost always larger than that for LSA-RAW and is almost always larger than 1. That is, CA-RAW needs a larger α than LSA-RAW to obtain its maximum MAP. Thus, compared to LSA, CA improves by placing more emphasis on its initial dimensions. The important difference between LSA and CA is that LSA involves margins and CA does not. Therefore, we infer that the margins in LSA contribute considerably to the initial dimensions; however, they are irrelevant ("noise") for information retrieval. CA, on the other hand, effectively eliminates this irrelevant information.

Finally, we study MAP as a function of α under the optimal number of dimensions k. Table 9 shows the optimal α, the optimal k, and the corresponding MAP, which is a condensed version of Figure 7. Based on Figure 7 and Table 9, we can see that
• CA methods are always better than the LSA methods and term-matching methods under the optimal α and optimal number of dimensions k.
• Weighting the elements of the raw document-term matrix under the optimal number of dimensions k can improve the performance of CA; however, the improvements are small and data dependent.
• Similar to the dimensions k = 4, 6, 9, 12, and 24, MAP as a function of α under the optimal number of dimensions k also first increases and then decreases. Thus, adjusting α can potentially improve the performance of LSA and CA.
• For the different datasets, the optimal α under the optimal number of dimensions k is very different. In contrast to LSA, CA needs a greater α to reach the optimal performance under the optimal number of dimensions k. This illustrates that CA places more emphasis on the initial dimensions than LSA.

Results for dot similarity and cosine similarity
In Section 4, we presented the results where Euclidean distance was used as a measure of similarity. Here, for comparison, we provide results for dot similarity and cosine similarity.
Figure 7: MAP as a function of α under the optimal number of dimensions.
To save space, we only show the tables corresponding to Euclidean distance (Table 5), dot similarity (Table 10), and cosine similarity (Table 11). The remaining tables and figures for dot similarity and cosine similarity are presented in the supplementary materials.
The results for both dot similarity and cosine similarity lead to conclusions that match those for Euclidean distance. However, cosine similarity leads to better performance in terms of MAP than Euclidean distance and dot similarity. We displayed the results for Euclidean distance in Section 4 because (1) it is more easily interpretable in the context of adjusting the weighting exponent α: as α increases, the Euclidean distances between row points (column points) on the initial dimensions increase relative to the later dimensions; and (2) in the literature, Euclidean distance is the preferred way to interpret CA (in fact, we have never seen an interpretation of CA in terms of cosine or dot similarity).
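For reference, the three similarity measures compared here can be computed from the reduced coordinates as in the sketch below; the function and variable names are illustrative, and documents are ranked by decreasing similarity in each case.

```python
import numpy as np

def similarities(q, D):
    """Similarities between a query vector q (in the reduced space)
    and document coordinate rows D (n_docs x k).

    Returns negated Euclidean distance (so that larger = more
    similar for all three measures), dot similarity, and cosine
    similarity."""
    euclid = -np.linalg.norm(D - q, axis=1)
    dot = D @ q
    cosine = dot / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)
    return euclid, dot, cosine
```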

Conclusions and discussions
We compared four versions of LSA with CA and found that CA always performs better than LSA in terms of MAP. Then, we compared LSA-RAW as a function of the weighting exponent α with CA over a range of numbers of dimensions. Even though LSA is improved by choosing an appropriate value of α, CA always performed better than LSA.
Next, we applied different weightings of the elements of the raw document-term matrix to CA. We found that weighting the elements of the raw matrix sometimes improves the performance of CA, but the improvements over CA-RAW are small and data dependent. Then, we adjusted the weighting exponent α in CA. For CA, as a function of α, MAP first increases and then decreases. Adjusting the weighting exponent α can therefore potentially improve the performance of CA; however, the performance gain obtained by adjusting α is data and dimension dependent.
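As an illustration of element weighting, the sketch below applies a common tf-idf weighting to a raw document-term count matrix before the SVD step; the paper's exact weighting schemes may differ, and the function name is illustrative.

```python
import numpy as np

def tfidf_weight(F):
    """Weight a raw document-term count matrix F (n_docs x n_terms)
    by inverse document frequency: W_ij = F_ij * log(N / df_j)."""
    n_docs = F.shape[0]
    df = (F > 0).sum(axis=0)                  # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))  # guard against zero df
    return F * idf
```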
Using the standard coordinates (α = 1), for LSA the Euclidean distances between the rows of coordinates approximate the Euclidean distances between the rows of the decomposed matrix. For CA, the Euclidean distances between the rows of coordinates approximate the χ²-distances between the rows of the decomposed matrix. A value α < 1 gives less emphasis to the initial dimensions relative to the standard coordinates; conversely, α > 1 gives more emphasis to the initial dimensions. The optimal α for CA is almost always larger than that for LSA and is almost always larger than 1.
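The effect of α can be stated compactly. Writing the SVD of the decomposed matrix as $S = U \Sigma V^{\top}$ (a sketch: in CA, $S$ is the matrix of standardized residuals and $D_r$ the diagonal matrix of row masses, while in LSA $D_r$ is the identity), the row coordinates used for retrieval are

$$
R(\alpha) = D_r^{-1/2}\, U\, \Sigma^{\alpha},
$$

so dimension $s$ is scaled by $\sigma_s^{\alpha}$. Since $\sigma_1 \ge \sigma_2 \ge \cdots$, choosing $\alpha > 1$ stretches the initial dimensions relative to the later ones, while $\alpha < 1$ compresses them.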
Finally, we studied MAP as a function of α under the optimal number of dimensions.
Again, CA performs better than LSA. Although the optimal α under the optimal number of dimensions is data dependent, the optimal α of CA is usually considerably larger than that of LSA. Bullinaria and Levy (2012) argued that the initial dimensions in LSA tend not to contribute the most useful information about semantics and tend to be contaminated by "noise".
The above-mentioned results indicate that CA places more emphasis on the initial dimensions than LSA. The major difference between LSA and CA is that LSA involves margins but CA does not (Qi et al., 2023, in press, Natural Language Engineering). Thus, we infer that margins contribute considerably to the initial dimensions in LSA. These margins are irrelevant for information retrieval, and CA effectively eliminates this irrelevant information.
In this paper, we focused on the performance of CA and LSA using Euclidean distances. We also performed identical experiments for dot similarity and cosine similarity.
Both yield results nearly identical to those for Euclidean distance, although cosine similarity performs better than Euclidean distance and dot similarity. We focus on Euclidean distance in this paper because (1) it is more easily interpretable in the context of adjusting α: as α increases, the Euclidean distances between row points (column points) on the initial dimensions increase relative to the later dimensions; and (2) for CA, dot similarity and cosine similarity have never been used before, and therefore, by focusing on Euclidean distances, the results fit better into the existing literature.
Based on the experimental results and analysis, we have the following three suggestions for practical guidance:
1. Use CA instead of LSA; use CA for visualizing data.
2. If information retrieval is the key issue, use cosine similarity instead of Euclidean distance and dot similarity for calculating MAP.
3. If optimal performance in terms of MAP is not of key importance, there is no need to weight the elements of the raw document-term matrix for CA or to optimize the performance over α for CA, which saves time. Otherwise, these two weightings may be considered potential approaches for improving the performance of CA.
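For completeness, MAP as used above can be sketched as follows; this is a standard formulation of mean average precision, not the paper's code, and the names are illustrative.

```python
import numpy as np

def average_precision(ranked_doc_ids, relevant):
    """Average precision for one query: the mean of precision@rank
    over the ranks at which relevant documents are retrieved."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_doc_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevance):
    """MAP over queries: rankings[q] is the ranked document list for
    query q, and relevance[q] is the set of its relevant documents."""
    return float(np.mean([average_precision(rankings[q], relevance[q])
                          for q in rankings]))
```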
There exist various applications of LSA.In future work, it will be interesting to attempt to improve these applications by replacing LSA with CA.

Figure 1 :
Figure 1: A two-dimensional plot of documents and terms for (a) LSA-RAW, (b) CA.
the diagonal, and E = [r_i c_j] is the matrix of expected proportions under the statistical independence of the documents and the terms. The sum of squares of these elements yields the total inertia, i.e., the Pearson χ² statistic divided by the sample size N. By taking the SVD of the matrix of standardized residuals
as a function of the number of dimensions for the four versions of LSA with the standard weighting exponent α = 1 and for CA. We first investigate the performance of LSA and CA in terms of MAP in their standard use, i.e., without varying the weighting exponent α (α = 1). Term matching without the preliminary use of LSA and CA, i.e., directly on the document-term matrix, is denoted by RAW. We expect that, in line with Qi et al. (2023, in press, Natural Language Engineering),

Figure 3 :
Figure 3: MAP as a function of the number of dimensions k under standard coordinates.

Figure 4 :
Figure 4: MAP as a function of α for LSA-RAW and MAP for CA under varying k.

Figure 5 :
Figure 5: MAP as a function of the number of dimensions k for the four versions of CA under standard coordinates.

Figure 6 :
Figure 6: MAP as a function of α for CA-RAW under various values of k.

Figure 7
Figure 7 shows the MAP as a function of α under the optimal number of dimensions. Similar to Section 4.1.1 in the context of α = 1, we can obtain the corresponding optimal k (not shown in the figure) and the corresponding MAP (shown in the figure) for each α.
Figure A.1: MAP as a function of the number of dimensions k under standard coordinates.
Figure B.4: MAP as a function of α for CA-RAW under various values of k.
Figure C.2: MAP as a function of α for LSA-RAW and MAP for CA under various values of k.

Table 1 :
A document-term matrix F (size 6 × 6) with the terms lion, tiger, cheetah, jaguar, porsche, ferrari.

Table 3 :
Characteristics of datasets.

Table 4 :
Retrieved and relevant documents. Let |·| denote the number of documents in a set. Then, precision and recall are defined as precision = |retrieved ∩ relevant| / |retrieved| and recall = |retrieved ∩ relevant| / |relevant|.

Table 5 :
MAP with the optimal number of dimensions k. Bold values are best.

Table 6 :
MAP with the optimal weighting exponent α for LSA-RAW and MAP for CA under k = 4, 6, 9, 12, and 24. Bold values are best.

Table 7 :
MAP with the optimal number of dimensions k for the four versions of CA. Bold values are best.

.2 MAP as a function of the weighting exponent α for CA
In this section, we introduce CA with the weighting exponent α. Similar to Figure 4, Figure 6 shows MAP as a function of α in CA-RAW for the numbers of dimensions k = 4, 6, 9, 12, and 24.

Table 9 :
MAP under the optimal α and the optimal dimension k. Bold values are best within group; underlined values are best overall.

Table 10 :
MAP with the optimal number of dimensions k for dot similarity. Bold values are best.

Table 11 :
MAP with the optimal number of dimensions k for cosine similarity. Bold values are best.
Table B.4: MAP under the optimal α and the optimal dimension k for dot similarity. Bold values are best within group; underlined values are best overall.