Abstract
The statistical methods discussed in the previous chapters are applicable mainly in the exploratory phase (also known as the descriptive phase) of an analysis. However, data exploration is more dynamic and interactive than simple data description. It uses multivariate statistics to obtain visualizations or groupings of elements that can be either whole texts or items within texts. It looks for associations and structures as well as interesting summaries.
References
Early discriminant analyses were carried out on biometric and anthropometric measurements by the statisticians Fisher (1936) and Mahalanobis (1936), who were attempting to predict membership in ethnic groups from measurements of the skeleton. They were the first to use the technique sometimes known as linear discriminant analysis, which is one of the oldest methods and also one of the most commonly used today.
The second version of the work of Mosteller and Wallace (1984) also contains a general panorama of attempts at authorship attribution.
Cf. the pioneering work of Palermo and Jenkins (1964). Cf. also Bouroche and Curvalle (1974).
This trait does not exclude the possibility of concentrating on such units in some areas. Think, for example, of the important role played by the tool words "for" and "against" in the analysis of political texts.
Cf. for example the work of Radday (1974) and Morton (1963) concerning the homogeneity of the book of Isaiah.
Today such an operation cannot be totally computerized. Important progress has been made in the realm of automatic syntactic analysis of texts, as shown, for example, by the ongoing improvements in spelling correction found in most text processors.
Note that isolating tool words requires categorizing the words of a text and removing ambiguities (as with, for example, the word "even"). In addition, some expressions containing full words serve as substitutes for function words (e.g., "in fact"), which a preliminary lemmatization might obscure.
Of course, this level of precision is misleading, because for some plays, there are entire paragraphs missing in certain editions, whereas disagreements as to the identity of words still exist for the text parts that are common to all sources.
The local Mahalanobis distance of point $X$ to group $k$, which is used in quadratic discriminant analysis, is written $d_k(X) = (X - m_k)' S_k^{-1} (X - m_k)$, where $S_k$ is the internal covariance matrix of group $k$ with mean point (center of gravity) $m_k$ (see Anderson, 1984; MacLachlan, 1992).
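The distance in this note can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the covariance matrix is estimated with an n - 1 divisor (an assumption, since the note does not specify one), and the quadratic form is evaluated by solving a linear system rather than inverting the covariance matrix explicitly.

```python
from typing import List, Sequence


def group_mean(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean point (center of gravity) m_k of a group of observations."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def group_covariance(rows: Sequence[Sequence[float]],
                     mean: Sequence[float]) -> List[List[float]]:
    """Internal covariance matrix S_k (assumption: n - 1 divisor)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                cov[a][b] += d[a] * d[b]
    return [[c / (n - 1) for c in row] for row in cov]


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (m[i][n] - s) / m[i][i]
    return y


def local_mahalanobis(x: Sequence[float],
                      rows: Sequence[Sequence[float]]) -> float:
    """d_k(X) = (X - m_k)' S_k^{-1} (X - m_k) for the group given by rows."""
    m_k = group_mean(rows)
    s_k = group_covariance(rows, m_k)
    diff = [xi - mi for xi, mi in zip(x, m_k)]
    y = solve(s_k, diff)  # y = S_k^{-1} (X - m_k)
    return sum(d * yi for d, yi in zip(diff, y))


# Hypothetical group of four points in the plane, for illustration only.
group = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
distance_at_center = local_mahalanobis((1.0, 1.0), group)
distance_off_center = local_mahalanobis((3.0, 1.0), group)
```

Because each group has its own $S_k$, the same point can be close to one group and far from another even when it is equally distant from both centers in ordinary Euclidean terms.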
Cf. Lachenbruch and Mickey, 1968; Stone, 1974; Geisser, 1975.
This survey was instigated by the Institute of Research on Urban Life (a Japanese research institute sponsored by Tokyo Gas Company Ltd) under the direction of H. Akuto (cf. Akuto, 1992).
Details of the analyses in the three countries are given in Akuto (1992) and Akuto and Lebart (1992).
The latinization of Japanese writing introduced a confusion that would not have occurred with Chinese characters (Kanjis). The latinized graphical form SAKE, for example, designates rice wine as well as salmon in our coding scheme. This type of distortion of basic information only takes away a little from the richness of the aggregated lexical profiles, as the reader will be able to judge.
Recall that the test-value converts a critical probability into a standardized normal variable, for easier readability: the value 1.96 corresponds to the two-tailed threshold 0.05, whereas the value 5.51 corresponds to a probability on the order of $10^{-6}$.
Another calculation mode consists in characterizing a response by the mean test-value of the forms which it contains. This criterion, which favors concise responses, was not used here (cf. chapter 6, section 6.2).
Copyright information
© 1998 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Lebart, L., Salem, A., Berry, L. (1998). Textual Discriminant Analysis. In: Exploring Textual Data. Text, Speech and Language Technology, vol 4. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-1525-6_9
DOI: https://doi.org/10.1007/978-94-017-1525-6_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-4942-1
Online ISBN: 978-94-017-1525-6
eBook Packages: Springer Book Archive