Textual Discriminant Analysis

Lebart, Ludovic; Salem, André; Berry, Lisette

doi:10.1007/978-94-017-1525-6_9

Part of the book series: Text, Speech and Language Technology (TLTB, volume 4)

Abstract

The statistical methods discussed in the previous chapters are applicable mainly in the exploratory phase (also known as the descriptive phase) of an analysis. However, data exploration is more dynamic and interactive than simple data description. It uses multivariate statistics to obtain visualizations or groupings of elements that can be either whole texts or items within texts. It looks for associations and structures as well as interesting summaries.

References

Early discriminant analyses were carried out on biometric and anthropometric measures by the statisticians Fisher (1936) and Mahalanobis (1936), who were attempting to predict membership of ethnic groups from measurements of the skeleton. They were the first to use the technique sometimes known as linear discriminant analysis: it is one of the oldest methods, and also one of the most commonly used today.
The second version of the work of Mosteller and Wallace (1984) also contains a general panorama of attempts at authorship attribution.
Cf. the pioneering work of Palermo and Jenkins (1964). Cf. also Bouroche and Curvalle (1974).
This trait does not exclude the possibility of concentrating on such units in some areas. Think, for example, of the important role played by the tool words "for" and "against" in the analysis of political texts.
Cf., for example, the work of Radday (1974) and Morton (1963) concerning the homogeneity of the book of Isaiah.
Today such an operation cannot be totally computerized. Important progress has been made in the realm of automatic syntactic analysis of texts, as shown, for example, by the ongoing improvements in spelling correction found in most text processors.
Note that although isolating tool words requires categorizing a text and removing its ambiguities (as with, for example, the word "even"), some expressions that substitute for function words contain full words (e.g., "in fact"), and a preliminary lemmatization might obscure them.
Of course, this level of precision is misleading, because for some plays, there are entire paragraphs missing in certain editions, whereas disagreements as to the identity of words still exist for the text parts that are common to all sources.
The local Mahalanobis distance of point X to group k, which is used in quadratic discriminant analysis, is written $d_k(X) = (X - m_k)' S_k^{-1} (X - m_k)$, where $S_k$ is the internal covariance matrix of group k with mean point (center of gravity) $m_k$ (see Anderson, 1984; MacLachlan, 1992).
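The distance in this note can be sketched numerically as follows. This is only an illustration under stated assumptions: the function names, the toy groups, and the use of NumPy's unbiased covariance estimate (divisor n - 1) are choices made here, not taken from the chapter. The sketch assigns a point to the group with the smallest local distance and leaves out the log-determinant and prior terms that a full quadratic discriminant rule would normally add.

```python
import numpy as np


def local_mahalanobis(x, group):
    """Return d_k(x) = (x - m_k)' S_k^{-1} (x - m_k) for one group.

    `group` holds one observation per row. S_k is estimated with
    np.cov (divisor n - 1), which is an assumption of this sketch.
    """
    group = np.asarray(group, dtype=float)
    x = np.asarray(x, dtype=float)
    m_k = group.mean(axis=0)                          # center of gravity of group k
    s_k = np.atleast_2d(np.cov(group, rowvar=False))  # internal covariance of group k
    diff = x - m_k
    # Solve S_k z = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(s_k, diff))


def nearest_group(x, groups):
    """Return the label of the group with the smallest local distance, plus all distances."""
    distances = {label: local_mahalanobis(x, obs) for label, obs in groups.items()}
    return min(distances, key=distances.get), distances


# Hypothetical toy data: two well-separated groups in the plane.
toy_groups = {
    "A": [[0, 0], [1, 0], [0, 1], [1, 1]],
    "B": [[4, 4], [5, 4], [4, 5], [5, 5]],
}
label, distances = nearest_group([1.5, 0.5], toy_groups)
```

Because each group uses its own covariance matrix, the same Euclidean offset can yield very different distances for a tight group and a diffuse one, which is what makes the resulting decision boundaries quadratic rather than linear.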
Cf. Lachenbruch and Mickey, 1968; Stone, 1974; Geisser, 1975.
This survey was instigated by the Institute of Research on Urban Life (a Japanese research institute sponsored by Tokyo Gas Company Ltd) under the direction of H. Akuto (cf. Akuto, 1992).
Details of the analyses in the three countries are given in Akuto (1992) and Akuto and Lebart (1992).
The latinization of Japanese writing introduced a confusion that would not have occurred with Chinese characters (Kanjis). The latinized graphical form SAKE, for example, designates rice wine as well as salmon in our coding scheme. This type of distortion of basic information only takes away a little from the richness of the aggregated lexical profiles, as the reader will be able to judge.
Recall that the test-value converts a critical probability into a standardized normal variable, for easier readability: the value 1.96 corresponds to the two-tailed threshold 0.05, whereas the value 5.51 corresponds to a probability on the order of $10^{-6}$.
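The conversion described in this note can be sketched with Python's standard library. The function names are hypothetical, and the two-tailed convention is assumed throughout because that is the convention the note uses for the 1.96 threshold.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_test_value(p):
    """Return t >= 0 such that P(|Z| > t) = p for a standard normal Z."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return _STANDARD_NORMAL.inv_cdf(1.0 - p / 2.0)


def p_from_test_value(t):
    """Return the two-tailed probability P(|Z| > |t|)."""
    return 2.0 * (1.0 - _STANDARD_NORMAL.cdf(abs(t)))


threshold = p_to_test_value(0.05)      # close to 1.96
probability = p_from_test_value(1.96)  # close to 0.05
```

Working on the standardized scale means that values from different tests can be compared directly: a larger test-value always indicates a smaller critical probability.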
Another calculation mode consists in characterizing a response by the mean test-value of the forms which it contains. This criterion, which favors concise responses, was not used here (cf. chapter 6, section 6.2).
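A minimal sketch of this alternative criterion follows. The whitespace tokenization, the treatment of forms without a recorded test-value as contributing 0, and the toy values are all assumptions made for illustration; the chapter does not specify them here.

```python
def mean_test_value(response, form_test_values):
    """Average the test-values of the forms in a response.

    Forms absent from `form_test_values` contribute 0.0 (an assumption
    of this sketch); an empty response scores 0.0.
    """
    forms = response.lower().split()
    if not forms:
        return 0.0
    return sum(form_test_values.get(form, 0.0) for form in forms) / len(forms)


# Hypothetical test-values of two forms for one group of respondents.
toy_values = {"price": 4.0, "quality": 2.0}

short_score = mean_test_value("price", toy_values)                     # 4.0
long_score = mean_test_value("price quality and service", toy_values)  # 1.5
```

In this toy case the one-word response scores higher than the longer one that contains the same characteristic form, which illustrates why an averaging criterion tends to favor concise responses.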

Author information

Authors and Affiliations

Centre National de la Recherche Scientifique, Paris, France
Ludovic Lebart
Université de la Sorbonne Nouvelle, Paris, France
André Salem
L. Berry Associates, Inc., New York, USA
Lisette Berry

About this chapter

Cite this chapter

Lebart, L., Salem, A., Berry, L. (1998). Textual Discriminant Analysis. In: Exploring Textual Data. Text, Speech and Language Technology, vol 4. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-1525-6_9

DOI: https://doi.org/10.1007/978-94-017-1525-6_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-4942-1
Online ISBN: 978-94-017-1525-6
eBook Packages: Springer Book Archive