Abstract
The previous two parts of this book considered statistical universals of language. Sequences were input to specific analysis methods to examine the behavior of words or characters, and the resulting phenomena were studied from the two viewpoints of the population and the sequence. As shown by the thick rightward arrow in Fig. 1.1, Parts II and III studied language corpora to reveal these statistical universals.
Notes
- 1.
The description of pronunciation here follows that in Harris (1955).
- 2.
Note that there are other, far less common possibilities such as “The United States of Tara,” the name of a television series, and misspelled words (see Sect. 17.3).
- 3.
Section 21.9 explains the relations among various measures of complexity through the notion of generalized entropy, including the successor count and the Shannon entropy.
- 4.
\(g(c_i^{i+n-1})\) in this chapter is designed to follow Harris's concept closely. An alternative is to use \({\mathrm{H}}(X_{i+n}|X_i^{i+n-1} = c_i^{i+n-1})\), whose relation with \({\mathrm{H}}(X_n|X_1^{n-1})\) is clearer. The overall experimental result should not differ.
- 5.
This maximum value was chosen after testing some larger values in the original work (Tanaka-Ishii and Jin 2008). Language has long memory, as reported in Part III, but for this specific task of articulation using Harris's scheme, n = 10 was sufficient to obtain maximum performance.
- 6.
This figure appeared in Tanaka-Ishii and Jin (2008). For clarity, the figure shows only part of the experimental results. In the experiment, an entropy shift was verified starting from every phoneme for a length of 10, as mentioned in the main text.
- 7.
The denotation of phonemes by capital letters here follows the CMU Pronouncing Dictionary (The Speech Group at CMU, 1998).
- 8.
The experiment involved a threshold parameter k such that a candidate point at j > i was regarded as a border when \(g(c_i^{j}) - g(c_i^{j-1}) \geq k\). This k value was then varied, and F-scores were acquired for all thresholds. The value k = 1.6 gave the best results.
- 9.
This figure also appeared in Tanaka-Ishii and Jin (2008). As detailed in the original paper, the rightmost point has slightly different precision and recall values from those given in the text, because the threshold value k was varied as mentioned in the previous footnote, and this graph shows the result for a slightly different k.
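The claim in note 3 that the successor count and the Shannon entropy are both instances of a generalized entropy can be illustrated with Rényi's family (a standard formulation; whether Sect. 21.9 uses exactly this notation is an assumption here):

```latex
% Rényi entropy of order \alpha over the successor probabilities p_i
H_\alpha(X) = \frac{1}{1-\alpha}\,\log_2 \sum_i p_i^{\alpha}
% \alpha = 0:      H_0 = \log_2 |\{i : p_i > 0\}|   (log of the successor count)
% \alpha \to 1:    H_1 = -\sum_i p_i \log_2 p_i     (the Shannon entropy)
```

Thus varying the order \(\alpha\) interpolates between simply counting the distinct successors and weighting them by probability.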
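Harris's scheme as described in these notes, in which a prefix grows symbol by symbol, its branching entropy g is monitored, and a border is posited wherever the entropy rises by at least a threshold k, can be sketched as follows. This is an illustrative toy implementation over characters rather than phonemes; the function names, the toy text, and the use of the conditional-entropy variant from note 4 are assumptions, not the original code.

```python
from collections import Counter
from math import log2

def branching_entropy(text, prefix):
    """Shannon entropy of the symbol that follows `prefix` in `text`."""
    succ = Counter(text[i + len(prefix)]
                   for i in range(len(text) - len(prefix))
                   if text[i:i + len(prefix)] == prefix)
    total = sum(succ.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in succ.values())

def detect_borders(text, start, max_n=10, k=1.6):
    """Posit a border at position j when the branching entropy of the
    growing prefix text[start:j] rises by at least k over the previous
    prefix length (cf. notes 5 and 8 for max_n and k)."""
    borders = []
    prev = branching_entropy(text, text[start:start + 1])
    for j in range(start + 2, min(start + max_n, len(text))):
        g = branching_entropy(text, text[start:j])
        if g - prev >= k:
            borders.append(j)
        prev = g
    return borders
```

On a toy sequence such as "the cat the dog the cow the pig", the entropy drops inside "the" (the next character is predictable) and jumps at the following space, where many continuations become possible, so a border is detected after the word once the threshold is lowered to suit the tiny sample.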
References
Creutz, Mathias and Lagus, Krista (2002). Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30.
Frantzi, Katerina T. and Ananiadou, Sophia (1996). Extracting nested collocations. In Proceedings of the 16th International Conference on Computational Linguistics, pages 41–46.
Goldwater, Sharon, Griffiths, Thomas L., and Johnson, Mark (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112, 21–54.
Hafer, Margaret A. and Weiss, Stefan F. (1974). Word segmentation by letter successor varieties. Information Storage and Retrieval, 10, 371–385.
Harris, Zellig S. (1955). From phoneme to morpheme. Language, 31(2), 190–222.
Harris, Zellig S. (1968). Mathematical Structures of Language. Interscience Publishers (John Wiley & Sons).
Harris, Zellig S. (1988). Language and Information. Columbia University Press.
Huang, Jin-Hu and Powers, David (2003). Chinese word segmentation based on contextual entropy. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pages 152–158.
Kempe, André (1999). Experiments in unsupervised entropy-based corpus segmentation. In EACL 1999: CoNLL-99 in Computational Natural Language Learning, pages 7–13.
Martinet, André (1960). Eléments de linguistique générale. Armand Colin.
Nobesawa, Shiho, Tsutsumi, Junya, Jiang, Sun D., Sano, Tomohisa, Sato, Kengo, and Nakanishi, Masakazu (1996). Segmenting sentences into linky strings using d-bigram statistics. In Proceedings of the 16th International Conference on Computational Linguistics, pages 586–591.
Saffran, Jenny R. (2001). Words in a sea of sounds: The output of infant statistical learning. Cognition, 81, 149–169.
Tanaka-Ishii, Kumiko and Ishii, Yuichiro (2007). Multilingual phrase-based concordance generation in real-time. Information Retrieval, 10, 275–295.
Tanaka-Ishii, Kumiko and Jin, Zhihui (2008). From phoneme to morpheme: Another verification in English and Chinese using corpora. Studia Linguistica, 62(2), 224–248.
The Speech Group at CMU (1998). The CMU pronouncing dictionary version 0.6. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
© 2021 The Author(s)
Tanaka-Ishii, K. (2021). Articulation of Elements. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_11
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics