Abstract
The previous two parts of this book considered statistical universals of language. Sequences were input to specific analysis methods to examine the behavior of words or characters, and the resulting phenomena were studied from the two viewpoints of the population and the sequence. As shown by the thick rightward arrow in Fig. 1.1, Parts II and III studied language corpora to reveal these statistical universals.
Notes
- 1.
The description of pronunciation here follows that in Harris (1955).
- 2.
Note that there are other, far less common possibilities such as “The United States of Tara,” the name of a television series, and misspelled words (see Sect. 17.3).
- 3.
Section 21.9 explains the relations among various measures of complexity through the notion of generalized entropy, including the successor count and the Shannon entropy.
- 4.
\(g(c_i^{i+n-1})\) in this chapter is designed to follow Harris's concept closely. An alternative is to use \({\mathrm{H}}(X_{i+n}|X_i^{i+n-1} = c_i^{i+n-1})\), whose relation with \({\mathrm{H}}(X_n|X_1^{n-1})\) is clearer. The overall experimental result should not differ.
- 5.
This maximum value was chosen after testing some larger values in the original work (Tanaka-Ishii and Jin 2008). Language has long memory, as reported in Part III, but for this specific task of articulation using Harris's scheme, n = 10 was sufficient to obtain maximum performance.
- 6.
This figure appeared in Tanaka-Ishii and Jin (2008). For clarity, the figure shows only part of the experimental results. In the experiment, an entropy shift was verified starting from every phoneme for a length of 10, as mentioned in the main text.
- 7.
The denotation of phonemes by capital letters here follows the CMU Pronouncing Dictionary (The Speech Group at CMU, 1998).
- 8.
The experiment involved a threshold parameter k such that a candidate point at j > i was regarded as a border when \(g(c_i^{j}) - g(c_i^{j-1}) \geq k\). This k value was then varied, and F-scores were acquired for all thresholds. The value k = 1.6 gave the best results.
- 9.
This figure also appeared in Tanaka-Ishii and Jin (2008). As detailed in the original paper, the rightmost point has slightly different precision and recall values from those given in the text, because the threshold value k was varied as mentioned in the previous footnote, and this graph shows the result for a slightly different k.
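The claim in note 3 that the successor count and the Shannon entropy are both instances of a generalized entropy can be illustrated with Rényi's family (a standard formulation; whether Sect. 21.9 uses exactly this notation is an assumption here):

```latex
% Rényi entropy of order \alpha over the successor probabilities p_i
H_\alpha(X) = \frac{1}{1-\alpha}\,\log_2 \sum_i p_i^{\alpha}
% \alpha = 0:      H_0 = \log_2 |\{i : p_i > 0\}|   (log of the successor count)
% \alpha \to 1:    H_1 = -\sum_i p_i \log_2 p_i     (the Shannon entropy)
```

Thus varying the order \(\alpha\) interpolates between simply counting the distinct successors and weighting them by probability.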
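Harris's scheme as described in these notes, in which a prefix grows symbol by symbol, its branching entropy g is monitored, and a border is posited wherever the entropy rises by at least a threshold k, can be sketched as follows. This is an illustrative toy implementation over characters rather than phonemes; the function names, the toy text, and the use of the conditional-entropy variant from note 4 are assumptions, not the original code.

```python
from collections import Counter
from math import log2

def branching_entropy(text, prefix):
    """Shannon entropy of the symbol that follows `prefix` in `text`."""
    succ = Counter(text[i + len(prefix)]
                   for i in range(len(text) - len(prefix))
                   if text[i:i + len(prefix)] == prefix)
    total = sum(succ.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in succ.values())

def detect_borders(text, start, max_n=10, k=1.6):
    """Posit a border at position j when the branching entropy of the
    growing prefix text[start:j] rises by at least k over the previous
    prefix length (cf. notes 5 and 8 for max_n and k)."""
    borders = []
    prev = branching_entropy(text, text[start:start + 1])
    for j in range(start + 2, min(start + max_n, len(text))):
        g = branching_entropy(text, text[start:j])
        if g - prev >= k:
            borders.append(j)
        prev = g
    return borders
```

On a toy sequence such as "the cat the dog the cow the pig", the entropy drops inside "the" (the next character is predictable) and jumps at the following space, where many continuations become possible, so a border is detected after the word once the threshold is lowered to suit the tiny sample.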
References
Creutz, Mathias and Lagus, Krista (2002). Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30.
Frantzi, Katerina T. and Ananiadou, Sophia (1996). Extracting nested collocations. In Proceedings of the 16th International Conference on Computational Linguistics, pages 41–46.
Goldwater, Sharon, Griffiths, Thomas L., and Johnson, Mark (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112, 21–54.
Hafer, Margaret A. and Weiss, Stefan F. (1974). Word segmentation by letter successor varieties. Information Storage and Retrieval, 10, 371–385.
Harris, Zellig S. (1955). From phoneme to morpheme. Language, 31(2), 190–222.
Harris, Zellig S. (1968). Mathematical Structures of Language. Interscience Publishers (John Wiley & Sons).
Harris, Zellig S. (1988). Language and Information. Columbia University Press.
Huang, Jin-Hu and Powers, David (2003). Chinese word segmentation based on contextual entropy. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pages 152–158.
Kempe, André (1999). Experiments in unsupervised entropy-based corpus segmentation. In EACL 1999: CoNLL-99 in Computational Natural Language Learning, pages 7–13.
Martinet, André (1960). Eléments de linguistique générale. Armand Colin.
Nobesawa, Shiho, Tsutsumi, Junya, Jiang, Sun D., Sano, Tomohisa, Sato, Kengo, and Nakanishi, Masakazu (1996). Segmenting sentences into linky strings using d-bigram statistics. In Proceedings of the 16th International Conference on Computational Linguistics, pages 586–591.
Saffran, Jenny R. (2001). Words in a sea of sounds: The output of infant statistical learning. Cognition, 81, 149–169.
Tanaka-Ishii, Kumiko and Ishii, Yuichiro (2007). Multilingual phrase-based concordance generation in real-time. Information Retrieval, 10, 275–295.
Tanaka-Ishii, Kumiko and Jin, Zhihui (2008). From phoneme to morpheme: Another verification in English and Chinese using corpora. Studia Linguistica, 62(2), 224–248.
The Speech Group at CMU (1998). The CMU pronouncing dictionary version 0.6. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
© 2021 The Author(s)
Tanaka-Ishii, K. (2021). Articulation of Elements. In: Statistical Universals of Language. Mathematics in Mind. Springer, Cham. https://doi.org/10.1007/978-3-030-59377-3_11
Print ISBN: 978-3-030-59376-6
Online ISBN: 978-3-030-59377-3
eBook Packages: Mathematics and Statistics