1 Introduction

The main goal of automatic speech recognition (ASR) is to convert spoken words into text [1]. Modern speech recognition systems require both acoustic and language modelling [2], which are key components of the modern statistical approach to speech recognition [3, 4]. Statistical language modelling makes it possible to develop effective large-vocabulary speech recognition systems [5]. Language modelling is used not only in speech recognition but also in other areas of speech and language processing, e.g., language recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, and information retrieval.

The main motivation for this research is to improve automatic speech recognition, especially for the Polish language [6, 7]. Related studies have addressed the properties of Polish phonemes [8, 9], speech recognition based on them [10], speaker recognition [11, 12], speaker verification [13–15], and new applications of speech recognition, e.g., automatic speech translation [16].

In particular, good automatic speech recognition performance is achieved with statistical methods [17]. Therefore, the main objective of the research presented in this paper was to perform a statistical analysis of the Polish language based on orthographic and phonemic language corpora, in order to develop statistical word-based and phoneme-based language models and to apply them to improve speech recognition for Polish. Statistical language models help to predict sequences of recognized spoken words and phonemes, and can therefore improve the effectiveness of automatic speech recognition based on statistical methods. Developing word-based and phoneme-based language models from statistical language data requires access to large orthographic and phonemic language corpora [18, 19].

2 Orthographic language corpus

One of the largest orthographic corpora of Polish is the National Corpus of Polish (NCP) [20]. The NCP corpus is available to the scientific community, offers great flexibility, and is of considerable scientific value. It provides reference material that reflects the state of contemporary Polish and meets the requirements of modern science [21]. It can be used by linguists in particular, but also by computer scientists interested in natural language processing.

The NCP corpus contains over 1,500 million words. The corpus is searchable by means of advanced tools, developed by the Institute of Computer Science at the Polish Academy of Sciences, which analyse Polish inflection and sentence structure. The sources of the NCP corpus, listed in Table 1, include classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts [22].

Table 1 Structure of the NCP corpus [20]

Given the size of the corpus, the results of the statistical analyses presented in this paper can, to a certain extent, be considered representative of the Polish language as a whole. However, the NCP corpus is still based primarily on written texts. Transcripts of spoken language constitute a smaller percentage of the corpus, which may matter for certain specialized continuous or conversational speech recognition tasks. Table 2 presents the details of the orthographic language corpus obtained from the NCP corpus resources.

Table 2 Details of the orthographic language corpus content

3 Phonemic language corpus

3.1 Grapheme-to-phoneme conversion

The phonemic Polish language corpus contains words written in phonemic notation, obtained by automatic grapheme-to-phoneme conversion of an orthographic text. Automatic processing of natural language very often requires automatic grapheme-to-phoneme conversion, which determines phonemic transcriptions directly from orthographic representations [23].

Phonemes are usually written with specially designed alphabets. The most commonly used is the International Phonetic Alphabet (IPA) [24]. It was created on the basis of the phonetics and phonology of West European languages and is not well adapted to Polish. For Polish, as for other Slavic languages, a special transcription system called the Slavistic Phonetic Alphabet (SPA) is most frequently used [25]. Another frequently used phonetic alphabet is the Speech Assessment Methods Phonetic Alphabet (SAMPA) [26], a machine-readable alphabet based on the IPA that uses 7-bit printable ASCII characters. Table 3 presents the set of Polish phonemes with examples of their occurrence in Polish, written in the SPA, IPA, and SAMPA phonetic alphabets.

Table 3 A set of Polish phonemes and examples of their occurrence

Knowledge-based grapheme-to-phoneme (G2P) approaches, unlike data-driven approaches, use rules created by humans or derived from linguistic studies to convert the sequence of graphemes in a word into a sequence of phonemes [27]. Rule-based approaches are typically formulated in the framework of finite state automata and require the formulation of grapheme-to-phoneme conversion rules [28]. The largest contribution to solving the problem of automatic grapheme-to-phoneme conversion for Polish was made by the publications of Maria Steffen-Batóg [29, 30].

The automatic grapheme-to-phoneme conversion process can be described as a function F, defined by the following formula:

$$ F(\alpha) = \beta $$
(1)

where:

$$ \alpha = \alpha_{1}\ldots\alpha_{k}\ldots\alpha_{a} ~~\wedge~~ \alpha_{k}\in X ~~ \forall ~~ (1 \leq k \leq a) $$
(2)
$$ \beta = \beta_{1}\ldots\beta_{k}\ldots\beta_{b} ~~\wedge~~ \beta_{k}\in Y ~~ \forall ~~ (1 \leq k \leq b) $$
(3)

and where a is the length of orthographic character sequence, b is the length of phonemic character sequence, X is the set of the orthographical alphabet characters in Polish, additionally with special characters, and Y is the set of the phonemic characters alphabet in Polish, described by the Slavistic Phonetic Alphabet:

(4)
(5)

Grapheme-to-phoneme conversion of correctly written orthographic Polish text transforms words written in the orthographic alphabet X into a form written in the phonemic alphabet Y. The function F can be specified by a set of formal conversion rules that define how each word α over the orthographic alphabet X is transformed into a word β over the phonemic alphabet Y. The rules are usually numerous and vary in complexity. Their number and complexity depend on the number of letters in the orthographic alphabet and on the fact that each letter can be pronounced differently in different contexts.

A set of grapheme-to-phoneme conversion rules for Polish was developed by Maria Steffen-Batóg and presented in monographs dedicated to the automatic grapheme-to-phoneme conversion of Polish texts [29, 30]. The knowledge contained in these monographs was essential for implementing the automatic grapheme-to-phoneme conversion algorithm for Polish. According to Maria Steffen-Batóg, all conversion rules relating to one orthographic letter can be stored in a single table, called the grapheme-to-phoneme conversion rules table for that letter.
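The idea of one rule table per orthographic letter can be illustrated with a minimal Python sketch. The table layout (right context, output phonemes, number of letters consumed), the name `RULE_TABLES`, and the handful of rules shown are assumptions made for illustration only. They are not the published rule set and ignore phenomena such as voicing assimilation. Output phonemes use SAMPA labels.

```python
# Toy rule tables, one per orthographic letter (illustrative, NOT the published rules).
# Each rule is (right_context, output_phonemes, letters_consumed).
# Rules are tried in order; the empty right context "" acts as the default rule.
RULE_TABLES = {
    "s": [("z", ["S"], 2), ("", ["s"], 1)],                      # "sz" -> S
    "c": [("z", ["tS"], 2), ("h", ["x"], 2), ("", ["ts"], 1)],   # "cz" -> tS, "ch" -> x
    "ł": [("", ["w"], 1)],
    "a": [("", ["a"], 1)],
    "o": [("", ["o"], 1)],
    "k": [("", ["k"], 1)],
    "t": [("", ["t"], 1)],
}


def g2p(word):
    """Convert one orthographic word to a list of phonemes using RULE_TABLES."""
    phonemes = []
    i = 0
    while i < len(word):
        letter = word[i]
        table = RULE_TABLES.get(letter)
        if table is None:
            raise KeyError(f"no rule table for letter {letter!r}")
        for right_context, output, consumed in table:
            # A rule applies when the text right after the letter matches its context.
            if word.startswith(right_context, i + 1):
                phonemes.extend(output)
                i += consumed
                break
        else:
            raise KeyError(f"no rule matches {letter!r} at position {i} in {word!r}")
    return phonemes


print(g2p("szkoła"))  # ['S', 'k', 'o', 'w', 'a']
print(g2p("czas"))    # ['tS', 'a', 's']
```

Keeping all rules for a letter in one ordered table means that adding a missing context only requires inserting one more entry before the default rule.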

Following the grapheme-to-phoneme conversion rules for Polish described in the literature [29–32], the conversion was implemented in the Python programming language as an application named TransFon [33]. The implementation includes 975 conversion rules for 35 orthographic letters of Polish, additional conversion rules for special characters, and the automatic conversion algorithm [33]. A block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic word is presented in Fig. 1. Many words have multiple correct pronunciation variants; the implementation includes only the most common, basic variant. Implementation of additional pronunciation variants is planned for the future. The problem of phonemic transcription of foreign words and acronyms was solved by using a dictionary in which their phonemic transcriptions are defined.

Fig. 1 Block diagram of the grapheme-to-phoneme conversion algorithm for a single orthographic word

The TransFon application was developed from scratch, without adapting any existing similar tools. It is not the only grapheme-to-phoneme conversion implementation for Polish [34–38], but only one of those implementations is available for free use [38]. The developed implementation can be applied to any task (e.g., development of a phonemic language corpus for Polish).

Table 4 presents the phonemic transcription examples in Polish, written with the use of the SPA, IPA, and SAMPA phonetic alphabets [25, 26].

Table 4 Phonemic transcription examples in Polish

The TransFon application makes it possible to create a phonemic language corpus solely from an orthographic source corpus. Automatic grapheme-to-phoneme conversion of the orthographic corpus with TransFon yielded the phonemic language corpus for Polish that was used for the statistical analysis of the Polish language.

3.2 Evaluation of grapheme-to-phoneme conversion implementation

Evaluation of the automatic grapheme-to-phoneme conversion implementation is crucial. While implementing the conversion for Polish, it was necessary to verify that it works properly.

The test procedure for automatic grapheme-to-phoneme conversion implementation consisted of:

  • Performing a test conversion of an orthographic corpus file containing the 1,943,462 most frequently used unique words in Polish, obtained from the National Corpus of Polish resources [20].

  • In case of doubt, validating and verifying the conversion results for individual words against an online Polish dictionary that specifies the correct pronunciation of Polish words [39].

  • Registering cases of incorrect automatic grapheme-to-phoneme conversion, conversion errors, and other problems encountered.

The automatic phonemic transcription application was implemented in such a way that the conversion algorithm stopped whenever a grapheme-to-phoneme conversion problem occurred (e.g., when no rule allowed a correct phonemic transcription). This makes it easier to improve and develop the application. In addition, any doubts about the correct pronunciation were resolved with the help of the wiktionary.org service [39]. This approach obviously has some serious limitations: the wiktionary.org dictionary contains only 61,141 Polish words, and only in their base form. Verification was further complicated by other problems, such as multiple correct pronunciation variants and the pronunciation of foreign words in the corpus.

The causes of problems and errors in automatic grapheme-to-phoneme conversion operation were as follows:

  • errors in the implementation of the grapheme-to-phoneme conversion algorithm and conversion rules,

  • missing grapheme-to-phoneme conversion rules in the tables (i.e., rules not included in the tables) for some contexts of orthographic letters,

  • problems with grapheme-to-phoneme conversion of foreign words, acronyms, and words not present in the Polish language dictionary.

The above problems were solved in the following way:

  • The errors in the implementation of the grapheme-to-phoneme conversion algorithm and in the conversion rules tables were corrected by modifying the application source code in the Python programming language.

  • The problem of missing grapheme-to-phoneme conversion rules was solved by adding new conversion rules to the existing tables. New conversion rules were added for the following orthographic letters in some contexts: “i”, “n”, “d”, “z”, “z”, “c”, “f”, “s”.

  • The problems with foreign words and acronyms were solved by using a dictionary in which the phonemic transcriptions of foreign words and acronyms are defined. As a result, rule-based automatic grapheme-to-phoneme conversion was complemented by a dictionary-based method.
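The combination of dictionary-based and rule-based conversion described above can be sketched in Python as follows. All names (`EXCEPTIONS`, `SIMPLE_RULES`, `MissingRuleError`, `convert`, `convert_corpus`) and the example entries are hypothetical, and the one-to-one letter map is only a stand-in for real rule tables. For convenience the sketch records each failure instead of halting, which simplifies the stop-on-error behaviour described in this section.

```python
class MissingRuleError(Exception):
    """Raised when no conversion rule covers a letter (hypothetical name)."""


# Hypothetical exception dictionary for foreign words and acronyms (SAMPA labels).
EXCEPTIONS = {
    "weekend": ["w", "i", "k", "e", "n", "t"],
    "pkp": ["p", "e", "k", "a", "p", "e"],
}

# Stand-in for the rule tables: a one-to-one letter map, far simpler than real rules.
SIMPLE_RULES = {"k": "k", "o": "o", "t": "t", "d": "d", "m": "m", "a": "a"}


def rule_based(word):
    """Apply the stand-in rules; fail loudly when a letter has no rule."""
    phonemes = []
    for letter in word:
        if letter not in SIMPLE_RULES:
            raise MissingRuleError(f"no rule for {letter!r} in {word!r}")
        phonemes.append(SIMPLE_RULES[letter])
    return phonemes


def convert(word):
    """Dictionary lookup first, rule-based conversion otherwise."""
    key = word.lower()
    if key in EXCEPTIONS:
        return list(EXCEPTIONS[key])
    return rule_based(key)


def convert_corpus(words):
    """Convert every word and register the cases that could not be converted."""
    converted, problems = {}, []
    for word in words:
        try:
            converted[word] = convert(word)
        except MissingRuleError as error:
            problems.append((word, str(error)))
    return converted, problems


converted, problems = convert_corpus(["kot", "PKP", "xyz"])
print(converted)  # 'kot' and 'PKP' are converted
print(problems)   # 'xyz' is registered as a problem
```

Checking the dictionary before the rules keeps the rule tables free of word-specific exceptions.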

A number of improvements made it possible to increase effectiveness of the grapheme-to-phoneme conversion implementation. Tables 5 and 6 present the word error rate (WER) values of grapheme-to-phoneme conversion implementation, before and after improvements.

Table 5 WER values of the developed G2P conversion implementation, before improvements
Table 6 WER values of the developed G2P conversion implementation, after improvements

The WER for the 1,943,462 unique words checked was 0.387%. The WER for the corpus of 230,301,313 words was 0.030%. The changes in the WER values before and after the improvements indicate that the implemented modifications improved the effectiveness of G2P conversion.
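The two WER figures differ because one is computed over unique words and the other over all word occurrences. The following Python sketch shows one plausible way to compute both. It assumes that the WER over unique words is the share of unique words converted incorrectly and that the corpus-level WER weights each word by its number of occurrences; the exact definition used in the evaluation is not stated here. The function name and the sample data are hypothetical.

```python
def g2p_error_rates(entries):
    """Return (WER over unique words, WER over all occurrences), in percent.

    entries: iterable of (occurrences, is_correct) pairs, one per unique word.
    """
    entries = list(entries)
    unique_total = len(entries)
    token_total = sum(count for count, _ in entries)
    unique_errors = sum(1 for _, ok in entries if not ok)
    token_errors = sum(count for count, ok in entries if not ok)
    return (100.0 * unique_errors / unique_total,
            100.0 * token_errors / token_total)


# Hypothetical data: two frequent words converted correctly, two rare words wrongly.
sample = [(1000, True), (500, True), (3, False), (1, False)]
unique_wer, corpus_wer = g2p_error_rates(sample)
print(unique_wer)   # 50.0
print(corpus_wer)   # about 0.27
```

Under this assumption, a corpus-level WER much lower than the unique-word WER means that the remaining errors are concentrated in rare words.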

3.3 The developed phonemic language corpus for Polish

The phonemic language corpus for Polish was developed by automatic grapheme-to-phoneme conversion of the source orthographic language corpus file obtained from the NCP corpus resources.

Table 7 presents the details of the phonemic language corpus content.

Table 7 Details of the phonemic language corpus content

The phonemic language corpus contains a list of 1,943,462 Polish words written orthographically, their phonemic transcriptions in the SAMPA phonetic alphabet, and the number of occurrences of each word in the NCP balanced corpus. The size of the NCP balanced corpus, measured as the sum of all word occurrence counts, is 230,301,313 words.

A sample section of the developed phonemic language corpus for Polish is presented in Table 8. It should also be noted that the standard SAMPA for Polish includes several sequences of phonemic transcription labels that may cause ambiguity unless separated by spaces or other characters. To avoid this problem, all phonemes are separated by square brackets.
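The ambiguity can be seen with the SAMPA labels "tS" (an affricate) and the sequence "t" followed by "S", which look identical when written without separators. The short Python sketch below assumes that each phoneme label is enclosed in its own pair of square brackets; the exact layout used in the corpus files is an assumption, as are the function names.

```python
import re


def join_phonemes(phonemes):
    """Write a phoneme sequence with every label in its own square brackets."""
    return "".join(f"[{p}]" for p in phonemes)


def split_phonemes(text):
    """Recover the phoneme labels from a bracketed transcription."""
    return re.findall(r"\[([^\[\]]+)\]", text)


# Without separators both transcriptions would read "tSI".
print(join_phonemes(["tS", "I"]))       # [tS][I]
print(join_phonemes(["t", "S", "I"]))   # [t][S][I]
print(split_phonemes("[t][S][I]"))      # ['t', 'S', 'I']
```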

Table 8 A sample section of the developed phonemic language corpus for Polish

4 Analysis of the obtained results and discussion

4.1 Statistical analysis of the orthographic and phonemic language corpora

With the use of the orthographic and phonemic language corpora, it was possible to perform statistical analysis of Polish language which includes calculation of the following distributions:

  • the frequency of the single orthographic word occurrence,

  • the frequency of the n-word sequence occurrence for n=2,…,5,

  • the frequency of the phoneme occurrence,

  • the frequency of the n-phoneme sequence occurrence for n=2,…,5.
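The n-phoneme distributions listed above can be computed from a corpus that stores, for each unique word, its phonemic transcription and its number of occurrences. The Python sketch below is one minimal way to do this. It counts n-grams inside words only, which is an assumption, since the treatment of word boundaries is not specified here. The function name and the sample data are hypothetical.

```python
from collections import Counter


def ngram_frequencies(corpus, n):
    """Relative frequencies of phoneme n-grams, weighted by word occurrences.

    corpus: iterable of (phoneme_list, occurrences) pairs, one per unique word.
    """
    counts = Counter()
    for phonemes, occurrences in corpus:
        for i in range(len(phonemes) - n + 1):
            # Every occurrence of the word contributes this n-gram once.
            counts[tuple(phonemes[i:i + n])] += occurrences
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical two-word corpus: "kot" seen 3 times, "to" seen once.
corpus = [(["k", "o", "t"], 3), (["t", "o"], 1)]
print(ngram_frequencies(corpus, 1))
print(ngram_frequencies(corpus, 2))
```

The same function applies to word sequences if each entry holds a list of words instead of a list of phonemes.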

The frequency distribution of words in the orthographic language corpus is presented in Fig. 2.

Fig. 2 Frequency distribution of the word occurrence in Polish language

A sample of the calculated word occurrence frequencies is presented in Table 9, where 1% corresponds to about 2,303,013 occurrences.

Table 9 Frequency of the word occurrence in the orthographic corpus file

Samples of the calculated occurrence frequencies for two-word and three-word sequences are presented in Tables 10 and 11. The results for four-word and five-word sequences are not presented in this paper, but they can also be helpful in developing advanced word-based language models.

Table 10 Frequency of the two-word sequence occurrence in the orthographic corpus file
Table 11 Frequency of the three-word sequence occurrence in the orthographic corpus file

The frequency distribution of the phonemes in the phonemic language corpus is presented in Fig. 3.

Fig. 3 Frequency distribution of the phoneme occurrence in Polish language

The frequency distributions of the n-phoneme sequences for n=2,…,5 are presented in Fig. 4.

Fig. 4 Frequency distribution of the n-phoneme sequence occurrence in Polish language, for n=2,…,5

4.2 Evaluation of the obtained results

The results of the statistical analysis of the Polish language performed with the phonemic language corpus were compared with other results published in the literature [40–46]. Summary comparisons of the obtained statistical language data with the results available in the literature are presented in the following tables:

  • Table 12 presents the occurrence frequency of Polish phonemes and comparison to the results published in the literature [40, 42, 44, 45],

    Table 12 Frequency of Polish phoneme occurrence—comparison to the results published in the literature [40, 42, 44, 45]
  • Table 13 presents the occurrence frequency of the two-phoneme sequences (diphones) in Polish and comparison to the results published in the literature [45],

    Table 13 Frequency of the two-phoneme sequence occurrence in Polish—comparison to the results published in the literature [45]
  • Table 14 presents the occurrence frequency of the three-phoneme sequences (triphones) in Polish and comparison to the results published in the literature [45].

    Table 14 Frequency of the three-phoneme sequence occurrence in Polish—comparison to the results published in the literature [45]

The differences between the obtained results and the statistical analyses performed by other researchers may be due to differences in the corpora used (e.g., in size, quality, and linguistic structure) and to the development of the language over time. Language is constantly changing, evolving, and adapting to the needs of its speakers. All languages change continually, and do so in many and varied ways (e.g., lexical, phonetic and phonological, spelling, semantic, and syntactic changes) [47]. Therefore, the results of research performed using different corpora may differ considerably [48, 49]. The most similar results are those of the statistical analysis of Polish phoneme occurrence presented in Table 12 [44, 45]. The least accurate results were obtained a few decades ago with a much smaller language corpus [40–42]. Taking into account the results available in the literature, it can be concluded that the performed statistical analysis of the Polish language was extensive. No results of a statistical analysis of n-phoneme sequence occurrence in Polish for n>3 were found in the literature. On the basis of the comparison, the following conclusion can be drawn: the developed phonemic language corpus used for the statistical analysis of the Polish language is very large, containing 1,263,248,497 phonemes, although it is not the largest developed for Polish [44]. The statistical analysis results obtained from it make it possible to develop statistical models of the Polish language.

4.3 Frequency of the word occurrence

The frequency of word occurrence in a language is well described by Zipf’s law [50, 51]:

$$ Z_{r} = \frac{a}{r^{b}} $$
(6)

where $Z_{r}$ is the frequency of the word of rank $r$, with words ranked from the most frequent ($r=1$) to the least frequent ($r=n$), and $a$ and $b$ are parameters to be estimated from the obtained statistical data. The usual finding is that $b$ is close to 1 [50]. The fit of Zipf’s equation to the ranked frequency distribution of Polish words is presented in Fig. 5.

Fig. 5 Fit of Zipf’s equation to the ranked frequency distribution of Polish words

The ranked frequency distribution of Polish words was estimated by Zipf’s equation in the following form:

$$ Z_{r} = \frac{0.041566}{r^{0.9}} $$
(7)

The quality of the fit of Zipf’s equation to the ranked frequency distribution of Polish words was measured by the coefficient of determination $R^{2}$. For the fit of Zipf’s equation given in Equation (7), the coefficient of determination is:

$$ R^{2} = 0.90729 $$
(8)

Additionally, the root-mean-square error (RMSE) was calculated for this case:

$$ RMSE = 7.6475\cdot10^{-6} $$
(9)

The $R^{2}$ value indicates how well statistical data fit a statistical model. The value $R^{2}=0.90729$ indicates that Zipf’s equation fits the obtained word occurrence frequency data for Polish well.
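The parameters $a$ and $b$ and the two quality measures can be estimated in several ways. The Python sketch below uses ordinary least squares on the log-transformed data, which is one common choice. It is an assumption, since the fitting procedure is not specified here, and $R^{2}$ and RMSE are computed on the original frequencies rather than in log space. All function names are hypothetical and the data are synthetic.

```python
import math


def fit_zipf(freqs):
    """Least-squares fit of log Z_r = log a - b log r (freqs sorted descending)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def zipf(r, a, b):
    """Zipf's equation Z_r = a / r**b."""
    return a / r ** b


def r_squared_and_rmse(observed, predicted):
    """Coefficient of determination and root-mean-square error."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Synthetic, noise-free data generated from known parameters.
observed = [zipf(r, 0.04, 0.9) for r in range(1, 201)]
a, b = fit_zipf(observed)
predicted = [zipf(r, a, b) for r in range(1, 201)]
r2, rmse = r_squared_and_rmse(observed, predicted)
print(a, b, r2, rmse)
```

On real frequency data the fit in log space gives more weight to the many rare words than a direct nonlinear fit would, so the two approaches can yield different parameter values.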

On this basis, and on the basis of the results available in the literature [51–53], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language based on the orthographic language corpus are correct.

4.4 Frequency of the phoneme and n-phoneme sequence occurrence

The frequency of word occurrence in a language is well described by Zipf’s law [50]. However, Zipf’s law does not describe well the distribution of the phonemes and phoneme sequences of which words are composed. The examination of occurrence frequencies in 95 languages presented in the literature [51] shows that phoneme frequencies are best described by an equation first developed by Yule, which also describes the distribution of DNA codons [54]. The frequency of phoneme occurrence in a language is described well by Yule’s equation [51]:

$$ Y_{r} = \frac{a}{r^{b}} \cdot c^{r} $$
(10)

where $Y_{r}$ is the frequency of the phoneme of rank $r$, with phonemes ranked from the most frequent ($r=1$) to the least frequent ($r=n$), and $a$, $b$ and $c$ are parameters to be estimated from the obtained statistical data.

The fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes are presented in Fig. 6.

Fig. 6 Fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes

The evaluation results of the fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes are presented in Table 15.

Table 15 Evaluation results of the fits of Zipf’s and Yule’s equations to the ranked frequency distribution of Polish phonemes

Note that Zipf’s equation is a special case of Yule’s equation in which the factor $c^{r}$ is neglected (i.e., $c=1$). It is not always possible to neglect this factor. As shown in Fig. 6 and Table 15, Yule’s equation fits the distribution of phoneme frequencies in Polish much better than Zipf’s equation. This is not an isolated case; a similar regularity can be observed in other languages [51].
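Taking logarithms of Yule’s equation gives $\log Y_{r} = \log a - b\log r + r\log c$, which is linear in the unknowns $\log a$, $b$ and $\log c$. The Python sketch below estimates them by solving the least-squares normal equations. This is one possible fitting procedure, assumed here for illustration only; the function names are hypothetical and the data are synthetic.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_yule(freqs):
    """Fit Y_r = a / r**b * c**r in log space; freqs sorted in descending order."""
    rows = [(1.0, math.log(r), float(r)) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    # Normal equations (X^T X) beta = X^T y for the predictors 1, log r and r.
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, ys)) for i in range(3)]
    log_a, minus_b, log_c = solve_linear(xtx, xty)
    return math.exp(log_a), -minus_b, math.exp(log_c)


# Synthetic, noise-free data generated from known parameters.
data = [0.1 / r ** 0.3 * 0.9 ** r for r in range(1, 41)]
print(fit_yule(data))  # close to (0.1, 0.3, 0.9)
```

A fitted value of $c$ close to 1 would indicate that Zipf’s equation alone is adequate for the data at hand.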

The same regularity was observed for the frequency distributions of n-phoneme sequence occurrence in Polish for n=2,…,5. Figs. 7 and 8 present the fit of Yule’s equation to the ranked frequency distribution of Polish n-phoneme sequences for n=2 and n=3.

Fig. 7 Fit of Yule’s equation to the ranked frequency distribution of Polish two-phoneme sequences

Fig. 8 Fit of Yule’s equation to the ranked frequency distribution of Polish three-phoneme sequences

A summary of the evaluation results for the fits of Yule’s equation to the ranked frequency distributions of Polish phonemes and n-phoneme sequences for n=2,…,5 is presented in Table 16.

Table 16 Evaluation results of the fit of Yule’s equation to the ranked frequency distribution of Polish phonemes (n=1) and n-phoneme sequences for n=2,…,5

The values of $R^{2}$ presented in Table 16 indicate that Yule’s equation fits the obtained occurrence frequency data for Polish phonemes and n-phoneme sequences for n=2,…,5 very well. Similar properties are observed for other languages. On the basis of the obtained results and the results available in the literature [40, 41, 43–46, 51], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language based on the orthographic and phonemic language corpora are correct.

5 Example of practical application of the obtained results for language modelling

This article provides general statistics of the Polish language that can be useful for a variety of language and speech processing applications, including automatic speech recognition with language models [55].

The goal of a word-based language model is to model the sequence of words in the context of the task performed by the speech recognition system. In continuous speech recognition, incorporating a language model is crucial for reducing the search space of the recognized word sequence $W$. The probability $P(W)$ of a sequence $W$ of $n$ words $w_{i}$ can be decomposed as [17]:

$$ P(W) = P(w_{1})\prod\limits_{i=2}^{n} P(w_{i}|w_{1},\ldots,w_{i-1}) $$
(11)

where $P(w_{i}|w_{1},\ldots,w_{i-1})$ is the conditional probability that $w_{i}$ will occur given the preceding word sequence $w_{1},\ldots,w_{i-1}$. Unfortunately, it is impossible to compute the conditional word probabilities $P(w_{i}|w_{1},\ldots,w_{i-1})$ for all words and all sequence lengths in a given language. Even if the sequences are limited to moderate values of $i$, there would not be enough data to estimate all of the conditional probabilities reliably. The conditional probability can therefore be approximated by conditioning only on the preceding $N-1$ words, as in the following formula:

$$ P(W) = P(w_{1})\prod\limits_{i=2}^{n} P(w_{i}|w_{i-N+1},\ldots,w_{i-1}) $$
(12)

This approximation is commonly referred to as the N-gram model [17]. The most popular solutions published in the literature apply N-gram language models to word-based speech recognition tasks [56–59].
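The conditional probabilities of an N-gram model are commonly estimated from counts as the number of occurrences of the full N-gram divided by the number of occurrences of its context. The Python sketch below shows this maximum-likelihood estimate without any smoothing. The start-of-sentence padding symbol `<s>`, the function names, and the toy sentences are assumptions made for illustration and do not describe how the models in this paper were built.

```python
from collections import Counter


def train_ngram(sentences, n):
    """Count N-grams and their (N-1)-word contexts in tokenized sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        # Pad the start so that the first words also have a full context.
        padded = ["<s>"] * (n - 1) + list(sentence)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def probability(word, context, ngram_counts, context_counts):
    """Maximum-likelihood estimate of P(word | context), without smoothing."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[context + (word,)] / context_counts[context]


# Toy corpus of two tokenized sentences.
sentences = [["ala", "ma", "kota"], ["ala", "ma", "psa"]]
bigrams, contexts = train_ngram(sentences, 2)
print(probability("ma", ["ala"], bigrams, contexts))    # 1.0
print(probability("kota", ["ma"], bigrams, contexts))   # 0.5
```

Unseen N-grams receive probability zero under this estimate, which is why practical models add smoothing or back-off.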

Language modelling may be based on words as well as on sub-word units (e.g., phonemes). Statistical analysis of the phonemic corpus makes it possible to develop statistical language models based on phonemes.

For a sequence of phonemes $Q=q_{1}\ldots q_{m}$, containing $m$ phonemes $q_{i}$, the probability $P(Q)$ is given by a phoneme-based language model according to the following formula:

$$ P(Q) = P(q_{1})\prod\limits_{i=2}^{m} P(q_{i}|q_{1},\ldots,q_{i-1}) $$
(13)

where $P(q_{i}|q_{1},\ldots,q_{i-1})$ is the conditional probability that $q_{i}$ will occur given the preceding phoneme sequence $q_{1},\ldots,q_{i-1}$. The approximation of the probability $P(Q)$ for an N-gram phoneme-based language model is defined by the analogous formula:

$$ P(Q) = P(q_{1})\prod\limits_{i=2}^{m} P(q_{i}|q_{i-N+1},\ldots,q_{i-1}) $$
(14)

On the basis of the statistical analysis of the orthographic language corpus, N-gram word-based language models for N=1,…,3 were developed for Polish. Similarly, on the basis of the statistical analysis of the phonemic language corpus, N-gram phoneme-based language models for N=1,…,3 were developed for Polish. The details of the development of the word-based and phoneme-based language models are presented in a separate publication. This article presents only an example of applying the statistical analysis of the language to develop selected language models.

One approach to evaluating a language model is the word recognition error rate [60]. However, this approach requires a working speech recognition system. Alternatively, we can measure the average number of possible words that follow any given word sequence in a language. This measure, derived from entropy, is known as perplexity (PP) [17]. Given a language model $P(W)$, where $W$ is an n-word sequence, the entropy of the language model can be defined as [61]:

$$ H(W) = -\frac{1}{n}\log_{2}(P(W)) $$
(15)

For an N-gram language model, the entropy $H(W)$ can be calculated with the following formula:

$$ H(W) = -\frac{1}{n}\sum\limits_{i=1}^{n} \log_{2}(P(w_{i}|w_{i-N+1},\ldots,w_{i-1})) $$
(16)

Note that as $n$ approaches infinity, the entropy approaches the asymptotic entropy of the source defined by the measure $P(W)$. This would require the length of the sequence to approach infinity, which is of course impossible. Thus, the entropy $H(W)$ should be estimated for a sufficiently large value of $n$. The perplexity $PP(W)$ of the word-based language model is then defined as [17]:

$$ PP(W) = 2^{H(W)} $$
(17)
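Equations (16) and (17) translate directly into a few lines of Python. The sketch below takes the conditional probabilities that a model assigns to the successive units of a test sequence and returns the entropy and the perplexity. The function names are hypothetical, and the probabilities in the examples are made up.

```python
import math


def entropy(conditional_probs):
    """H(W) = -(1/n) * sum of log2 P(w_i | context), as in Equation (16)."""
    n = len(conditional_probs)
    return -sum(math.log2(p) for p in conditional_probs) / n


def perplexity(conditional_probs):
    """PP(W) = 2 ** H(W), as in Equation (17)."""
    return 2.0 ** entropy(conditional_probs)


# A model that always spreads probability evenly over 4 choices has perplexity 4.
print(perplexity([0.25] * 8))     # 4.0
# Probabilities 1/2 and 1/8 give H = (1 + 3) / 2 = 2 bits, so PP is again 4.
print(perplexity([0.5, 0.125]))   # 4.0
```

A probability of zero makes the logarithm undefined, so in practice the model must assign a non-zero probability to every unit of the test sequence, for example through smoothing.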

The perplexity values $PP_{N}(W)$ of the developed word-based N-gram language models for N=1,…,3 are compared in Table 17. The perplexity values $PP_{N}(Q)$ of the developed phoneme-based N-gram language models for N=1,…,3 are compared in Table 18.

Table 17 Comparison of perplexity $PP_{N}(W)$ values for the developed word-based N-gram language model for N=1,…,3
Table 18 Comparison of perplexity $PP_{N}(Q)$ values for the developed phoneme-based N-gram language model for N=1,…,3

The PP values presented in Tables 17 and 18 show that the developed phoneme-based 3-gram language model has the lowest PP value, equal to 7.77. A lower perplexity indicates a greater ability of the language model to predict the sequence of speech components, so a language model is rated as better when its perplexity is lower. Language models with low perplexity indicate a more predictable language. However, since perplexity is not related to the difficulty of recognizing particular acoustic patterns, reducing the language model perplexity does not guarantee an improvement in automatic speech recognition performance.

5.1 Potential application of other statistical analysis results

The statistical analysis results for four-word and five-word sequences are not presented in this paper, but they can be helpful in developing advanced (4-gram and 5-gram) word-based language models for Polish. As noted above, language modelling may be based on words as well as on sub-word units (e.g., phonemes). Therefore, the statistics of sequences longer than three phonemes can be used to develop advanced (higher than 3-gram) phoneme-based language models for Polish. Advanced word-based and phoneme-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary (OOV) word detection in large vocabulary conversational speech recognition (LVCSR) systems [62, 63]. The language model in most state-of-the-art LVCSR systems is still the N-gram, which assigns a probability to the next word based only on the N−1 preceding words [64]. However, the use of additional phoneme-based language models improves the efficiency of LVCSR systems [65]. Another possible improvement in LVCSR system development is the use of higher than 4-gram language models, with particular emphasis on N-gram phoneme-based language models.

6 Conclusions

This paper presents original results of a statistical analysis of the Polish language, performed using an orthographic language corpus obtained from the NCP corpus and a phonemic language corpus developed through automatic grapheme-to-phoneme conversion of the orthographic corpus. The results of the statistical analysis make it possible to develop statistical word-based and phoneme-based language models for use in automatic speech recognition.

The results of the statistical analysis of the Polish language were compared with, and are consistent with, other results available in the literature [40–46, 66, 67]. Taking into account the results available in the literature, it can be concluded that the performed statistical analysis of the language was extensive. No results of a statistical analysis of n-phoneme sequence occurrence in Polish for n>3 were found in the literature. On the basis of the comparison, the following conclusion can be drawn: the phonemic language corpus used for the statistical analysis of the language is very large (containing 1,263,248,497 phonemes), and the statistical analysis results obtained from it make it possible to develop statistical models of the Polish language.

Additionally, the obtained statistical data were validated and evaluated. The frequency of word occurrence in a language is well described by Zipf’s law, so the statistical data for words were validated by fitting Zipf’s equation to the ranked frequency distribution of Polish words. A comparable regularity was observed for the frequency distribution of phoneme occurrence in Polish. The examination of occurrence frequencies in 95 languages presented in the literature [51] shows that phoneme frequencies are best described by Yule’s equation [54]. The statistical data for phonemes were validated by fitting Yule’s equation to the ranked frequency distributions of Polish phonemes and n-phoneme sequences. In line with the results available in the literature [51], it can be concluded that the statistical data obtained from the statistical analysis of the Polish language based on the orthographic and phonemic language corpora are correct.

The regularity presented in this paper is not an isolated case; a similar regularity can be observed in other languages, and thus also in other language corpora reflecting the state of a contemporary language [51]. It would be valuable to provide similar fits for existing Polish text corpora, so that the reader could assess the quality of the created phonemic language corpus. Similarly, it would be very valuable to compare the word error rate and perplexity of language models created from the existing Polish corpora on a common test set. However, this is difficult because of the lack of access to other existing Polish text corpora of appropriate size and quality, apart from the NCP corpus. Likewise, the author did not find any available phonemic language corpus for Polish. Therefore, the author created his own phonemic language corpus by G2P conversion of the existing available orthographic language corpus for Polish (NCP). Since this problem seems very important, the author plans to address it in future publications.

The developed word-based and phoneme-based language models were also presented in this paper as an example of a practical application of the obtained statistical data on the Polish language. The obtained statistical data open up further opportunities for research on improving automatic speech recognition in Polish. The plan for future research includes the development of statistical word-based and subword-based language models for Polish. Word-based and subword-based language modelling makes it possible to develop hybrid language models for out-of-vocabulary word detection in large vocabulary conversational speech recognition [64, 68–70].