1 Introduction

As education levels continue to rise, China pays increasing attention to cultivating talent. To cultivate diverse talent that meets the needs of society, schools offer a wide range of courses. Among them, English is a basic course studied from primary school through university in China, and it is also one of the key skills assessed when modern students seek employment [1]. English is also an international and widely used language. In this context, schools in China attach great importance to students' English learning. Compared with written English, students generally have more problems with spoken English. This is due to the lack of an English language environment in China: there are few opportunities to communicate in English, which leads to non-standard pronunciation.

In English communication, pronunciation problems prevent speakers from conveying their ideas accurately, which results in poor communication. For this reason, primary and secondary schools have begun to attach importance to pronunciation training; that is, while cultivating English writing ability, they also cultivate spoken expression, and they try to provide students with a good environment for speaking English. Oral pronunciation practice is, however, a long-term task. At present, teachers mainly address students' pronunciation problems by drawing on their own learning and long-term experience, relying on their own knowledge and ability to instruct students. Because teachers differ in ability, knowledge and spoken expression, and students differ in how they understand what is taught, the effect of pronunciation training is not ideal. The main cause is the lack of a complete, standardized training method.

Meanwhile, information technology has developed, become widespread, and been applied successfully in many fields. To help students improve their oral English pronunciation and the quality of their spoken English, information technology can be used to develop pronunciation training methods and establish a unified, standardized training mode. How to support students' oral English pronunciation training has therefore become an urgent problem for universities.

To apply information technology to the auxiliary training of spoken English pronunciation, the early-stage collection, transmission and processing of pronunciation audio can be realized with pickups, audio Bluetooth chips, programmable logic controllers and similar devices, which together provide students with usable, standardized pronunciation data. The pronunciation audio is fed into a processing program. A pronunciation audio data extraction program and a pronunciation auxiliary training program then complete the application of information technology to oral English pronunciation training and form a standardized, targeted training mode.

2 Research status

With the development of information technology, some education researchers have applied it to oral English pronunciation training. Some have collected teacher training samples and built a database of oral English pronunciation training, from which students retrieve samples matching the content they need to practise, so as to improve their spoken English [2]. However, this method lacks a unified standard: because the sample data from different teachers are inconsistent, the database itself can contain non-standard pronunciation. Other researchers use a BP neural network to establish a forgetting curve for oral English training. By analyzing students' training process, they obtain the training cycle and schedule training at the key points of the forgetting curve [3]. This method studies only the training process and does not address the standardization of pronunciation training. Still other researchers use sensor equipment to collect spoken English pronunciation samples, screen out samples with poor pronunciation, retain usable samples with standard pronunciation, and then use sample matching to recommend an appropriate learning level and method to each student [4]. This method focuses on matching suitable pronunciation samples and does not study the training itself or the training process in depth.

Other researchers focus on building spoken English pronunciation systems. Some have established an evaluation index system for spoken English pronunciation, designed a questionnaire, and entered the evaluation scoring table into the system. When students complete the questionnaire as the system requires, the system automatically reports the defect level of their pronunciation and offers suggestions for improvement [5]. However, this method uses the system only to calculate evaluation scores and does not effectively analyze pronunciation training. Others have built a system based on pronunciation comparison and matching [6]. Using a matching algorithm, it compares the pronunciation data entered into the system with historical pronunciation data, and then uses case-based reasoning together with the matching algorithm to output similar pronunciation data and the problems found in the pronunciation. However, this system has not been widely used.

According to existing research on oral English pronunciation training, current work mainly focuses on building pronunciation databases and evaluating pronunciation defects through questionnaires. The systems involved only calculate the evaluation level of pronunciation defects, which is a clear limitation. In view of these problems, this study establishes a "data layer-logic layer-display layer" system. The system records students' oral English pronunciation, compares it with accurate pronunciation using suprasegmental processing, scores it according to the comparison results, and finally gives correction suggestions, thereby realizing online automatic evaluation and correction of English speech. The study also adapts and improves a data extraction algorithm to support systematic training of spoken English pronunciation. The application of the system to oral English pronunciation training is completed through the design of its hardware and software.

3 Design of English spoken pronunciation auxiliary training system based on data extraction

The fundamental function of English as a language is communication with others [7]. Both the purpose of language use and the accuracy of pronunciation are very important in a language system, but Chinese students generally pronounce English inaccurately. To solve this problem, this paper designs an auxiliary training system for spoken English pronunciation based on data extraction. The system records pronunciation, extracts pronunciation features, and judges whether the pronunciation is accurate. It then gives a pronunciation score and points out the mistakes.

3.1 System framework design

The system framework is the overall structure of the system design and provides guidance and reference for the whole design. This system adopts a three-layer browser/server (B/S) structure. The advantages of the three-layer B/S structure are that it simplifies client design, reduces network load, and is easy to maintain and upgrade [8]. The framework consists of a data layer, a business logic layer and a display layer.

  1. Data layer: The data layer takes in the user's spoken English pronunciation, turns it into operable and accessible data, and provides data services for the other two layers [9].

  2. Business logic layer: The business logic layer designs and runs the business logic of the various functional modules to handle the corresponding tasks [10].

  3. Display layer: The display layer displays the results of the business logic and provides the window for user operation [11].
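The division of responsibilities among the three layers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `save`/`load`/`render` methods and the mean-energy placeholder are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass, field


@dataclass
class DataLayer:
    """Stores recorded pronunciation audio and serves it to the other layers."""
    _recordings: dict = field(default_factory=dict)

    def save(self, user_id: str, samples: list) -> None:
        self._recordings[user_id] = list(samples)

    def load(self, user_id: str) -> list:
        return self._recordings[user_id]


class BusinessLogicLayer:
    """Runs processing on stored audio (here only a placeholder energy measure)."""

    def __init__(self, data: DataLayer) -> None:
        self.data = data

    def mean_energy(self, user_id: str) -> float:
        samples = self.data.load(user_id)
        return sum(s * s for s in samples) / len(samples)


class DisplayLayer:
    """Formats business-logic results for the user."""

    def __init__(self, logic: BusinessLogicLayer) -> None:
        self.logic = logic

    def render(self, user_id: str) -> str:
        return f"{user_id}: mean energy {self.logic.mean_energy(user_id):.3f}"
```

Each layer only talks to the one directly below it, which is the property that makes the layers easy to maintain and upgrade independently.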

3.2 System hardware design

3.2.1 English pronunciation input equipment

The purpose of oral English pronunciation training is to judge whether the user's pronunciation is accurate by comparing it with accurate pronunciation, so that it can be corrected [12]. The key to the training system is therefore a recording device that collects the user's pronunciation. The recording device in this system is a low-noise pickup composed of a microphone and an audio amplification circuit. It uses a high-fidelity, low-noise processing chip that suppresses environmental noise through multiple frequency selection, and it has a built-in automatic gain control (AGC) circuit to keep the recorded sound clean. The technical parameters of the equipment are shown in Table 1.

Table 1 Pickup technical parameters

3.2.2 Audio Bluetooth chip

The functions of the audio Bluetooth chip include two aspects: one is to transmit the recorded and processed spoken English pronunciation audio signal to the auxiliary training center for further processing and analysis; the other is to output the voice command of the auxiliary training center and transmit the correct spoken English pronunciation audio [13]. The audio Bluetooth chip in this system is CX950B, which is mainly used for short-distance audio signal transmission. It can be easily connected with notebook computers, mobile phones, PDAs and other devices to achieve wireless transmission of audio signals. The specific parameters of the chip are shown in Table 2:

Table 2 Audio bluetooth chip working parameters

3.2.3 Programmable logic controller

In the auxiliary training system of spoken English pronunciation, all kinds of business logic operations are needed. Therefore, the programmable logic controller is an important hardware in the system design. It uses a programmable memory, which is responsible for data acquisition and extraction, self-diagnosis, control execution, external communication and external output functions [14]. The performance characteristics of the PLC in this system are as follows:

  1. With a 32-bit 2001-2421 CPU, it has faster speed and higher performance, supports basic control instructions, and can better meet diversified control requirements;

  2. The CPU, I/O signals, communication network and power supply all adopt isolation protection measures;

  3. A miniature embedded real-time multi-task operating system is adopted to support multi-task distribution and make reasonable use of CPU resources;

  4. Open network: 100 M Ethernet, multi-serial communication, MODBUS support and custom protocols allow wireless data transmission with wireless terminal equipment.

3.3 System software design

The business logic program governs the operation of each functional module in the system and carries out the whole oral English pronunciation training procedure in a logic-driven way [15]. It comprises a pronunciation audio input processing program, a pronunciation audio data extraction program and a pronunciation auxiliary training program.

3.3.1 Pronunciation audio input processor

The pronunciation audio input processing program is the first important program to be run after the user logs in. The program includes the logic operation of the whole early stage from the input of spoken English pronunciation to the processing and then to the sending of audio files. The program is shown in Fig. 1 below.

Fig. 1 Pronunciation audio input processor

3.3.2 Pronunciation audio data extraction program

The main reason earlier oral English pronunciation training systems performed poorly is that they could not effectively compare standard pronunciation with the user's pronunciation, so users could not fully understand their pronunciation errors. The pronunciation audio data extraction program, which prepares the data for the subsequent pronunciation comparison, is therefore the core program of the system [16]. The specific procedure is as follows:

  • Step 1: Receive and decode the spoken English pronunciation audio file;

  • Step 2: Segment the spoken English pronunciation audio into syllables;

  • Step 3: Extract audio features, which include the pitch period, Mel cepstrum coefficient and formant frequency.

1) Pitch period. The pitch period characteristic parameters are extracted by the autocorrelation function. The extraction formula is as follows:

$$f_{i} \left( k \right)^{1} = \sum\limits_{t = 1}^{N} {x_{i} \left( t \right)x_{i} \left( {t + k} \right)}$$
(1)

\(f_{i} \left( k \right)^{1}\) represents the autocorrelation value used to determine the pitch period of the \(i\)-th audio signal; \(k\) represents the time delay; \(N\) represents the frame length; \(x_{i} \left( t \right)\) represents the audio signal of spoken English pronunciation; \(t\) represents time.
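As a minimal sketch of how Eq. (1) yields a pitch period, the following Python code evaluates the autocorrelation for a range of delays and picks the delay with the largest value. Several details are assumptions rather than part of the source: the sum is restricted to the samples that exist in the frame (so it runs over \(N - k\) terms), the search range `min_lag`..`max_lag` is hypothetical, and the test signal is synthetic.

```python
import math


def autocorrelation(frame, k):
    """Eq. (1), restricted to the samples available in the frame."""
    return sum(frame[t] * frame[t + k] for t in range(len(frame) - k))


def pitch_period(frame, min_lag, max_lag):
    """Return the delay (in samples) with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda k: autocorrelation(frame, k))


# Synthetic 200 Hz tone sampled at 8 kHz, so one period is 40 samples.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(400)]
lag = pitch_period(frame, min_lag=20, max_lag=60)
pitch_hz = sample_rate / lag
```

Restricting the search range matters in practice: the autocorrelation is always largest at \(k = 0\), and multiples of the true period produce further peaks.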

2) Mel cepstrum coefficient. The Mel cepstrum coefficients of the audio are extracted with a Mel filter bank [17]. The extraction formula is as follows:

$$f_{i} \left( k \right)^{2} = \sum\limits_{i = 0}^{M} G \left( i \right) \cdot \sin \left( {\frac{n\pi }{M}} \right)$$
(2)

\(f_{i} \left( k \right)^{2}\) represents the Mel cepstrum coefficient of the \(i\)-th audio signal; \(G\left( i \right)\) represents the logarithmic energy output by the \(i\)-th Mel filter; \(M\) represents the number of Mel filters; \(n\) represents the order of the coefficient.
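For orientation, the sketch below computes cepstral coefficients from a list of log Mel filter-bank energies. Note that it does not reproduce Eq. (2) literally: it uses the common textbook form, a type-II discrete cosine transform \(c_{n} = \sum_{i=1}^{M} G(i)\cos\left(\pi n (i - 0.5)/M\right)\), which is an assumption on our part. The function name and the example energies are hypothetical.

```python
import math


def mel_cepstrum(log_energies, num_coeffs):
    """DCT-II of the log Mel filter-bank energies G(1..M)."""
    m = len(log_energies)
    return [
        sum(g * math.cos(math.pi * n * (i + 0.5) / m)
            for i, g in enumerate(log_energies))
        for n in range(num_coeffs)
    ]


# Hypothetical log energies from a bank of M = 4 Mel filters.
log_energies = [1.0, 1.0, 1.0, 1.0]
coeffs = mel_cepstrum(log_energies, num_coeffs=3)
```

With a flat filter-bank output, only the zeroth coefficient is non-zero, which illustrates that the higher-order coefficients describe the shape of the spectral envelope rather than its overall level.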

3) Formant frequency. The formant frequencies of the audio are extracted using linear predictive coding (LPC) [18]. The extraction formula is as follows:

$$f_{i} \left( k \right)^{3} = \frac{{F\left( {L_{i} } \right)}}{2\pi } \cdot T$$
(3)

\(f_{i} \left( k \right)^{3}\) represents the formant frequency; \(T\) represents the signal sampling period; \(F\) represents the prediction error filter; \(L_{i}\) represents the bandwidth of the \(i\)-th audio signal.

The first three formant peaks of each frame of the audio signal can be connected across frames to form the formant trajectory.
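One common way to obtain formant candidates from LPC is to fit the predictor coefficients and then pick the peaks of the resulting spectral envelope. The sketch below follows that approach; it is an illustration under assumptions, not the exact procedure behind Eq. (3). The function names, the 10 Hz frequency grid, the model order and the synthetic single-resonance signal are all hypothetical.

```python
import cmath
import math


def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the frame x."""
    return [sum(x[t] * x[t + k] for t in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_envelope_peaks(a, sample_rate, step_hz=10):
    """Local maxima of 1/|A(e^{jw})| on a frequency grid (formant candidates)."""
    freqs = list(range(0, sample_rate // 2 + 1, step_hz))
    mags = []
    for f in freqs:
        w = 2 * math.pi * f / sample_rate
        denom = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        mags.append(1.0 / abs(denom))
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]


# Synthetic test signal: impulse response of one resonance near 700 Hz.
sample_rate = 8000
theta = 2 * math.pi * 700 / sample_rate
r_pole = 0.95
x = []
for n in range(600):
    prev1 = x[n - 1] if n >= 1 else 0.0
    prev2 = x[n - 2] if n >= 2 else 0.0
    impulse = 1.0 if n == 0 else 0.0
    x.append(impulse + 2 * r_pole * math.cos(theta) * prev1
             - r_pole ** 2 * prev2)

a = levinson_durbin(autocorr(x, 2), order=2)
formants = lpc_envelope_peaks(a, sample_rate)
```

For real speech a higher model order would be needed to capture several formants, and the first three peaks per frame would then be the values linked into a trajectory.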

Step 4: Fuse the multiple features and normalize the result.

The steps above make up the pronunciation audio data extraction program. After extraction, the system proceeds to the next program.
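The fusion and normalization step is not specified in detail, so the following is only one plausible reading: each feature track is min-max normalized to the range 0 to 1 and the tracks are then concatenated into a single vector. Both choices, as well as the function names and example values, are assumptions.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant track maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_features(pitch, mel, formant):
    """Normalize each feature track separately, then concatenate them."""
    fused = []
    for track in (pitch, mel, formant):
        fused.extend(min_max_normalize(track))
    return fused


# Hypothetical per-frame feature values for three frames.
fused = fuse_features([100.0, 150.0, 200.0], [-2.0, 0.0, 2.0], [500.0, 500.0, 500.0])
```

Normalizing before fusion keeps a feature measured in hundreds of hertz from dominating one whose values are close to zero when distances are computed later.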

3.3.3 Pronunciation aid training program

The pronunciation auxiliary training program takes the extracted pronunciation audio feature parameters as input. The program proceeds as follows:

Step 1: Input the audio feature parameters of the user's spoken English pronunciation;

Step 2: Extract the feature parameters of the standard spoken English pronunciation audio;

Step 3: Calculate the similarity between the actual pronunciation audio features of the user and the standard spoken English pronunciation audio features according to a distance formula. The calculation formula is as follows:

$$d\left( {f_{i} ,f_{i}^{^{\prime}} } \right) = \frac{{\sqrt {\sum\limits_{i = 1}^{N} {\left( {f_{i} - f_{i}^{^{\prime}} } \right)^{2} } } }}{N}$$
(4)

\(d\left( {f_{i} ,f_{i}^{^{\prime}} } \right)\) represents the distance between the user's actual pronunciation audio features \(f_{i}\) and the standard spoken English pronunciation audio features \(f_{i}^{^{\prime}}\), where a smaller distance indicates greater similarity; \(N\) represents the frame length.
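Eq. (4) can be written directly in Python. The sketch assumes that both feature sequences have already been aligned to the same length \(N\); the function name and the example values are hypothetical.

```python
import math


def feature_distance(user, standard):
    """Eq. (4): square root of the summed squared differences, divided by N."""
    if len(user) != len(standard):
        raise ValueError("feature vectors must have the same length")
    n = len(user)
    return math.sqrt(sum((u - s) ** 2 for u, s in zip(user, standard))) / n


# Two features; only the first differs, by 4, so d = sqrt(16) / 2.
d = feature_distance([0.0, 3.0], [4.0, 3.0])
```

Identical feature sequences give a distance of zero, and the division by \(N\) keeps the value comparable across sequences of different length.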

Step 4: Calculate the corresponding pronunciation score according to the feature distance. The calculation formula is as follows:

$$S = 1 + \frac{100}{{\delta \cdot d^{\gamma } }}$$
(5)

\(S\) represents the pronunciation score; \(\delta\) and \(\gamma\) are constant parameters with values between 0 and 1 that satisfy \(\delta + \gamma = 1\).
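The sketch below evaluates Eq. (5) exactly as printed. Taken literally, the expression grows without bound as \(d\) approaches zero, so the code caps the result at 100 and returns the cap for \(d = 0\); this cap, the default values \(\delta = \gamma = 0.5\) and the function name are assumptions added for illustration and are not stated in the text.

```python
def pronunciation_score(d, delta=0.5, gamma=0.5, cap=100.0):
    """Eq. (5) as printed, S = 1 + 100 / (delta * d**gamma), capped at `cap`."""
    if d <= 0:
        return cap  # identical features; Eq. (5) is undefined at d = 0
    return min(cap, 1.0 + 100.0 / (delta * d ** gamma))


# A large distance gives a low score: 1 + 100 / (0.5 * 100) = 3.
low = pronunciation_score(10000.0)
# A small distance reaches the assumed cap.
high = pronunciation_score(1.0)
```

The score decreases monotonically with the distance, which is the behaviour the scoring step relies on.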

Step 5: Calculate the comprehensive score of the user's oral English pronunciation audio. The calculation formula is as follows:

$$S^{\prime} = w_{1} S_{1} + w_{2} S_{2} + w_{3} S_{3}$$
(6)

\(S^{\prime}\) represents the comprehensive score; \(w_{1}\), \(w_{2}\) and \(w_{3}\) represent the weights; \(S_{1}\), \(S_{2}\) and \(S_{3}\) represent the pronunciation scores corresponding to the pitch period, Mel cepstrum coefficient and formant frequency.
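A worked instance of Eq. (6) is given below. The weight values are hypothetical, chosen to sum to one so that the comprehensive score stays on the same scale as the individual scores; the text does not state the weights actually used.

```python
def comprehensive_score(scores, weights):
    """Eq. (6): weighted sum of the pitch, Mel-cepstrum and formant scores."""
    if len(scores) != 3 or len(weights) != 3:
        raise ValueError("expected three scores and three weights")
    return sum(w * s for w, s in zip(weights, scores))


# 0.5 * 80 + 0.25 * 90 + 0.25 * 70 = 40 + 22.5 + 17.5
total = comprehensive_score(scores=(80.0, 90.0, 70.0), weights=(0.5, 0.25, 0.25))
```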

Step 6: Judge whether the user's pronunciation is qualified according to the comprehensive score, as shown in Table 3 below:

Table 3 Standard pronunciation scale

Step 7: Further determine the types of pronunciation errors according to the SVM classifier, and mine the error rules.

Step 8: Give the corresponding report of pronunciation standard;

Step 9: According to the report, the user identifies the pronunciation errors and performs corrective exercises corresponding to the standard pronunciation.

Repeat the above process of recording, scoring and correcting until the pronunciation reaches the qualified standard or above to complete the whole pronunciation assistant training process [16].

4 System implementation and testing

The key to oral English pronunciation training is to find the user's pronunciation errors accurately and correct them. In the experiment, test data are collected, the classifier is trained on training samples, the characteristics of the data are analyzed, and the results of the system are tested.

4.1 System test sample collection

Twenty students served as test subjects. Twenty oral English pronunciation samples with the same test content (see Fig. 3) were collected in the environment shown in Fig. 2.

Fig. 2 System test sample collection environment

Fig. 3 Oral English pronunciation test content

4.2 Training sample

The training samples for the SVM classifier are selected from the TIMIT speech database. The training sample set is shown in Fig. 4 below.

Fig. 4 Training sample set

4.3 Data characteristics

The pronunciation audio data extraction program is executed to extract three features of the audio data: the pitch period, the Mel cepstrum coefficient and the formant frequency. Taking the standard spoken English pronunciation sample as an example, the three data characteristics are shown in Fig. 5.

Fig. 5 Data characteristics

4.4 System test results

The pronunciation training program was run to calculate the distance between the audio features of the 20 students' spoken English pronunciation and those of the standard pronunciation, and the scores were calculated. The SVM classifier was then used to classify the types of pronunciation error and mine the error rules. The results are shown in Table 4:

Table 4 System application test results

Table 4 shows that, with the system applied, the oral English pronunciation of 12 of the 20 students is at or above the qualified line: 3 are excellent, 5 are good and 4 are qualified. The remaining 8 students are unqualified; their common problems are stress errors and weak-form (light reading) errors.

5 Conclusion

Students in China generally have the problem of inaccurate English pronunciation, which leads to difficulties in English communication. In order to improve students' oral English pronunciation level, this paper designs an auxiliary training system of oral English pronunciation based on data extraction. The conclusions are as follows:

  1. The proposed system picks up students' pronunciation audio, compares it with standard pronunciation, determines where the pronunciation is inaccurate, and gives corrective suggestions.

  2. The proposed system achieves its training purpose through repeated recording and correction. Testing shows that the auxiliary training function of the system works well and meets the system design objectives.