
1 Introduction

Japanese sentences contain various kinds of characters, such as Kanji (Chinese characters), Hiragana, Katakana, numerals, and Latin letters, which makes the language difficult to learn. Japanese speakers usually learn Hiragana first at school because its character inventory is much smaller than that of Kanji: Hiragana has 46 characters, whereas Japanese uses thousands of Kanji. Most Japanese sentences mix all of these character types; such sentences are called Kanji-Kana mixed sentences. However, it is difficult for many non-Japanese speakers to learn thousands of Kanji, so children and new learners of Japanese use Hiragana sentences.

Unlike Western languages, Japanese and Chinese do not mark word boundaries, so word segmentation is necessary for natural language processing of these languages. MeCab (Footnote 1) and ChaSen (Footnote 2) are morphological analyzers for Japanese that segment Japanese sentences into words. The performance of existing Japanese morphological analyzers is very high. However, because these systems are designed for Kanji-Kana mixed sentences, they have difficulty segmenting sentences written almost entirely in Hiragana (Footnote 3) into words: when almost all of a sentence is written in Hiragana, it is challenging to identify where word boundaries should be placed.

In this paper, we developed two types of BERT [1] (Bidirectional Encoder Representations from Transformers) models for Hiragana sentences: unigram and bigram BERT models (see Sect. 3). We utilized a large amount of automatically tagged data for pre-training and manually tagged data for fine-tuning (see Sect. 4). In addition, to compare against our Hiragana BERT-based word segmentation models, we developed a Hiragana word segmentation model using KyTea (Footnote 4), a text analysis toolkit focused on languages that require word segmentation. The experiments revealed that the unigram BERT model outperformed both the bigram BERT model and the KyTea model (see Sect. 6). We discuss the reasons (see Sect. 7) and conclude the paper (see Sect. 8).

The contributions of this paper are as follows:

  1. We developed Hiragana unigram and bigram BERT models,

  2. We showed that automatically tagged data is effective for pre-training of Hiragana BERT models, and

  3. We discussed why the Hiragana unigram BERT-based word segmentation model outperformed the Hiragana bigram BERT-based and KyTea-based word segmentation models.

2 Related Work

There are several studies on word segmentation and morphological analysis of Hiragana sentences, including the following. First, Kudo et al. [5] modeled the process of generating Hiragana-mixed sentences, i.e., sentences that include words written in Hiragana that are usually written with other characters such as Kanji, using a generative model. They proposed a method to improve the accuracy of parsing Hiragana-mixed sentences by estimating the model parameters from a large Web corpus with an EM algorithm. Hayashi and Yamamura [2] reported that adding Hiragana words to the dictionary improves the accuracy of morphological analysis. Izutsu et al. [3] converted MeCab's ipadic dictionary into Hiragana and used a corpus consisting only of Hiragana to perform morphological analysis of Hiragana-only sentences. In addition, Izutsu and Komiya [4] performed morphological analysis of Hiragana sentences using a Bi-LSTM CRF model and reported how the accuracy changes with training and fine-tuning over multiple domains of sentences. Moriyama et al. [6] also performed morphological analysis of plain Hiragana sentences using a Recurrent Neural Network Language Model (RNNLM) and reported that its accuracy significantly outperformed conventional methods under the strictest criterion, in which an answer is deemed correct only when all word boundaries and word features are correct. Furthermore, Moriyama and Tomohiro [7] proposed a sequential morphological analysis method for Hiragana sentences using a Recurrent Neural Network and logistic regression and reported that the performance was improved and the system was sped up.

In addition, this study creates and uses a Hiragana BERT, a BERT model specialized for Hiragana sentences, to build a word segmentation model for Hiragana sentences. An example of a Japanese domain-specific BERT is the model of Suzuki et al. [8], who report the creation of a BERT specialized for the financial domain using financial documents. Their paper also examines the effectiveness of fine-tuning a BERT model pre-trained on a general text corpus with a financial corpus.

3 Proposed Method

BERT is a pre-trained language model based on the Transformer [9]. In this paper, we generated two types of Hiragana BERT models specialized for Hiragana sentences and used each to develop a word segmentation system for Hiragana sentences. The first is the unigram BERT model, which is trained on sentences represented as Hiragana character unigrams. The second is the bigram BERT model, which is trained on sentences represented as Hiragana character bigrams. We created word segmentation systems for Hiragana sentences by building the unigram and bigram BERT models and fine-tuning them with word segmentation data for Hiragana sentences. We compared the performances of these two Hiragana word segmentation systems. In addition, we created a word segmentation model for Hiragana sentences using KyTea and compared it to the proposed models. Such Hiragana word segmentation systems are useful for children and new learners of Japanese.

3.1 Unigram BERT Word Segmentation System

Unigram BERT is a BERT model trained on sentences represented as Hiragana character unigrams. We converted Wikipedia's Kanji-Kana mixed sentences into Hiragana, split them into character unigrams, and used the resulting Hiragana character unigram data to train the BERT model. Since Wikipedia has no data written only in Hiragana, the reading data from MeCab's analysis results were used as pseudo-correct answers. The reading data are Hiragana representations based on the words' pronunciation, which is what is normally used when writing in Hiragana.

The vocabulary size of the unigram BERT is 300. It includes Hiragana, Katakana, alphabetic characters, numerals, and various symbols.

We also created a word segmentation system for Hiragana sentences by fine-tuning the unigram BERT with word segmentation data for Hiragana sentences. We refer to this system as the unigram BERT word segmentation system. Depending on the experiment, we used either the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a corpus widely used in Japanese language research, or Wikipedia data for fine-tuning.
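As a concrete illustration of this fine-tuning step, the following minimal sketch frames word segmentation as binary token classification with the Hugging Face transformers library. It is not the authors' code; the model/tokenizer path "hiragana-unigram-bert" is a hypothetical placeholder for a pre-trained Hiragana unigram BERT.

```python
# Minimal sketch of fine-tuning a Hiragana unigram BERT for word segmentation
# as binary token classification. "hiragana-unigram-bert" is a hypothetical
# placeholder path, not an actual released model.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("hiragana-unigram-bert")
model = BertForTokenClassification.from_pretrained(
    "hiragana-unigram-bert", num_labels=2)  # 1 = word-initial character, 0 = otherwise

sentence = list("きょうはいいてんきです")    # "Today the weather is nice"
labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 marks the first character of each word

enc = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
# Align character-level labels with token positions; special tokens get -100
label_ids = [-100 if i is None else labels[i] for i in enc.word_ids()]

outputs = model(**enc, labels=torch.tensor([label_ids]))
outputs.loss.backward()  # a real training loop would follow with an optimizer step
```

Each Hiragana character is labeled 1 if it begins a word and 0 otherwise, matching the data format described in Sect. 4.2.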

3.2 Bigram BERT Word Segmentation System

Bigram BERT is a BERT model trained on sentences represented as Hiragana character bigrams. We converted Wikipedia's Kanji-Kana mixed sentences into Hiragana, split them into character bigrams, and used the resulting Hiragana character bigram data to train the BERT model. Because no Hiragana-only data are available, the bigram BERT model was, like the unigram BERT model, pre-trained on the reading data from MeCab's analysis results, which were used as pseudo-correct answers.

The vocabulary size of the bigram BERT is 80,956. It includes two-character combinations of Hiragana, Katakana, alphabetic characters, numerals, and various symbols.

We also created a word segmentation system for Hiragana sentences by fine-tuning the bigram BERT with word segmentation data for Hiragana sentences. We refer to this system as the bigram BERT word segmentation system. Depending on the experiment, we used either BCCWJ or Wikipedia data for fine-tuning.

4 Data

4.1 Pre-training Data from Wikipedia

We used Wikipedia for pre-training to create the two types of Hiragana BERT models: the unigram and bigram BERT models. The data were extracted from the Japanese Wikipedia site (Footnotes 5 and 6).

Since Wikipedia consists of Kanji-Kana mixed sentences, we converted them into Hiragana texts. For the conversion, as mentioned in Sect. 3.1, we utilized MeCab. MeCab is a morphological analyzer that segments Kanji-Kana mixed texts into words and also outputs the reading of each word. The reading is based on pronunciation and is what is normally used when writing in Hiragana. Therefore, the readings output by MeCab can be regarded as pseudo-correct word-segmented Hiragana text. However, note that the pseudo-correct answers contain errors because Japanese has many homographs, i.e., words with ambiguous pronunciations. For example, “今日は” can be pronounced KONNICHIWA, which means “hello,” or KYOWA, which means “as for today,” depending on the context. We employed UniDic as the dictionary for MeCab.
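The sketch below illustrates this conversion step under the assumption that fugashi, a common Python wrapper for MeCab, is installed together with a UniDic dictionary; it is not the authors' actual preprocessing script.

```python
# Minimal sketch: convert a Kanji-Kana mixed sentence into segmented Hiragana
# words using MeCab readings (via fugashi + UniDic, an assumption).
from fugashi import Tagger

def kata_to_hira(text: str) -> str:
    # Katakana and Hiragana blocks are offset by 0x60 code points
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)

tagger = Tagger()  # loads the installed UniDic dictionary

def to_hiragana_words(sentence: str) -> list[str]:
    words = []
    for token in tagger(sentence):
        kana = token.feature.kana  # Katakana reading; may be missing for symbols
        reading = kana if kana and kana != "*" else token.surface
        words.append(kata_to_hira(reading))
    return words

print(to_hiragana_words("今日は良い天気です"))
# e.g. ['きょう', 'は', 'よい', 'てんき', 'です'] -- pseudo-correct; homographs may be read wrongly
```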

After we obtained the Hiragana texts, we converted them into character unigrams and bigrams, respectively. The data converted to character unigrams were used as pre-training data for the unigram BERT, while the data converted to character bigrams were used as pre-training data for the bigram BERT. However, when forming bigrams, we appended the character “*” to the end of the sentence so that the number of bigram tokens matches the number of unigram tokens. Finally, we assigned [CLS] and [SEP] tags to the beginning and end of each sentence, respectively.
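As a concrete sketch of this step (our own illustration, not the authors' script), the following shows how a Hiragana sentence can be turned into the unigram and bigram token sequences with the “*” padding and [CLS]/[SEP] tags described above.

```python
# Minimal sketch: build the character-unigram and character-bigram token
# sequences used as pre-training data, padding bigrams with "*" so both
# sequences have the same length, then adding [CLS]/[SEP].
def to_unigrams(hira: str) -> list[str]:
    return list(hira)

def to_bigrams(hira: str) -> list[str]:
    chars = list(hira) + ["*"]  # pad so #bigrams equals #unigrams
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

def with_special_tokens(tokens: list[str]) -> list[str]:
    return ["[CLS]"] + tokens + ["[SEP]"]

hira = "きょうは"
print(with_special_tokens(to_unigrams(hira)))  # ['[CLS]', 'き', 'ょ', 'う', 'は', '[SEP]']
print(with_special_tokens(to_bigrams(hira)))   # ['[CLS]', 'きょ', 'ょう', 'うは', 'は*', '[SEP]']
```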

Table 1 shows example pre-training data for the unigram and bigram BERT models.

Table 1. Example of pre-training data

We generated 3 million sentences of Wikipedia data for pre-training through these steps. The data contents are identical, except for the representation as either bigrams or unigrams.

4.2 Word Segmentation Data for Hiragana Sentences from Wikipedia

From Wikipedia, we generated word segmentation data for Hiragana sentences to fine-tune the unigram and bigram BERT models. We obtained Hiragana texts from Wikipedia using MeCab as described in Sect. 4.1. MeCab outputs not only the readings but also the word boundaries. Therefore, we again used the output of MeCab, this time as training data for the word segmentation systems. However, we did not add [CLS] and [SEP] tags to the fine-tuning data.

We also created tag information consisting of 0s and 1s. We set the first unigram/bigram of each word's reading to 1 and the rest to 0. These tags are the labels for the word segmentation task; tag 1 marks the beginning of a word, i.e., a word boundary.
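A minimal sketch of this labeling step (our illustration, not the authors' code) is shown below; it takes the Hiragana words produced by the conversion step and emits the character unigrams together with their 0/1 labels.

```python
# Minimal sketch: derive 0/1 word-boundary labels from segmented Hiragana words.
# The first character of each word is labeled 1, all other characters 0.
def boundary_labels(words: list[str]) -> tuple[list[str], list[int]]:
    tokens, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            tokens.append(ch)
            labels.append(1 if i == 0 else 0)
    return tokens, labels

tokens, labels = boundary_labels(["きょう", "は", "いい", "てんき", "です"])
print(tokens)  # ['き', 'ょ', 'う', 'は', 'い', 'い', 'て', 'ん', 'き', 'で', 'す']
print(labels)  # [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
```

Because the “*” padding keeps the number of bigrams equal to the number of unigrams, the same label sequence can be reused for the bigram representation.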

Table 2 shows an example of word segmentation data for Hiragana sentences.

Table 2. An example of word segmentation of Hiragana sentences

We generated 1,000,000 sentences of Wikipedia word segmentation data for Hiragana sentences through these steps. The contents of the generated data are identical, except for the representation as either unigrams or bigrams.

4.3 Word Segmentation Data for Hiragana Sentences from BCCWJ

In contrast to Wikipedia, the core data of the BCCWJ contain information on word boundaries and readings. Since these annotations are automatically tagged and then manually revised, the word boundaries are accurate. However, the readings of monographs are sometimes unknown; in these cases, the readings were determined by the annotators, and guidelines were sometimes provided for determining them. We utilized the core data of the BCCWJ for testing and for fine-tuning of word segmentation.

We extracted the reading data of the BCCWJ core data and converted them into character unigrams and bigrams. We also created tag information consisting of 0s and 1s to indicate the word boundaries, following the procedure described in Sect. 4.2. The format of the word segmentation data for Hiragana sentences from the BCCWJ is the same as that from Wikipedia, as shown in Table 2. These operations resulted in 40,928 sentences of BCCWJ Hiragana data segmented into words.

4.4 Data for the Hiragana KyTea Word Segmentation System

To train a word segmentation system for Hiragana sentences using KyTea, we used the reading data of the BCCWJ core data. Because KyTea does not use a pre-trained language model, there was no pre-training for the KyTea model.

Table 3 shows an example of the data used to train the Hiragana KyTea word segmentation system. Hiragana words with word boundaries are directly used to train the Hiragana KyTea word segmentation system.

Table 3. Data used to train the Hiragana KyTea word segmentation system

5 Experiment

We conducted two experiments to test how the accuracy of word segmentation of Hiragana sentences varies with the amount and type of data used for fine-tuning the two types of BERT. In the experiments, the accuracies of the unigram BERT, bigram BERT, and Hiragana KyTea word segmentation systems were compared.

5.1 Experiment 1: Fine-Tuning with BCCWJ

The first experiment was a fine-tuning experiment using BCCWJ. This experiment compared three word segmentation systems using accurate segmentation information for Hiragana sentences. We used 3 million sentences from Wikipedia to pre-train the Hiragana BERT models and 40,928 sentences from BCCWJ to fine-tune and test the BERT models using five-fold cross-validation. The ratio of data for fine-tuning, validation, and testing is 3:1:1.
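One possible reading of this setup is sketched below (our illustration; the exact fold assignment is an assumption): in each fold, one fifth of the sentences is used for testing, one fifth for validation, and the remaining three fifths for fine-tuning, giving the 3:1:1 ratio.

```python
# Minimal sketch of the five-fold cross-validation with a 3:1:1
# fine-tuning/validation/test split (the fold rotation is our assumption).
def five_fold_splits(sentences: list[str]):
    k = 5
    folds = [sentences[i::k] for i in range(k)]
    for t in range(k):
        test = folds[t]
        valid = folds[(t + 1) % k]
        train = [s for i in range(k)
                 if i not in (t, (t + 1) % k) for s in folds[i]]
        yield train, valid, test

for train, valid, test in five_fold_splits([f"sent{i}" for i in range(10)]):
    print(len(train), len(valid), len(test))  # 6 2 2 for this toy example
```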

We assessed the Hiragana KyTea word segmentation system using 40,928 sentences from BCCWJ with five-fold cross-validation. These data are the same as those used for the experiments with the two BERT models. However, Wikipedia data were not used in the KyTea word segmentation system. The ratio of training to test data is 4:1.

Tables 4 and 5 list the parameters used in BERT pre-training and fine-tuning, respectively. These parameters were determined through preliminary experiments using the validation data.

Table 4. Parameters in pre-training
Table 5. Parameters in fine-tuning

5.2 Experiment 2: Fine-Tuning with Wikipedia

The second experiment was a fine-tuning experiment using Wikipedia, in which a large amount of pseudo-data (Wikipedia word segmentation information) was used to test the accuracy of the three word segmentation systems. In this experiment, we used 3 million sentences from Wikipedia to pre-train the BERT models and 1 million sentences from Wikipedia to fine-tune the word segmentation of Hiragana sentences. The pre-training and fine-tuning data did not overlap; however, the pre-training data used in Experiments 1 and 2 were identical. The data used for training the Hiragana KyTea word segmentation system were the same as the fine-tuning data for the unigram and bigram BERT word segmentation systems. We used 400,000 sentences from Wikipedia and 40,928 sentences from BCCWJ, both word-segmented Hiragana sentences, as test data. The Wikipedia test data did not overlap with the pre-training data for the BERT models.

In Experiment 2, the parameters used for BERT pre-training and fine-tuning were identical to those used in Experiment 1, except for the number of epochs. The number of epochs in Experiment 2 was 24.

5.3 Evaluation Methods

The unigram and bigram BERT word segmentation systems accept sentences as input. The input data formats are character unigrams for the unigram BERT word segmentation system and character bigrams for the bigram BERT word segmentation system (Table 2). Each system estimates and outputs a 0 or 1 tag for each character unigram or bigram, indicating whether a word boundary begins at that position. We evaluated tag-based accuracy as well as word-boundary-based precision, recall, and F-measure.

The Hiragana KyTea word segmentation system directly outputs word boundary information instead of 0 and 1 tags. Therefore, we converted its outputs into 0 and 1 tags and evaluated the same metrics: tag-based accuracy and word-boundary-based precision, recall, and F-measure.
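The sketch below shows one common way to compute these metrics from gold and predicted 0/1 tag sequences; it is our illustration of the metric definitions, not necessarily the authors' exact scorer.

```python
# Minimal sketch: tag-based accuracy plus boundary-based precision/recall/F,
# where a predicted "1" counts as correct if the gold tag at that position is "1".
def evaluate(gold: list[int], pred: list[int]):
    assert len(gold) == len(pred)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

    gold_b = {i for i, t in enumerate(gold) if t == 1}
    pred_b = {i for i, t in enumerate(pred) if t == 1}
    tp = len(gold_b & pred_b)
    precision = tp / len(pred_b) if pred_b else 0.0
    recall = tp / len(gold_b) if gold_b else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f

gold = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
print(evaluate(gold, pred))  # (0.909..., 1.0, 0.8, 0.888...)
```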

6 Results

Table 6 lists the accuracy, precision, recall, and F-measure of the five-fold cross-validation tests for each system in Experiment 1: fine-tuning with BCCWJ.

Table 6. Experiment 1: Results of each system in the fine-tuning experiments with BCCWJ

As summarized in Table 6, the unigram BERT word segmentation system improved the F-measure by 4.64 points compared with the Hiragana KyTea word segmentation system, and the bigram BERT word segmentation system improved the F-measure by 2.92 points compared with the Hiragana KyTea word segmentation system. Furthermore, comparing the unigram and bigram BERT word segmentation systems, the F-measure of the unigram system was 1.72 points higher than that of the bigram system.

Table 7 summarizes the results of Experiment 2: Fine-tuning with Wikipedia.

Table 7. Experiment 2: Results of each system in the fine-tuning experiments with Wikipedia

As summarized in Table 7, the unigram BERT word segmentation system improved the F-measure by 5.69 points when tested on Wikipedia and by 4.59 points when tested on the BCCWJ core data, compared with the Hiragana KyTea word segmentation system. The bigram BERT word segmentation system also exhibited a 4.99-point improvement in F-measure when tested on Wikipedia and a 3.77-point improvement when tested on the BCCWJ core data, compared with the Hiragana KyTea word segmentation system. Furthermore, when comparing the F-measures of the unigram and bigram BERT word segmentation systems, the F-measure of the unigram BERT word segmentation system was higher: the difference was 0.70 points when testing on Wikipedia and 0.82 points when testing on the BCCWJ core data.

7 Discussion

From Table 6, we can confirm that the F-measures of the two Hiragana BERT word segmentation systems were higher than those of the Hiragana KyTea word segmentation system in Experiment 1. Furthermore, as summarized in Table 7, the F-measures of the two Hiragana BERT word segmentation systems were also higher than those of the Hiragana KyTea word segmentation system in Experiment 2. This result is expected because the Hiragana KyTea word segmentation system does not use any language model pre-trained on a large amount of data.

Comparing the F-measures of the unigram and bigram BERT word segmentation systems in Tables 6 and 7, we can confirm that the F-measures of the unigram BERT word segmentation system are higher than those of the bigram BERT word segmentation system. Because bigrams are more informative than unigrams, we expected the bigram BERT word segmentation system to outperform the unigram BERT word segmentation system. However, the results were the opposite. A possible reason is the amount of training data required relative to the model size. The vocabulary size of the Hiragana BERT models used in this study was 300 for the unigram BERT and 80,956 for the bigram BERT; in other words, the vocabulary of the bigram BERT was approximately 270 times larger than that of the unigram BERT. A larger vocabulary makes the model larger and thereby requires more training data. However, the pre-training data for the two Hiragana BERT models comprised 3 million sentences in both cases. In other words, the training data may have been insufficient for the bigram BERT given its model size, which may explain why the unigram BERT word segmentation system outperformed the bigram BERT word segmentation system.

Noting the significant difference in vocabulary size, we recalculated the results of each system using the test data from Experiment 1 after excluding symbols and rare character types, such as emojis. The character types retained in the test data were Hiragana, Katakana, punctuation marks, dashes for long vowels, and spaces. That is, we calculated the results for each system by inputting only sentences composed solely of the aforementioned character types; sentences containing any other character types were not evaluated. Table 8 lists the accuracy, precision, recall, and F-measure of the five-fold cross-validation for each system in this additional experiment.

Table 8. Experiment 1: Accuracy of each system when symbols and rare character types are removed from the test data in the fine-tuning experiment with BCCWJ.

Comparing the results in Table 8 with those in Table 6 shows that restricting the character types improved the results of Experiment 1. In addition, as listed in Table 8, the difference in F-measure between the unigram and bigram BERT word segmentation systems is 1.55 points, whereas the corresponding difference in Table 6 was 1.72 points, indicating that restricting the character types in the test data reduces the gap between the unigram and bigram BERT systems.

Next, we compared the results of Experiments 1 and 2 when testing on BCCWJ, i.e., the fine-tuning experiments with BCCWJ and with Wikipedia (Tables 6 and 7). The results of the unigram/bigram Hiragana BERT word segmentation systems in Experiment 1 were better than those in Experiment 2. We believe this is because the fine-tuning data in Experiment 1 were from BCCWJ, the same corpus as the test data, whereas Experiment 2 used Wikipedia data. In addition, the quality of the Wikipedia data is considered lower than that of the BCCWJ data because BCCWJ provides accurate readings and word segmentation information, whereas the Wikipedia data are pseudo-data. Considering that the BCCWJ data used in Experiment 1 comprised approximately 41,000 sentences, whereas the Wikipedia data used in Experiment 2 comprised 1 million sentences, it is clear that even a large amount of pseudo-data for fine-tuning cannot match accurate data from the same domain as the test data.

However, when given a large amount of Wikipedia pseudo-data, the accuracy of the unigram/bigram Hiragana BERT word segmentation systems on the Wikipedia test data exceeded 99% (Table 7). Therefore, fine-tuning with a large amount of data from the same domain as the test data, with word segmentation criteria consistent with those of the test data, can produce word segmentation with reasonably high accuracy.

Finally, the amount of pre-training data for the BERT models used in this study was 3 million sentences; increasing this amount may further improve the accuracy of the Hiragana BERT word segmentation systems. We leave this for future work.

This research has some limitations. Training the unigram and bigram BERT models takes time. Our method relies on a Kanji-Kana-to-Hiragana converter to preprocess the sentences. We did not compare our method with methods used for other languages without word boundaries, and we did not test trigram or longer n-gram models.

8 Conclusions

In this study, we created word segmentation systems using two types of BERT trained specifically for Hiragana sentences: the unigram and bigram BERT word segmentation systems. For the pre-training of BERT, we used character unigrams or character bigrams created from Wikipedia Hiragana sentence data obtained with MeCab. Thereafter, each BERT was fine-tuned using word segmentation data for Hiragana sentences. We conducted fine-tuning experiments using BCCWJ and Wikipedia. For the fine-tuning experiment with BCCWJ, we evaluated the systems using five-fold cross-validation; for the fine-tuning experiment with Wikipedia, we tested the systems on both BCCWJ and Wikipedia data. In these experiments, the accuracy, precision, recall, and F-measure of the unigram/bigram Hiragana BERT word segmentation systems outperformed those of the Hiragana KyTea word segmentation system. Additionally, the results of the unigram Hiragana BERT word segmentation system surpassed those of the bigram Hiragana BERT word segmentation system. We believe that this is because, relative to its vocabulary size, the bigram BERT had effectively less pre-training data than the unigram BERT. The experiments also showed that a small amount of in-domain data was better for fine-tuning than a large amount of out-of-domain pseudo-data.