Abstract
This paper compares pretraining a transformer on natural language texts with pretraining it on sentences of a synthetic pseudo-language. The artificial texts were generated automatically according to rules written in a context-free grammar. Fine-tuning results on the tasks of the RussianSuperGLUE project showed, with statistical reliability, that the two models achieved the same scores. Thus, the use of artificial texts facilitates AI safety, because the composition of the dataset can be fully controlled. Moreover, at the pretraining stage a RoBERTa-like model only needs to learn to recognize the syntactic and morphological patterns of the language, which can be successfully produced in a fairly simple way, such as with a context-free grammar.
REFERENCES
D. Yu. Turdakov, A. I. Avetisyan, K. V. Arkhipenko, A. V. Antsiferova, D. S. Vatolin, S. S. Volkov, A. V. Gasnikov, D. A. Devyatkin, M. D. Drobyshevsky, A. P. Kovalenko, M. I. Krivonosov, N. V. Lukashevich, V. A. Malykh, S. I. Nikolenko, I. V. Oseledets, A. I. Perminov, I. V. Sochenkov, M. M. Tikhomirov, A. N. Fedotov, and M. Yu. Khachay, “Trusted artificial intelligence: Challenges and promising solutions,” Dokl. Math. 106, Suppl. 1, S9–S13 (2022).
I. Shumailov, Z. Shumaylov, D. Kazhdan, Y. Zhao, N. Papernot, M. A. Erdogdu, and R. Anderson, “Manipulating SGD with data ordering attacks” (2021). https://doi.org/10.48550/arXiv.2104.09667
H. Kataoka, K. Okayasu, A. Matsumoto, E. Yamagata, R. Yamada, N. Inoue, A. Nakamura, and Y. Satoh, “Pre-training without natural images” (2020). https://doi.org/10.48550/arXiv.2101.08515
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach” (2019). https://doi.org/10.48550/arXiv.1907.11692
JSpeech Grammar Format. www.w3.org/TR/2000/NOTE-jsgf-20000605. Accessed May 8, 2023.
M. T. Baranov, T. A. Kostyaeva, and A. V. Prudnikova, Russian Language: Reference Materials, Manual for Students, Ed. by N. M. Shanskii, 4th ed. (Prosveshchenie, Moscow, 1988) [in Russian].
N. V. Lukashevich, Thesauri in Information Retrieval Tasks (Mosk. Gos. Univ., Moscow, 2011) [in Russian].
T. Shavrina and O. Shapovalova, “To the methodology of corpus construction for machine learning: ‘Taiga’ syntax tree corpus and parser,” in Proceedings of the International Conference “Corpus Linguistics-2017” (2017), pp. 78–84.
T. Shavrina, A. Fenogenova, A. Emelyanov, D. Shevelev, E. Artemova, V. Malykh, V. Mikhailov, M. Tikhonova, A. Chertok, and A. Evlampiev, “RussianSuperGLUE: A Russian language understanding evaluation benchmark,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), pp. 4717–4726.
M. Korobov, “Morphological analyzer and generator for Russian and Ukrainian languages,” in Analysis of Images, Social Networks and Texts (Springer, Cham, 2015), pp. 320–332. https://doi.org/10.48550/arXiv.1503.07283
M. Straka and J. Straková, “Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe” (2017). https://doi.org/10.18653/v1/K17-3009
M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing” (2017).
JSGFTools: Some tools for JSGF grammar expansion. https://github.com/syntactic/JSGFTools. Accessed May 10, 2023.
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Appendices
APPENDIX A
DESCRIPTION OF THE CREATED RULES OF THE SYNTHETIC PSEUDO-LANGUAGE
We will give examples of rules for each group and provide comments.
“Subject group.” The grammar provides rules for each gender and number of the subject. The person is also taken into account, which is necessary for the convenience of working with predicate tenses in the future.
Figure 3 shows some rules for one of the subject types. Here, <agreed_attribute_s_masculine> represents the rule for an agreed attribute of the subject, <masculine_nouns> stands for a noun in the masculine singular nominative case acting as the subject, and <inconsistent_attribute_s> is the rule for an inconsistent attribute of the subject. According to the JSGF grammar syntax, elements in square brackets are optional.
Because the JSGFTools module does not currently support the Kleene closure operator, allowing the agreed attribute to appear zero, one, or two times is a compromise that reflects the features of the Russian sentence in the grammar without complicating the work of the module itself.
Each rule that specifies a part of speech is a list of words of the form <masculine_nouns> = (elephant|hamster|cat), that is, a so-called set of alternatives.
To obtain other word orders in the subject group, all possible permutations of the elements [<agreed_attribute_s_masculine>], <masculine_nouns>, and [<inconsistent_attribute_s>] are generated.
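As a minimal Python sketch, the interaction of optional elements and permutations can be enumerated as follows (the vocabulary echoes the (elephant|hamster|cat) example and is purely illustrative, not taken from the actual grammar):

```python
from itertools import permutations

# Toy stand-ins for the grammar elements (illustrative only):
# [<agreed_attribute_s_masculine>] <masculine_nouns> [<inconsistent_attribute_s>]
agreed_attribute = [None, "gray"]                # optional: absent or present
noun = "elephant"                                # subject noun, always present
inconsistent_attribute = [None, "from the zoo"]  # optional

variants = set()
for attr in agreed_attribute:
    for inc in inconsistent_attribute:
        elements = [e for e in (attr, noun, inc) if e is not None]
        for order in permutations(elements):
            variants.add(" ".join(order))

# 1 bare noun + 2 orders per single attribute + 6 orders with both attributes
assert len(variants) == 11
```

The quadratic growth of surface orders with each optional element is exactly why the grammar enumerates permutations explicitly rather than fixing one word order.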
The rules for other types of subjects are specified similarly. A plural subject can be expressed not only by a plural noun but also by a combination of the form (<masculine_nouns>|<feminine_nouns>|<neuter_nouns>) with <ablative_nouns>.
“Predicate group.” The grammar provides rules for the predicate, which can be a simple verb, a compound verb, or a compound nominal predicate.
Figure 4 shows an example of a rule for a predicate of one of these types. Here, <adverbial_pr> represents the rule for an adverbial of manner; <transitive_verbs_second_person> stands for a transitive verb in the 2nd person, singular, present or future tense, indicative mood; <auxiliary_transitive_verbs_second_person> is a transitive auxiliary verb in the same form; <infinitives> denotes the infinitive; and <imperative_transitive_verbs_second_person> designates a transitive verb in the 2nd person, singular, imperative mood.
As with the agreed definitions, due to the peculiarities of the generation module, we had to look for an alternative to the Kleene closure operator.
We can see that the rule reflects all possible orders at once. Each rule representing a part of speech is specified, as in the subject group, using a list of words separated by vertical bars. The rules for other types of predicates are specified similarly. To obtain the conditional mood, the optional element [<particles_for_conditional>], which defines the particles “бы” and “б,” is added to the rules with past-tense verbs. A compound nominal predicate is obtained by combining the auxiliary-verb rule with the nominal part, which is formed by a noun in the instrumental case or by the conjunctions “как,” “будто,” “словно,” or “точно” with a noun, adjective, pronoun, participle, or numeral in the nominative case.
“Object group.” The grammar provides rules for objects in each of the five oblique cases. Consider the example in Fig. 5. Here, <agreed_attribute_o_masculine_genetive>, <agreed_attribute_o_feminine_genetive>, <agreed_attribute_o_neuter_genetive>, and <agreed_attribute_o_plural_genetive> are rules for an agreed attribute of the object; <masculine_nouns_genetive>, <feminine_nouns_genetive>, <neuter_nouns_genetive>, and <plural_nouns_genetive> represent the object itself; and <inconsistent_attribute_o> corresponds to an inconsistent attribute. All these rules are specified analogously to the corresponding rules in the subject group. Other word orders are obtained by rearranging the elements of <object_group_direct_order_genetive>.
“Adverbial group.” Figure 6 shows an example of a rule for the adverbial with one of the group’s word orders. Here, <gerunds> is the rule for gerunds, <particles> is the rule for a particle, and <adverbial_modifiers_of_time_place> denotes the rule for adverbials of place and time, which is specified through a preposition with the semantics of place or time and a noun in the corresponding case. Other orders are again obtained by rearranging the elements.
Rules from which sentences are formed. They can be divided into two groups: sentences with a transitive verb, which requires the use of an object in the accusative case, and sentences with an intransitive verb, which requires a combination of a preposition and an object in the appropriate case. The number of rules is related to the morphological characteristics of the predicate and the number of permutations required to generate all possible word orders.
Let us look at the example in Fig. 7. One of the word orders is selected in the subject group, and likewise in the predicate, object, and adverbial groups. The adverbial and the particle have no fixed position within a sentence and can appear anywhere between the other groups.
To generate the object of a transitive verb, a combination of the rules of the preposition and the object group in a certain case is used, for example:
<prepositions_genetive> (<object_group_direct_order_genetive> |
<object_group_indirect_order_1_genetive> | <object_group_indirect_order_2_genetive>).
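The effect of such a combination rule can be sketched in Python (toy word lists, purely illustrative; the real rules contain full sets of alternatives):

```python
from itertools import product

# Hypothetical alternatives for <prepositions_genetive> and for the three
# word orders of the genitive object group (illustrative vocabulary)
prepositions = ["without", "near"]
object_orders = ["a gray elephant", "an elephant gray", "a zoo elephant"]

# Every preposition combines with every object-group order
phrases = [f"{prep} {obj}" for prep, obj in product(prepositions, object_orders)]
assert len(phrases) == len(prepositions) * len(object_orders)  # 2 * 3 = 6
```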
To launch generation with the expansion script, a public rule was created containing a set of alternatives over all of the sentence-forming rules (in JSGF, public rules serve as entry points for expansion).
APPENDIX B
PRETRAINING PARAMETERS
For pretraining of both models, we specified the same algorithms and parameters.
We used the byte-level byte pair encoding (BBPE) tokenizer from the Transformers library. When launching it, we specified the following parameters:
• vocab_size = 40 000 is the size of the future vocabulary, i.e., the number of distinct tokens into which the tokenizer can split the input text.
• min_frequency = 2 is the minimum frequency threshold for merging a pair of tokens.
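In code, training such a tokenizer might look as follows (a sketch using the HuggingFace tokenizers package; the corpus file name and the special-token list are our assumptions, not given in the paper):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],   # hypothetical path to the (pseudo-)language corpus
    vocab_size=40_000,      # size of the future vocabulary
    min_frequency=2,        # minimum frequency threshold for merges
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```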
After obtaining the tokenizer files (the vocabulary of tokenized substrings with their indices, together with the merge list), the model configuration was defined. We chose the base RoBERTa model for pretraining and set the following parameters:
• vocab_size = 40 000 is the size of the vocabulary created by the tokenizer.
• max_position_embeddings = 512 is the maximum sequence length.
• num_attention_heads = 12 is the number of attention heads in each attention layer of the encoder.
• num_hidden_layers = 12 is the number of hidden layers in the encoder.
• type_vocab_size = 1 is the size of the token-type (segment) vocabulary.
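Assembled in code, the configuration might look like this (a sketch with the transformers library, using the parameter values listed above):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=40_000,            # vocabulary created by the tokenizer
    max_position_embeddings=512,  # maximum sequence length
    num_attention_heads=12,       # attention heads per encoder layer
    num_hidden_layers=12,         # hidden layers in the encoder
    type_vocab_size=1,            # single token-type (segment) id
)
```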
In addition, we defined a DataCollatorForLanguageModeling(), which masks tokens in the data with a probability of 15%. The optimizer was 8-bit Adam from the bitsandbytes library, which combines the efficiency of the original Adam algorithm with lower memory consumption thanks to a quantized (8-bit) representation of the history of the loss-function gradients with respect to the weights.
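The masking behavior can be illustrated with a stdlib-only sketch (simplified: the real DataCollatorForLanguageModeling additionally keeps 10% of the selected tokens unchanged and replaces another 10% with random tokens):

```python
import random

def mask_tokens(token_ids, mask_id, mlm_probability=0.15, seed=0):
    """Independently replace each token by mask_id with probability mlm_probability."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            masked.append(mask_id)   # hide the token from the model
            labels.append(tok)       # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(-100)      # conventional "ignore this position" label
    return masked, labels

masked, labels = mask_tokens(list(range(1000)), mask_id=-1)
share = sum(lab != -100 for lab in labels) / len(labels)  # roughly 0.15
```

The -100 label is the usual convention for positions excluded from the cross-entropy loss, so the model is trained only on the masked positions.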
Cite this article
Gorbacheva, T.E., Bondarenko, I.Y. Safe Pretraining of Deep Language Models in a Synthetic Pseudo-Language. Dokl. Math. 108 (Suppl 2), S494–S502 (2023). https://doi.org/10.1134/S1064562423701636