
Safe Pretraining of Deep Language Models in a Synthetic Pseudo-Language


Abstract

This paper compares the pretraining of a transformer on natural-language texts with pretraining on sentences of a synthetic pseudo-language. The artificial texts are generated automatically according to rules written in a context-free grammar. Fine-tuning on the tasks of the RussianSuperGLUE benchmark showed, with statistical reliability, that the two models achieve the same scores. Hence, the use of artificial texts facilitates AI safety, because the composition of the pretraining dataset can be fully controlled. In addition, at the pretraining stage a RoBERTa-like model only needs to learn to recognize the syntactic and morphological patterns of the language, and these can be created successfully in a fairly simple way, for example, with a context-free grammar.


Notes

  1. https://github.com/GorbachevaTaisia/JSGF_generative_grammar.

  2. https://spacy.io/models/ru.

  3. https://github.com/GorbachevaTaisia/JSGFTools__.

  4. https://huggingface.co/tay-yozhik/NaturalRoBERTa, https://huggingface.co/tay-yozhik/SyntheticRoBERTa.

  5. https://huggingface.co/docs/transformers/index.

  6. https://github.com/TimDettmers/bitsandbytes.


Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information

Correspondence to T. E. Gorbacheva or I. Y. Bondarenko.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

APPENDIX A

1.1 DESCRIPTION OF THE CREATED RULES OF SYNTHETIC PSEUDO-LANGUAGE

We will give examples of rules for each group and provide comments.

“Subject group.” The grammar provides rules for each gender and number of the subject. The person is also taken into account, which is needed later for convenient handling of predicate tenses.

Figure 3 shows some rules for one of the subject types. Here,

Fig. 3. Rules for masculine singular 3rd person subjects with direct word order.

<agreed_attribute_s_masculine> represents a rule for an agreed attribute of the subject, <masculine_nouns> stands for a noun in the masculine singular nominative case acting as the subject, and <inconsistent_attribute_s> is a rule for an inconsistent attribute of the subject. According to the JSGF grammar syntax, elements in square brackets are optional.

Because the JSGFTools module does not currently support the Kleene closure operator, we believe that allowing the agreed attribute to occur zero times, once, or twice is a compromise solution that reflects the features of the Russian sentence in the grammar without complicating the work of the module itself.

Each rule that specifies a part of speech is a list of words of the form <masculine_nouns> = (elephant|hamster|cat), that is, a so-called set of alternatives.

To obtain other word orders in the subject group, all possible permutations of the elements [<agreed_attribute_s_masculine>], <masculine_nouns>, and [<inconsistent_attribute_s>] are generated.

The rules for other types of subjects are specified similarly. A plural subject can be expressed not only by a plural noun, but also by a combination of the form (<masculine_nouns>|<feminine_nouns>|<neuter_nouns>) with <ablative_nouns>.
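To make the mechanism concrete, the following Python sketch expands a miniature analogue of the subject-group rules. It is only an illustration under stated assumptions: the word lists and variable names are invented placeholders, not the article's grammar, and the actual generation is performed by JSGFTools from the JSGF rules.

# Hypothetical Python analogue of the subject-group expansion; the word lists
# are placeholders, not the article's vocabulary.
from itertools import permutations, product

agreed_attributes = ["big", "gray"]                # cf. <agreed_attribute_s_masculine>
masculine_nouns = ["elephant", "hamster", "cat"]   # cf. <masculine_nouns>
inconsistent_attributes = ["of the neighbor"]      # cf. <inconsistent_attribute_s>

# Zero, one, or two agreed attributes emulate the missing Kleene closure.
attribute_options = [()] + [(a,) for a in agreed_attributes] \
    + [(a, b) for a in agreed_attributes for b in agreed_attributes]
inconsistent_options = [()] + [(x,) for x in inconsistent_attributes]

def subject_groups():
    """Yield every subject group in every order of its constituents."""
    for attrs, noun, incons in product(attribute_options, masculine_nouns,
                                       inconsistent_options):
        constituents = ([" ".join(attrs)] if attrs else []) + [noun] + list(incons)
        for order in set(permutations(constituents)):
            yield " ".join(order)

for phrase in sorted(set(subject_groups()))[:5]:
    print(phrase)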

“Predicate group.” The grammar provides rules for the predicate, which can be simple, compound verbal, or compound nominal.

Figure 4 shows an example of a rule for a predicate of one of the types. Here, <adverbial_pr> represents a rule for an adverbial of manner; <transitive_verbs_second_person> stands for a transitive verb in the 2nd person; <auxiliary_transitive_verbs_second_person> is for a transitive auxiliary verb in the 2nd person, singular, present or future tense, indicative mood; <infinitives> denotes the infinitive; and <imperative_transitive_verbs_second_person> designates a transitive verb in the 2nd person, singular, imperative mood.

Fig. 4. The rule for a predicate expressed by a transitive verb in the 2nd person, singular, present or future tense, indicative or imperative mood.

As with the agreed attributes, due to the peculiarities of the generation module, we had to look for an alternative to the Kleene closure operator.

We can see that the rule reflects all possible orders at once. Each rule representing a part of speech, as in the subject group, is specified using a list of words separated by vertical bars. The rules for other types of predicates are specified similarly. To obtain the conditional mood, the optional element [<particles_for_conditional>], which defines the particles @#бы#@ and @#б#@, is added to the rules with verbs in the past tense. A compound nominal predicate is obtained by combining the rules for the auxiliary verb and the nominal part; the latter is obtained using a noun in the instrumental case or the conjunctions @#как, будто, словно, точно#@ with a noun, adjective, pronoun, participle, or numeral in the nominative case.

“Object group.” The grammar provides rules for objects in one of five oblique cases. Let us look at the example in Fig. 5. Here, <agreed_attribute_o_masculine_genetive>, <agreed_attribute_o_feminine_genetive>, <agreed_attribute_o_neuter_genetive>, and <agreed_attribute_o_plural_genetive> are rules for an agreed attribute; <masculine_nouns_genetive>, <feminine_nouns_genetive>, <neuter_nouns_genetive>, and <plural_nouns_genetive> represent the object itself; and <inconsistent_attribute_o> corresponds to an inconsistent attribute. All these rules are defined in the same way as the corresponding rules in the subject group. Other orders are obtained by rearranging the elements of <object_group_direct_order_genetive>.

Fig. 5. Rule for an object in the genitive case with direct word order (other cases are specified similarly).

“Adverbial group.” Figure 6 shows an example of a rule for the adverbial with one of the element orders. Here, <gerunds> is the rule for gerunds, <particles> is the rule for a particle, and <adverbial_modifiers_of_time_place> denotes the rule for adverbials of place and time, which is specified by a preposition with the semantics of place or time and a noun in the corresponding case. Other orders are again obtained by rearranging the elements.

Fig. 6. Rules for the adverbial.

“Rules from which sentences are formed.” They can be divided into two groups: sentences with a transitive verb, which requires an object in the accusative case, and sentences with an intransitive verb, which requires a combination of a preposition and an object in the appropriate case. The number of rules is determined by the morphological characteristics of the predicate and by the number of permutations required to generate all possible word orders.

Let us look at the example in Fig. 7. One of the orders is selected in the subject group, and likewise in the predicate, object, and adverbial groups. The adverbial and the particle do not have a fixed position within a sentence and can be located anywhere between the other groups.

Fig. 7. Rule for a sentence with a transitive verb with direct word order for the 1st person, singular.

To generate the object of an intransitive verb, a combination of the rule for the preposition and the rule for the object group in a certain case is used, for example:

<prepositions_genetive> (<object_group_direct_order_genetive> | <object_group_indirect_order_1_genetive> | <object_group_indirect_order_2_genetive>)

To start generation, a public rule (marked with the JSGF keyword public) was created; it contains a set of alternatives comprising all the sentence-generation rules.
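As a rough illustration of how the groups are combined at the sentence level, the following Python sketch mimics a rule for a sentence with a transitive verb and a top-level choice among sentence rules. It is again only an analogy under stated assumptions: the group generators, the example strings, and the random placement of the adverbial are invented placeholders, not the article's JSGF grammar.

# Hypothetical sketch of the sentence-level rules; every list of strings below
# is a placeholder for the corresponding group of JSGF rules described above.
import random

def subject_group():              # stand-in for a chosen subject-group order
    return random.choice(["the gray cat", "the cat"])

def predicate_group():            # stand-in for a predicate variant
    return random.choice(["sees", "wants to see"])

def object_group_accusative():    # object of a transitive verb, accusative case
    return random.choice(["a big elephant", "an elephant"])

def adverbial_group():            # optional adverbial of place or time
    return random.choice(["", "in the morning"])

def transitive_sentence():
    """Analogue of a rule for a sentence with a transitive verb: one variant is
    chosen in each group; the adverbial may appear between any of the groups."""
    parts = [subject_group(), predicate_group(), object_group_accusative()]
    adverbial = adverbial_group()
    if adverbial:
        parts.insert(random.randint(0, len(parts)), adverbial)
    return " ".join(parts)

# Analogue of the public start rule: a set of alternatives over all sentence
# rules (only one rule is shown here).
SENTENCE_RULES = [transitive_sentence]
print(random.choice(SENTENCE_RULES)())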

APPENDIX B

1.1 PRETRAINING PARAMETERS

For pretraining of both models, we specified the same algorithms and parameters.

We used the byte-level byte pair encoding (BBPE) tokenizer from the Transformers library (Note 5). When training it, we specified the following parameters (a minimal sketch of the training call is given after the list):

• vocab_size = 40 000 is the size of the future vocabulary, i.e., the number of distinct tokens into which the tokenizer can split the input text.

• min_frequency = 2 is the minimum frequency a token pair must have in order to be merged.
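A minimal sketch of such a training call, using the ByteLevelBPETokenizer class from the Hugging Face tokenizers package, is shown below; the corpus path is a placeholder, and the sketch is an assumption about the setup rather than the article's exact script.

# Sketch of BBPE tokenizer training; the corpus path is a placeholder.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],        # natural-language or synthetic pretraining texts
    vocab_size=40_000,           # size of the future vocabulary
    min_frequency=2,             # minimum frequency for a pair to be merged
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_dir")   # writes vocab.json and merges.txt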

After obtaining the tokenizer files, which contain the vocabulary of subword units with their indices and the merge rules, the model configuration was defined. We chose the base RoBERTa model for pretraining, with the following parameters (a sketch of the configuration is given after the list):

• vocab_size = 40 000 is the size of the vocabulary created by the tokenizer.

• max_position_embeddings = 512 is the maximum sequence length.

• num_attention_heads = 12 is the number of attention heads in each attention layer of the encoder.

• num_hidden_layers = 12 is the number of hidden layers in the encoder.

• type_vocab_size = 1 is the size of the token type (segment) vocabulary.
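In terms of the Transformers API, this configuration could be instantiated as follows; this is a sketch under the assumption that the standard RobertaConfig and RobertaForMaskedLM classes were used, which the article does not state explicitly.

# Sketch of the model configuration with the parameters listed above.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=40_000,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)
print(f"parameters: {model.num_parameters():,}")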

In addition, we used the DataCollatorForLanguageModeling() collator, which masks tokens in the data with a probability of 15%. The training algorithm was 8-bit Adam from the bitsandbytes library (Note 6), which combines the efficiency of the original Adam algorithm with more economical memory consumption due to a quantized (8-bit) representation of the optimizer state, i.e., the history of the gradients of the loss function with respect to the weights.
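A sketch of the masking collator and the 8-bit optimizer follows; the tokenizer directory refers to the BBPE sketch above, and the learning rate is a placeholder rather than a value reported in the article.

# Sketch: MLM data collator with 15% masking and 8-bit Adam from bitsandbytes.
import bitsandbytes as bnb
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer_dir")  # BBPE files from above
model = RobertaForMaskedLM(RobertaConfig(vocab_size=40_000, type_vocab_size=1))

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,       # mask tokens with a probability of 15%
)
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)  # placeholder learning rate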


Cite this article

Gorbacheva, T.E., Bondarenko, I.Y. Safe Pretraining of Deep Language Models in a Synthetic Pseudo-Language. Dokl. Math. 108 (Suppl 2), S494–S502 (2023). https://doi.org/10.1134/S1064562423701636
