1 Introduction

Parliamentary debates provide a platform for democratic discourse in which parliamentary representatives hold discussions, debate legislative matters, provide insights into democratic procedures and reflect the nature of democratic decision-making; as such, they are a fundamental part of democracy (Proksch & Slapin, 2014). At a time when global crises and political unrest demand accountability and transparency of legislative processes and parliamentary representatives, parliamentary corpora, consisting of transcripts of parliamentary debates and rich socio-demographic metadata, provide a comprehensive and openly accessible resource that supports research in disciplines such as history, linguistics and the social sciences. The corpora also support discourse analysis, which lends itself to unravelling the complexity of these debates, and offer a wide range of perspectives from which the debates and related phenomena can be studied using different approaches. Examples of such research include examining the language use of speakers from political parties of different orientations, politics (e.g., populism vs. non-populism) and ideologies (Truan, 2019; Fišer et al., 2023; Rheault & Cochrane, 2020), or developing tools to analyse more implicit aspects of speech, such as sentiment and emotion (Abercrombie & Batista-Navarro, 2018), to name but a few; a more detailed overview is given by Skubic and Fišer (2022).

In this paper, we present the ongoing work on the Slovenian parliamentary corpus siParl, a corpus of transcripts of Slovenian parliamentary debates, which in its current version covers the period between 1990 and 2022. The corpus has a long history of development and has formed the basis for the development of various related initiatives and encoding guidelines (which are explained in more detail in Sect. 2). The transcripts not only cover over 30 years of Slovenian parliamentary history, but are also rich in metadata on speakers and parliamentary organisations (political parties, parliamentary groups and other legislative bodies) as well as speech content. The corpus is encoded according to the Parla-CLARIN recommendations for encoding parliamentary debates (Erjavec & Pančur, 2019a, 2022), including the encoding of interruptions and other breaks in the speeches. Since the development of the siParl corpus is closely linked to the compilation of its “sister” corpus developed within the ParlaMint project, the ParlaMint-SI corpus, this paper has two aims: to present the history and compilation process of the siParl corpus, focusing on the process of encoding the data, and to compare the two closely related corpora, especially with regard to the differences in data encoding and the reasons for these differences.

The rest of the paper is structured as follows: Sect. 2 outlines the history of the development of the siParl corpus and presents its influence on related initiatives. Section 3 then describes in detail the compilation process (data collection process, content, structure, encoding and linguistic annotation) of the current version of the siParl corpus. Section 4 provides an overview of the entire corpus and offers a comparison between the siParl and ParlaMint-SI corpora, and finally, Sect. 5 discusses the current state of the siParl corpus and presents both short- and long-term plans for its future extension and enrichment.

2 Development of Slovenian parliamentary corpora and related initiatives

The development of Slovenian parliamentary corpora has a fairly long tradition and has had a significant influence on the development of various related initiatives, which we review in this section.

The development began in 2016 when a project to digitise existing parliamentary documents was launched to provide historians with better access to the growing amount of digital historical sources in Slovenia (Pančur & Šorn, 2016). The corpus building was done following several principles, in particular multidisciplinarity (the corpus must be useful for as many disciplines as possible), inclusiveness (planning to include also documents other than parliamentary debates), collaboration (work of experts from different research fields), long-term funding (activities should be financed as part of the work of long-term research infrastructures), and open-science principles (FAIR accessibility).

The first result of the project was the corpus SlovParl 1.0 (Pančur et al., 2016), one of the first parliamentary corpora of the Slavic-speaking countries. It consists of 2.7 million words of parliamentary debates in the Chamber of Associated Labour of the Assembly of the Republic of Slovenia (1990 – 1992). This period was chosen because it is a particularly interesting time in Slovenian history, as it covers the period before, during and after Slovenia’s transition to independence in 1991 and the associated transition from socialism to democracy, which was also reflected in the structure and functioning of parliament. SlovParl 1.0 was encoded according to the Text Encoding Initiative Guidelines (TEI Consortium, 2020), originally using the TEI module for performative texts (TEI Drama), which was then transformed to use the elements of the TEI module for transcribed speech (TEI Speech) (Pančur, 2016) to better enable later linguistic annotations. The decision to use TEI Speech was later validated by other initiatives proposing TEI Speech for encoding parliamentary corpora (Truan & Romary, 2022).

This sample corpus, which covered only a small part of the available material, was the starting point for the compilation of the Slovenian parliamentary debates and provided a preview of the future close collaboration between historians and language technology experts to facilitate the content and technical aspects of the resource. Further versions soon followed:

  • SlovParl 2.0 (Pančur et al., 2017) added minutes of the other two chambers of the Assembly of the Republic of Slovenia for the same legislative periods from 1990 to 1992 (Pančur et al., 2018).

  • siParl 1.0 (Pančur et al., 2019), which was extended to 2018 to include the minutes of the National Assembly of the Republic of Slovenia from the first to the seventh legislative period (1992 – 2018), the minutes of the working bodies of the National Assembly and the minutes of the Council of the President of the National Assembly from the second to the seventh legislative period (1996 – 2018). In contrast to the two previous corpora, the added parliamentary debates reflect the modern Slovenian parliamentary system.

  • siParl 2.0 (Pančur et al., 2020) contains the same texts and the same coverage as siParl 1.0 but with improved manually checked metadata and new linguistic annotation processing with the then state-of-the-art tools (Pančur & Erjavec, 2020).

  • siParl 3.0 (Pančur et al., 2022) was expanded to include the minutes of the Slovenian parliament for the 8th legislative period (2018 – 2022), and thus covers 32 years (1990–2022). As this is the latest version of the siParl corpus, a detailed overview is provided in Sect. 4.

Version 3.0 of the corpus is supplemented by the minutes of the Slovenian parliament of the 8th legislative period. A new version, currently under development, will include the minutes of the Council of the President of the Slovenian Parliament and the minutes of the working bodies for the 8th legislative term.

In parallel with the release of siParl 1.0, the Parla-CLARIN recommendations for the encoding of parliamentary corpora (Erjavec & Pančur, 2019a, b, 2022) were published, which were largely based on the encoding of siParl. Parla-CLARIN contains best-practice guidelines and defines a schema based on the Text Encoding Initiative Guidelines (TEI Consortium, 2020). Later, a tutorial on the use of corpora in political discourse research (Fišer & Pahor de Maiti, 2020), which relied on siParl 1.0 as an empirical basis, revealed various shortcomings of the first version. These were corrected in the encoding and structure of siParl 2.0, which in turn led to a new, bilingual English-Slovenian version of the tutorial (Fišer & Pahor de Maiti, 2021), as well as to revisions of the Parla-CLARIN recommendations. The siParl 2.0 corpus was thus the first corpus to be fully Parla-CLARIN encoded, followed by parliamentary corpora from other countries, such as Iceland (Steingrímsson et al., 2020), Austria (Wissik, 2022), and Finland (Hyvönen et al., 2022).

Following the publication of Parla-CLARIN, the ParlaMint I project (2020 – 2021) (CLARIN ERIC, 2020) was launched to transform many existing national parliamentary corpora into a set of comparable and interoperable corpora. While these corpora used Parla-CLARIN as annotation guidelines, their encoding had to be significantly restricted to unify the corpora so that the same scripts could be used on any of them. The ParlaMint I project was completed with the publication of a set of 17 national corpora (Erjavec et al., 2022), including the Slovenian ParlaMint-SI corpus based on siParl 2.0. The second phase of the ParlaMint project (2022 – 2023) concluded with the publication of 29 corpora of parliamentary debates (Erjavec et al., 2023b) (including the linguistically annotated version (Erjavec et al., 2023a) and the machine-translated version (Kuzman et al., 2023)), again including ParlaMint-SI, this time based on siParl 3.0. Given the importance of the ParlaMint corpora and the interesting differences between siParl and ParlaMint-SI, a comparison of the two corpora is made in Sect. 4.1.

The aforementioned developments related to siParl focused on the international impact; however, recent work has also aimed to extend the coverage of Slovenian parliamentary corpora to the past, when Slovenia was part of the Kingdom of Yugoslavia between the First and Second World Wars, and part of the Austro-Hungarian Empire before the end of the First World War:

  • Parliamentary corpus of the first Yugoslavia yu1Parl 1.0: The corpus contains minutes of sessions of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939 (Kavčič et al., 2023b). The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. The Serbo-Croatian is typeset in the Cyrillic (Serbian) or Latin (Croatian) alphabet.

  • The Carniolan Provincial Assembly corpus Kranjska 1.0: the corpus of Carniolan Provincial Assembly meeting proceedings from 1861 to 1913 (Kavčič et al., 2023a). The documents are also bilingual, in Slovenian and German depending on the speaker, with the German typeset first in Gothic script (Kurrent) and later in Latin (Antiqua).

Both corpora were first OCR-processed and heuristics were used to extract titles, agenda, attending, start and end of the session, speakers and comments. Subsequently, the language was recognised at sentence level and an automatic linguistic annotation consisting of tokenisation, PoS tagging and lemmatisation was performed. Both corpora are encoded according to the Parla-CLARIN schema.

Finally, another resource was developed in connection with siParl, namely the Corpus of political party programs for the Slovenian parliamentary elections in 2022, Programi2022 (Polanič & Dobranić, 2022). The 19 programs of the corpus (approx. 300,000 words) were collected from the Internet, cleaned and automatically annotated at the same linguistic levels as above. The metadata for each text consists of the name of the party, its URL and the URL of the program. With this corpus, the first step has been taken in gathering not only parliamentary proceedings but also relevant accompanying documents.

3 Corpus compilation

The siParl corpus compilation procedure consists of several steps, as described in the following subsections, starting with the data collection and cleaning process, the structuring and encoding of the corpus, and the linguistic annotation of the texts.

The data for version 3.0 of the corpus was collected from the Assembly website in a semi-automated process: transcripts were manually downloaded and then processed with XSLT scripts, first, to remove unnecessary HTML tags and characters and, second, to convert the HTML into the Parla-CLARIN encoding. For the latest version, which is currently still under development, the scraping process has been fully automated using the Python library Selenium.

The conversion from HTML to Parla-CLARIN has to rely not only on the HTML elements of the sources, but also to a large extent on regular expressions that identify textual patterns, which are then converted into XML elements, for example:

  • speakers (e.g., MAG. MAJA HOSTNIK KALIŠEK:)

  • agenda items (e.g., 1. TOČKA DNEVNEGA REDA)

  • gaps (marked as “...” in the transcript)

  • individual speech utterances and segments (grouping texts in <font> HTML tags with appropriate speakers)

  • interruptions and comments (marked as text in /.../ or (...) brackets).
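The kind of pattern matching listed above can be illustrated with a minimal Python sketch. It is not the siParl conversion itself (which is implemented with XSLT scripts); the regular expressions, the function names `classify` and `extract_notes`, and the sample strings other than the two examples quoted in the list are assumptions made purely for illustration.

```python
import re

# Hypothetical patterns, written for illustration only; the actual siParl
# conversion uses its own XSLT scripts and rules.
SPEAKER = re.compile(r"^(?P<name>[A-ZČŠŽĆĐ][A-ZČŠŽĆĐ.\s-]+):\s*(?P<rest>.*)$")
AGENDA = re.compile(r"^(?P<num>\d+)\.\s+TOČKA DNEVNEGA REDA")
COMMENT = re.compile(r"/[^/]+/|\([^()]+\)")   # text in /.../ or (...) brackets
GAP = re.compile(r"\.\.\.|…")                 # omitted material in the transcript


def classify(line: str) -> str:
    """Label a transcript line as an agenda item, a speaker label or plain text."""
    line = line.strip()
    if AGENDA.match(line):
        return "agenda"
    if SPEAKER.match(line):
        return "speaker"
    return "text"


def extract_notes(text: str) -> list[str]:
    """Return bracketed interruptions and comments found in a stretch of text."""
    return COMMENT.findall(text)


print(classify("MAG. MAJA HOSTNIK KALIŠEK:"))   # speaker
print(classify("1. TOČKA DNEVNEGA REDA"))       # agenda
print(extract_notes("Hvala. /Aplavz./ Nadaljujemo (nerazumljivo)."))
```

Patterns of this kind are brittle by nature, which is consistent with the observation that clean-up and initial encoding accounted for most of the work.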

The clean-up and initial encoding of the transcription structures accounted for most of the work, as the transcripts contained numerous transcription errors as well as inconsistent HTML tags to mark individual structures. The metadata for the speakers and political parties was recorded manually and the list of speakers (<listPerson>) and the list of organisations (<listOrg>) were created. These are two of the building blocks of the XML corpus structure, a specialisation of the Parla-CLARIN recommendations (Pančur & Erjavec, 2020), which can be decomposed as follows:

  1. Corpus root (<teiCorpus>):
    (a) Metadata for the complete corpus (<teiHeader>):
      i. Bibliographic information on the complete corpus (<fileDesc>)
      ii. Taxonomies for organizations, committees, terms & meeting types, speaker types (<taxonomy>)
      iii. List of speakers for the entire corpus (<listPerson>)
      iv. List of organizations for the entire corpus (<listOrg>)
    (b) XIncludes for files with individual sessions for the entire corpus (<xi:include>)

  2. Term roots (for the different bodies covered by siParl) (<teiCorpus>):
    (a) Metadata for the term (<teiHeader>):
      i. Same structure as for the complete corpus, but including only information relevant to the term
    (b) XIncludes for files with individual sessions for the term (<xi:include>)

  3. Individual sessions, by session type and day (<TEI>):
    (a) Session metadata (<teiHeader>):
      i. Bibliographic information about the session (<fileDesc>)
      ii. Session agenda (<abstract>)
    (b) Session transcript (<text>):
      i. Bibliographic notes, e.g., session title, date, time, chair (<head>, <note>)
      ii. Speeches (<u>):
        A. Transcriber notes (<note>, <gap>, <vocal>, <kinesic>, <incident>)
        B. Paragraphs (<seg>)

The structure specifies the corpus and term roots at the same level, i.e., the two are independent. In other words, either the entire corpus or an individual term can be considered as a separate XML document. The reason for this structure is twofold. First, it is technically impossible to have term roots as sub-elements of the corpus root, since both contain lists of speakers and organisations as well as the taxonomies, and each speaker, organisation or taxonomy can only be defined once. Secondly, this structure allows us to easily update the corpus with new data, since a subcorpus for new data can be developed separately and integrated into the overall corpus when such a corpus is ready.
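The independence of corpus and term roots can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the identifiers (`demo.corpus`, `demo.term-1`), the file paths and the helper `build_root` are hypothetical, and the header is left empty, whereas a real root would carry the full <teiHeader> described above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XI_NS = "http://www.w3.org/2001/XInclude"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Serialise TEI as the default namespace and XInclude with the "xi" prefix.
ET.register_namespace("", TEI_NS)
ET.register_namespace("xi", XI_NS)


def build_root(root_id, session_files):
    """Build a <teiCorpus> root that pulls in session files via XInclude."""
    root = ET.Element(f"{{{TEI_NS}}}teiCorpus", {XML_ID: root_id})
    ET.SubElement(root, f"{{{TEI_NS}}}teiHeader")  # metadata omitted in this sketch
    for href in session_files:
        ET.SubElement(root, f"{{{XI_NS}}}include", {"href": href})
    return root


# Hypothetical file names: the term root and the corpus root are separate
# documents that reference the same session files.
term_files = ["term-1/session-001.xml", "term-1/session-002.xml"]
term_root = build_root("demo.term-1", term_files)
corpus_root = build_root("demo.corpus", term_files + ["term-2/session-001.xml"])

xml_text = ET.tostring(corpus_root, encoding="unicode")
print(xml_text)
```

Because neither root includes the other, the speaker and organisation lists in each header never end up in the same expanded document, so identifiers are not defined twice.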

As for the files and the directory structure, the corpus root and each term root correspond to a file. It should be noted that we have two roots for each term, depending on which body was in session, either the National Assembly (SDZ) or one of the working bodies of the National Assembly (SDT). Each session is also saved as a file, with the sessions of a term/body being grouped in a subdirectory.

The meetings are then differentiated according to their type and the date on which they took place. The types are regular, extraordinary or joint sessions, with extraordinary sessions being convened by the President of the National Assembly at the request of at least a quarter of the deputies or the President of the Republic and generally dealing with time-sensitive and urgent matters (Državni zbor Republike Slovenije, 2020a). Each meeting has an agenda containing the items to be discussed, which are discussed in the form of speeches (which can sometimes be interrupted by non-verbal content such as information on the outcome of the vote, actions such as applause, etc.). We illustrate some of the more interesting structures in the remainder of this section.

3.1 Corpus and term headers

The beginning of the corpus or term header contains its ID, the main title and the subtitles, which for the root corpus contain all included legislative bodies with Slovenian and English translations. The <meeting> element contains the information on all terms contained in the corpus that are referenced via the attribute @ana (e.g., ana=“#parl.term #SK.11”), as shown in Fig. 1.

Fig. 1 Example of the encoding of the corpus header beginning

Fig. 2 Example of a political party, its relations and coalition encoding

The header also contains taxonomies to provide controlled values (in both Slovenian and English, encoded in the <taxonomy> element) and to give them an identifier so that they can be linked to other elements. The corpus contains taxonomies with definitions for legislative terms and a taxonomy for speaker types, while the linguistically annotated version also contains other taxonomies related to the linguistic annotations.

The header also encodes the list of all organisations represented in the debates – by “organisations” we mean primarily political parties, but also other legislative bodies such as Parliament (or Assembly) and working bodies. The organisations are encoded in the element <listOrg>, and each organisation contains several important pieces of information: its identifier, which can be used as a reference, its full name and abbreviation (both in Slovenian and English), information about its existence (date from and to, or only the date from if the organisation still exists or its existence never completely ended because it was merged into another one) and, if available, a link to the Wikipedia pages. The additional element <listRelation> contains a list of organisational changes (mainly of political parties) that have taken place, mostly relating to the changes of their names, mergers with another party, and parties that were in the governing coalition (as well as changes in these coalitions during the duration of the government). The encoding of the two elements is illustrated in Fig. 2, where the first part shows the encoding of a political party ZaSLD (Alliance of Social Liberal Democrats) in <listOrg> and the second part shows its renaming to ZaAB (Alliance of Alenka Bratušek) in 2016 and then another renaming to SAB (Party of Alenka Bratušek), which then became part of the coalition in the 13th Government, all noted in the element <listRelation>.
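How renaming relations of this kind could be followed programmatically is sketched below. The XML fragment is invented for illustration and does not reproduce Fig. 2: the party identifiers, names and dates are placeholders, and the attribute values (`name="renaming"`, with the new name in @active and the old one in @passive) are assumptions about the encoding rather than a statement of the siParl schema.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; identifiers, names and dates are placeholders.
SAMPLE = """
<listOrg xmlns="http://www.tei-c.org/ns/1.0">
  <org xml:id="party.A" role="political_party">
    <orgName full="yes">Party A</orgName>
  </org>
  <org xml:id="party.B" role="political_party">
    <orgName full="yes">Party B</orgName>
  </org>
  <org xml:id="party.C" role="political_party">
    <orgName full="yes">Party C</orgName>
  </org>
  <listRelation>
    <relation name="renaming" active="#party.B" passive="#party.A" when="2001-01-01"/>
    <relation name="renaming" active="#party.C" passive="#party.B" when="2002-01-01"/>
  </listRelation>
</listOrg>
"""


def successor_chain(list_org, org_id):
    """Follow renaming relations from org_id to its most recent successor."""
    renamed_to = {}
    for rel in list_org.iter(f"{TEI}relation"):
        if rel.get("name") == "renaming":
            old = rel.get("passive").lstrip("#")
            new = rel.get("active").lstrip("#")
            renamed_to[old] = new
    chain = [org_id]
    # Stop at the last known name; the second condition guards against cycles.
    while chain[-1] in renamed_to and renamed_to[chain[-1]] not in chain:
        chain.append(renamed_to[chain[-1]])
    return chain


list_org = ET.fromstring(SAMPLE)
print(successor_chain(list_org, "party.A"))
```

Resolving such chains lets an analysis treat speeches made under earlier and later party names as belonging to the same organisation.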

Fig. 3 Example of speaker encoding

Another essential part of the headers, which is contained in the element <listPerson>, is the list of speakers, which consists of information about the individual speakers in the debates. In addition to the first and last name and ID, the list also contains information about the speaker’s gender, date of birth and place of residence (if available), information about the speaker’s affiliation with certain organisations (especially political parties) and whether the person was a member of parliament (MP) or a minister, as shown in Fig. 3.

3.2 Individual sessions

Individual session transcripts begin with the root element <TEI>, which initially contains the metadata of the session in the element <teiHeader>, which also contains the list of agenda items in the element <profileDesc>. These agenda items in turn not only serve as a table of contents for the session but also link individual segments with speeches on that specific agenda item, as shown in Fig. 4.

Fig. 4 Example of encoding and linking of agenda-related segments

Fig. 5 Example of speech encoding

The TEI header is then followed by the transcription of the debate (enclosed by the <text> element), which may also contain annotations or headings in addition to the actual speeches. Speeches are encoded as utterances (<u>), which also contain a reference to the ID of the speaker and the speaker type (chair or regular speaker), while the speech itself is encoded in segments (<seg>), each of which corresponds to a paragraph in the transcription. The transcriptions may also contain transcriber notes, which are encoded either as a general element <note>, possibly qualified by an attribute @type, or as more specific TEI elements for different types of interruptions (<vocal>, <kinesic>, <incident>) as well as indicators for missing parts of the transcriptions (<gap>). An example of speech encoding is shown in Fig. 5.
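A reader of such a transcript might extract utterances, their paragraphs and the transcriber notes along the following lines. The sample XML is invented and does not reproduce Fig. 5; the speaker identifiers, the @ana values `#chair` and `#regular`, and the function names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
NOTE_TAGS = {"note", "gap", "vocal", "kinesic", "incident"}

# Invented sample transcript; IDs and attribute values are placeholders.
SAMPLE = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div>
      <u who="#SpeakerA" ana="#chair">
        <seg>Spoštovani, začenjamo sejo.</seg>
        <seg>Besedo ima poslanec. <gap reason="inaudible"/></seg>
      </u>
      <u who="#SpeakerB" ana="#regular">
        <seg>Hvala za besedo. <kinesic><desc>Aplavz.</desc></kinesic></seg>
      </u>
    </div>
  </body>
</text>
"""


def spoken_text(seg):
    """Text of a segment without the content of embedded transcriber notes."""
    parts = [seg.text or ""]
    for child in seg:
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def speeches(text_el):
    """Collect speaker, speaker type, paragraphs and note types per utterance."""
    result = []
    for u in text_el.iter(f"{TEI}u"):
        notes = [el.tag.split("}")[1] for el in u.iter()
                 if el.tag.split("}")[1] in NOTE_TAGS]
        result.append({
            "who": u.get("who", "").lstrip("#"),
            "role": u.get("ana", "").lstrip("#"),
            "segments": [spoken_text(seg) for seg in u.iter(f"{TEI}seg")],
            "notes": notes,
        })
    return result


result = speeches(ET.fromstring(SAMPLE))
for speech in result:
    print(speech)
```

Keeping the note elements separate from the spoken text makes it possible to analyse what was said and what happened in the chamber independently of each other.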

3.3 Linguistic annotation

In addition to the described “plain-text” version of the siParl corpus, the corpus is also available in a linguistically annotated version. This version is identical to the “plain-text” one in terms of content, metadata and encoding, but adds linguistic annotations to the segments. The added annotation consists of text tokenisation, sentence segmentation, part-of-speech tagging, lemmatisation, dependency parsing and named-entity recognition.

Fig. 6 Example of linguistic annotation of the sentence “Greetings and thank you for the floor”

The text in the siParl 3.0 corpus was annotated with version 1.2.1 of the CLASSLA-StanfordNLP pipeline (Ljubešić & Dobrovoljc, 2019) for standard Slovenian, which reports an F1 score of 99.17 for lemmatisation, 97.38 for XPOS (MULTEXT-East) tagging and 98.69 for UPOS tagging, and an LAS score of 92.05. The models used in this version of the pipeline have an estimated F1 score of about 97.06 for the morphosyntactic XPOS annotations (Ljubešić & Krsnik, 2022b), an estimated F1 score of 99.7 for the lemma annotations (Ljubešić & Krsnik, 2022a) and an estimated LAS score of 92.7 for the UD parser.

Since then, several new versions of CLASSLA-Stanza have been released. The latest version (2.1, at the time of writing) has an F1 score of 97.08 for morphosyntactic tagging, 98.97 for lemmatisation and 90.57 for dependency parsing (Terčon & Ljubešić, 2023a). The models for standard Slovenian included in the pipeline show an estimated F1 score of 98.27 for the morphosyntactic XPOS annotations (Ljubešić et al., 2023), an estimated F1 score of 99.11 for the lemma annotations (Terčon et al., 2023) and an estimated LAS score of 91.11 for the UD parser (Terčon & Ljubešić, 2023b). While the new pipeline shows slightly lower (but still very high) accuracy than the previous version, its models were trained on the more extensive, expert-curated training dataset SUK 1.0 for the linguistic annotation of modern standard Slovene (Arhar Holdt et al., 2023, 2024), and we intend to use the new pipeline to re-annotate the upcoming version of the siParl corpus. However, the reported accuracy of the pipelines and models might differ from their performance on this particular corpus; a comprehensive evaluation would therefore be useful, and we plan to introduce one for the next version of the siParl corpus.
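To make the parsing metric concrete, the following sketch computes unlabelled and labelled attachment scores (UAS and LAS) for a manually checked sample. It assumes that the gold and predicted analyses share the same tokenisation, in which case the score reduces to a simple proportion of tokens; the function name and the toy data are hypothetical and are not taken from any siParl evaluation.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two equally long lists of (head, deprel) pairs.

    UAS counts tokens whose head is correct; LAS additionally requires the
    dependency label to match. Identical tokenisation is assumed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las


# Toy four-token sentence: one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 0.75, LAS = 0.50
```

A corpus-specific evaluation of the kind planned above would apply such a measure to a manually corrected sample of parliamentary speech rather than to the benchmark data on which the published scores are based.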

The excerpt in Fig. 6 shows the linguistic annotation of the sentence “Greetings and thank you for the floor”.

4 Overview of siParl 3.0

The siParl 3.0 corpus is available for download under the CC BY licence, as well as via the CLARIN.SI concordancers (NoSketch Engine log-in, NoSketch Engine and KonText).

Table 1 Basic statistics of the corpus

The overview of the statistics for the corpus is presented in Table 1.

Table 2 Basic quantitative characteristics of the individual legislative periods, containing the legislative periods covered (Period), the number of days on which sessions were held per individual legislative body (Days) and the total number of days for the entire period (Days-Term total), as well as the number of political parties (PP), other legislative bodies or organisations (Org.), speakers (Speakers) and Members of Parliament (MPs)

Overall, the corpus contains sessions of the following entities:

  • Assembly of the Republic of Slovenia (Skupščina Republike Slovenije), 1990–1992, 11th legislative period;

  • National Assembly of the Republic of Slovenia (Državni zbor Republike Slovenije), 1992–2022, 1st–8th legislative period;

  • Working bodies of the National Assembly of the Republic of Slovenia (Delovna telesa Državnega zbora Republike Slovenije), 1996–2018, 2nd–7th legislative period;

  • Council of the President of the National Assembly (Kolegij predsednika Državnega zbora), 1996–2018, 2nd–7th legislative period.

The Slovenian Parliament consists of two chambers, with the National Assembly being the highest representative and legislative body of the Republic of Slovenia. It exercises the legislative function and adopts the most important legal acts. In addition, the National Assembly performs electoral and supervisory functions and consists of 90 deputies, two of whom represent the Italian and Hungarian ethnic groups.

Working Bodies of the National Assembly: In addition to the plenary sessions of the National Assembly, the corpus also contains the meetings of the National Assembly’s working bodies, which were set up to monitor the situation in individual areas, prepare political decisions in these areas and examine proposals for laws or legislative acts. When the National Assembly is not in plenary session, MPs work in these working bodies (Državni zbor Republike Slovenije, 2020a), which include committees and commissions for specific areas (e.g., the Committee of Foreign Affairs or the Commission for National Communities) as well as committees of inquiry.

Council of the President of the National Assembly: The Council of the President of the National Assembly is an advisory body to the President of the National Assembly and also has decision-making authority in certain cases. It consists of the President and the Vice-Presidents of the National Assembly, the chairmen of the parliamentary groups and the deputies of the ethnic groups (i.e., the deputies of the Italian and Hungarian ethnic groups).

Assembly of the Republic of Slovenia: Finally, the corpus includes the sessions of the 11th legislative period of the Assembly of the Republic of Slovenia, which represents a very specific and important historical context for the development of the modern parliamentary system in Slovenia, which was part of Yugoslavia from 1945 to 1991 and in which the parliament reflected the socialist system. The first multi-party elections were held in April 1990, and in 1992 the members of the Assembly adopted the new Constitution, which formally ended the era of the Socialist Assembly of Slovenia and introduced the new classical parliament. The sessions of the Parliament therefore offer a unique insight into the period before, during and after Slovenia’s independence (Pančur et al., 2018).

Table 2 contains a quantitative breakdown for each legislative period. The first row contains the information for each legislative period and is further subdivided to show information for each legislative body in the corpus. The first column contains all legislative bodies, the second column gives the beginning and end of the respective legislative period, while the next three columns provide an overview of the number of session days and the total number for each legislative period. In addition to the time-related information, the next four columns provide information on the number of main actors in the parliamentary debates: the number of political parties, the number of organisations, the number of all speakers (regardless of their role) and the number of MPs. It should be noted that the first legislative body, the SSK11 (Assembly), is structured quite differently from the modern National Assembly and consists of three individual legislative bodies or chambers, shown in the Org. column: the Chamber of Associated Labour, the Chamber of Municipalities, and the Socio-Political Chamber. All other terms have either one organisation (National Assembly) or multiple organisations (Working bodies). In addition, due to the different electoral system, the number of deputies (or, as they were called, delegates) was not 90, as is usual in the modern Slovenian parliament, but 242 (Državni zbor Republike Slovenije, 2020b).

4.1 siParl and ParlaMint-SI

As mentioned in Sect. 2, the ParlaMint project was launched to transform existing national parliamentary data into comparable and interoperable resources (CLARIN ERIC, 2020).

Table 3 Main characteristics of the siParl 3.0 and ParlaMint-SI 4.0 corpus

The Parla-CLARIN recommendation formed the basis for the encoding of these corpora, although the encoding schema was considerably restricted in order to make the corpora maximally interoperable (Erjavec et al., 2022). The source of the Slovenian ParlaMint corpus, i.e., ParlaMint-SI, was siParl, so the two are closely related. As mentioned in Sect. 1, one of the aims of the paper is also to compare the siParl and ParlaMint-SI corpora. Table 3 shows the main differences between the latest versions of the two corpora.

Although the corpora differ in terms of their scope (both in terms of size and period covered), they are relatively similar. However, there are also some differences between the corpora in terms of encoding and structure; we present some of the most notable examples.

Structure: siParl preserves the structure based on terms and legislative bodies (as explained in Sect. 3), while ParlaMint-SI structures the corpus, like all ParlaMint corpora, on a yearly basis. This is mainly because the siParl structure would not be useful for comparison, as the legislative periods of different countries/parliaments naturally do not coincide. The year-based structure therefore allows a comparison and divides the mentioned legislative periods (and the sessions in these periods) into specific years. The comparison between different legislative periods in ParlaMint-SI is reasonably maintained by adding metadata about the legislative period to which each session belongs.
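The difference between the two organising principles can be shown with a small Python sketch that groups the same session records once by term and once by year. The records, term labels and dates are invented for illustration and do not correspond to actual sessions in either corpus.

```python
from collections import defaultdict
from datetime import date

# Hypothetical session records; identifiers, terms and dates are invented.
SESSIONS = [
    {"id": "s1", "term": "term-1", "date": date(2000, 11, 7)},
    {"id": "s2", "term": "term-1", "date": date(2001, 3, 14)},
    {"id": "s3", "term": "term-2", "date": date(2001, 10, 2)},
]


def group_sessions(sessions, key):
    """Group session IDs by the value that `key` returns for each record."""
    groups = defaultdict(list)
    for session in sessions:
        groups[key(session)].append(session["id"])
    return dict(groups)


# siParl-style grouping: one subcorpus per legislative term.
by_term = group_sessions(SESSIONS, lambda s: s["term"])

# ParlaMint-style grouping: one directory per calendar year.
by_year = group_sessions(SESSIONS, lambda s: s["date"].year)

# The term stays available as metadata on each session, so a year that spans
# two terms can still be split by term when needed.
terms_in_2001 = {s["term"] for s in SESSIONS if s["date"].year == 2001}

print(by_term)
print(by_year)
print(sorted(terms_in_2001))
```

In this toy data the year 2001 contains sessions from two terms, which is exactly the situation in which the per-session term metadata is needed.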

Encoding: The speeches (including interruptions and other structures belonging to speech, such as utterances or segments) are mostly encoded in the same way, but there are some differences between the two corpora. These are mostly minor, such as the different ways of encoding speakers and organisations (some tags present in siParl are not used in ParlaMint and vice versa):

  1. 1.

    Unknown speakers: siParl includes the use of unknown speakers (by inserting the attribute @who, with the value “unknown”), while for unknown speakers in ParlaMint this attribute is not used, which then also gives the more accurate number of known speakers in the corpus statistics, while in siParl the status “unknown” is mainly used to obtain information about the speeches of unknown speakers, which can be important information for researchers in historiography (for which siParl was originally compiled). However, this will be examined in more detail in the next version, as the number of speakers is not readily apparent from this tag.

  2. 2.

    Political parties and parliamentary groups: Heidar and Koole (2000) define the parliamentary group as an “organised group of persons of a representative body elected either under the same party label or under the label of different parties that do not compete with each other in elections”. In the Slovenian parliament, most parliamentary groups coincide 1:1 with the corresponding political party, but this is not always the case. Moreover, there are sometimes changes in the composition of parliamentary groups, which makes accurately tracking and encoding this distinction a difficult task. In ParlaMint-SI, the attributes role="politicalParty" and role="parliamentaryGroup" were implemented to try to encode this distinction. For siParl, the @role="political_party" was used to denote the role of the organisation, as the corpus also includes parties/speakers outside the National Assembly or even outside individual political parties.

  3. Organisations: Since siParl contains minutes of the working bodies, these are also listed as individual organisations and carry the additional attribute @ana="#parla.committee". siParl also encodes one organisation of a different kind: the coalition DEMOS (Democratic Opposition of Slovenia), a coalition of several political parties that won the first democratic multi-party elections and formed the first democratic government in Slovenia (see footnote 12). Although it was a coalition of parties in the first government, it also acted as a kind of independent political party. It was therefore included in the <listOrg> with the value “coalition” in the attribute @role, a value that was not, and could not have been, anticipated in the ParlaMint schema.

  4. Coalition and opposition: As mentioned in Sect. 3.1, we use the <listRelation> to capture organisational changes of political parties, one of which is the coalition status of a party. siParl records only membership of the coalition, not of the opposition: the parties in the coalition are known from the signed coalition agreement, while the opposition is more or less implicit, since it is not necessarily formally constituted. ParlaMint-SI, however, records both coalition and opposition status, which we checked against the official sources published by the government (e.g., the annual reports on the work of the Assembly for each year and the overview of the mandate).
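
The practical effect of the first three differences on corpus processing can be sketched in Python. The two XML fragments below are heavily simplified, hypothetical stand-ins for the siParl and ParlaMint-SI encodings and are not valid TEI documents; the identifiers (SpeakerA, party.A, group.A, coal.DEMOS) are invented. The attribute names and values (@who, "unknown", role="political_party", role="politicalParty", role="parliamentaryGroup", "coalition") follow the description above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Simplified, hypothetical fragment in the siParl style.
SIPARL_SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listOrg>
    <org xml:id="party.A" role="political_party"/>
    <org xml:id="coal.DEMOS" role="coalition"/>
  </listOrg>
  <u who="#SpeakerA">...</u>
  <u who="unknown">...</u>
</TEI>
"""

# Simplified, hypothetical fragment in the ParlaMint-SI style.
PARLAMINT_SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listOrg>
    <org xml:id="party.A" role="politicalParty"/>
    <org xml:id="group.A" role="parliamentaryGroup"/>
  </listOrg>
  <u who="#SpeakerA">...</u>
  <u>...</u>
</TEI>
"""


def speaker_stats(root):
    """Return (set of known speaker ids, number of unattributed utterances).

    Handles both conventions: who="unknown" and a missing who attribute.
    """
    known, unattributed = set(), 0
    for u in root.iter(TEI + "u"):
        who = u.get("who")
        if who is None or who == "unknown":
            unattributed += 1
        else:
            known.add(who.lstrip("#"))
    return known, unattributed


def role_counts(root):
    """Count organisations by the value of their role attribute."""
    return Counter(org.get("role") for org in root.iter(TEI + "org"))


siparl = ET.fromstring(SIPARL_SAMPLE)
parlamint = ET.fromstring(PARLAMINT_SAMPLE)
```

A script that counts distinct @who values without this check would treat “unknown” as one additional speaker in the siParl-style fragment, which is the kind of discrepancy in speaker statistics noted in point 1.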

As shown, the two corpora are relatively similar, but differ in their logic and purpose. The main goal of ParlaMint-SI is to make national parliamentary corpora as interoperable, standardised and comparable as possible, while preserving as many features of the parliamentary system as possible, although the more restrictive encoding of the ParlaMint schema can sometimes hinder this. In contrast, siParl focuses on preserving the features of the Slovenian parliamentary system by using the more flexible Parla-CLARIN encoding scheme. These corpora should not be seen as duplicates, but as two sides of the same coin, allowing researchers from different disciplines to leverage their strengths for a variety of parliamentary data analyses.

5 Conclusions

The paper presented the Slovenian parliamentary corpus siParl, the latest version of which contains parliamentary debates from the period before and during Slovenia’s move to independence in 1991 and up to 2022 and includes not only the plenary sessions of the National Assembly but also the debates of various legislative bodies. The creation process of the corpus had a significant impact on other developments, in particular on the Parla-CLARIN recommendations for encoding of parliamentary debates (and thus indirectly also on the encoding of the ParlaMint corpora) as well as on related Slovenian parliamentary resources.

For siParl, the short-term plan for the next version is to add the minutes of the working bodies of the 8th legislative period (currently under development) to complete the missing part of the corpus, and to extend the corpus with the current 9th legislative period. We then plan to re-annotate the whole corpus linguistically with the latest version of the CLASSLA-Stanza models for Slovenian, as described in Sect. 3.3, which would ensure the consistency and reliability of the corpus and avoid possible discrepancies between sections annotated with different tool versions. In addition to the expanded data, the next version will also include an updated data processing pipeline and a fully automated data collection procedure. We also plan to add automatically assigned sentence-level sentiment annotations to siParl, computed with a model trained on ParlaSent 1.0 (Mochtak et al., 2023), a recently released manually annotated sentiment dataset with examples of parliamentary debates in Bosnian, Croatian, Czech, English, Serbian, Slovak and Slovenian. After the release of version 4, we plan to update siParl regularly with current debates and at the same time to supplement and improve the metadata. Further work on the transcription notes would also make the corpus more informative, e.g., by explicitly annotating voting results and linking them to the text being voted on. We also hope to include additional parliamentary material in siParl, e.g., documents accompanying individual sessions, and additional information from assembly websites or legislation.

In the longer term, we would like to complete the historical Slovenian parliamentary corpora. There already exist the Kranjska corpus, from the time when Slovenia was a province of Austria-Hungary, and the yu1Parl corpus, from the time when it was part of the Kingdom of Yugoslavia, but there is so far no corpus covering the period from World War II to 1991, i.e., the time of the Socialist Republic of Yugoslavia. Adding it would largely complete the historical data. This collection of corpora would be further improved by adding more metadata, by better OCR methods and post-processing, by more sophisticated linguistic processing (including modernisation and machine translation into contemporary Slovenian), and by merging the corpora with siParl so that they could be distributed and, most importantly, analysed as one corpus.