Simplicity matters: user evaluation of the Slovene reference corpus

Abstract

The latest reference corpus of written Slovene, the Gigafida corpus, was created as part of the ‘Communication in Slovene’ project. In the same project, a web concordancer was designed for the broadest possible use and tailored to the needs and abilities of user groups such as translators, writers, proofreaders and teachers. Two years after the corpus was published within the new tool, its features were assessed by the users. With an average rating of 4.36 on a scale from 1 to 5 (1 = I strongly disagree, 5 = I strongly agree), the results indicate that most survey participants agreed or strongly agreed with positive statements about the new implementations (e.g. “The corpus results are displayed in a clear manner”). This is a considerable improvement in user experience over the previous reference corpus of Slovene, i.e. the FidaPLUS corpus within the ASP32 concordancer (rated 3.67). In the user feedback, the simplicity of the search options and the clarity of the interface are highlighted as the main advantages, while advanced visualizations of corpus data and improved search for word phrases are suggested for future development. The evaluation also highlighted some relevant user habits, such as not taking the time to learn about the tool systematically before starting to use it. The findings will be implemented in future editions of the Gigafida corpus, but they are relevant to any project that aims to facilitate a wider use of reference corpora and corpus-based resources.

Notes

  1. Information about the project is available at http://eng.slovenscina.eu/.

  2. Since the Slovene declaration of independence in 1991, five large-scale corpora of written Slovene have been compiled: FIDA (in 2000), FidaPLUS (2006), Gigafida (2012), Beseda (2000), and Nova Beseda (2011). As the names suggest, they represent two different series of corpora: the first three were built by consortiums of research institutions, and the last two were compiled by the Fran Ramovš Institute of Slovene Language (for a more detailed description of Gigafida see Logar 2017, also Gorjanc 2006; Logar Berginc and Krek 2012). The most recently published corpora are the ones currently in use.

  3. For the linguistic community, Gigafida was also made freely available in the NoSketch Engine corpus analysis tool (https://www.clarin.si/noske/; Erjavec 2013) and under licence in the Sketch Engine software (Kilgarriff et al. 2014). Although no data are available on Gigafida usage within these two specialised tools, the initial log analyses of the SSJ concordancer suggest that the project website http://www.gigafida.net/ is the default corpus entry point for a large number of users. For example, 16,244 queries were recorded in August 2014 (when the survey was launched), i.e. an average of 524 queries per day. Given that queries were only recorded for users who had accepted the cookie consent, the actual number of SSJ concordancer users is presumed to be much higher.

  4. Among the target user groups, ‘Gigafida’ is typically perceived as an entity comprising annotated texts, the corresponding concordancer, and the web user interface. To avoid confusing the participants, only the term ‘corpus’ was used in the survey.

  5. The questionnaire was compiled using the freely available tools at 1KA, OneClick Survey: http://english.1ka.si/.

  6. The estimated time for completing the survey (also stated in its Introduction) was 10–15 min, placing it in the category of medium-long surveys. According to the Basic Recommendations of the 1KA survey tool (How long should my survey be?), this meant that “in addition to interesting topics, respondents require an additional motivation/…/, encouragement or incentive”. In our case, we relied upon the motive of interest, acknowledging that the extensive estimated time may discourage users from participating (we return to this question in Sect. 4). It turned out, however, that the average completion time for valid responses was 6 min and 5 s. This substantial reduction compared with the estimated time was presumably caused by the participants’ omission of non-obligatory open format questions.

  7. The previous questionnaire was very similar to the one described in this paper in terms of content, length and complexity (Kosem 2012; Arhar 2009). Naturally, as each of the evaluations focused on the corresponding concordancer and interface, certain questions differed.

  8. In the first evaluation, additional effort was dedicated to promoting the survey among university students, while for the new survey there was intentionally no focused recruiting of any user group.

  9. As a reminder, an additional short explanation of each of the listed features was provided in the survey.

  10. One such example is a user survey on the usefulness of different genres included in the Corpus of Contemporary Arabic, conducted among language teachers and language engineers (Al-Sulaiti and Atwell 2006: 19–25).
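
The daily average reported in note 3 can be checked with simple arithmetic. The following is a minimal sketch in Python, assuming that the average was taken over all 31 calendar days of August 2014 (the note does not state the divisor explicitly); the function and variable names are illustrative only and do not come from the project's log analysis.

```python
import calendar

# Total number of recorded queries in August 2014, as reported in note 3.
QUERIES_AUGUST_2014 = 16_244


def average_per_day(total_queries: int, year: int, month: int) -> float:
    """Return the mean number of queries per calendar day of the given month.

    Assumption: every calendar day of the month counts towards the average.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    return total_queries / days_in_month


daily_average = average_per_day(QUERIES_AUGUST_2014, 2014, 8)
print(daily_average)  # 524.0
```

Under this assumption the result matches the reported figure of 524 queries per day.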

References

  1. Agarwal, R., & Venkatesh, V. (2002). Assessing a firm’s web presence: A heuristic evaluation procedure for the measurement of usability. Information Systems Research, 13(2), 168–186.

  2. Al-Sulaiti, L., & Atwell, E. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(1), 1–36.

  3. Arhar, Š. (2009). Uporabniška evalvacija korpusa FidaPLUS: zasnova vprašalnika, prvi rezultati. In M. Stabej (Ed.), Infrastruktura slovenščine in slovenistike (pp. 19–26). Ljubljana: Znanstvena založba Filozofske fakultete.

  4. Arhar, Š., Gorjanc, V., & Krek, S. (2007). FidaPLUS corpus of Slovenian: The new generation of the Slovenian reference corpus: Its design and tools. In M. Davies (Ed.), Proceedings of the corpus linguistics conference CL2007 (pp. 1–12). Birmingham: University of Birmingham.

  5. Arhar Holdt, Š., Kosem, I., & Gantar, P. (2017). Corpus-based resources for L1 teaching: The case of Slovene. In A. Marcus-Quinn & T. Hourigan (Eds.), Handbook on digital learning for K-12 schools (pp. 91–113). Berlin: Springer.

  6. Bryman, A. (2012). Social research methods. Oxford: Oxford University Press.

  7. Erjavec, T. (2013). Slovene corpora for corpus linguistics and language technologies. In K. Gajdošová & A. Žáková (Eds.), Proceedings of the seventh international conference SLOVKO 2013 (pp. 51–62). Bratislava: Slovenská akadémia vied.

  8. Erjavec, T., Fišer, D., Krek, S., & Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In N. Calzolari, et al. (Eds.), Proceedings of the 7th international conference on language resources and evaluation (pp. 1806–1809). Paris: ELRA.

  9. Flowerdew, L. (2009). Applying corpus linguistics to pedagogy: A critical evaluation. International Journal of Corpus Linguistics, 14(3), 393–417.

  10. Frankenberg-Garcia, A. (2012). Raising teachers’ awareness of corpora. Language Teaching, 45(4), 475–489.

  11. Gorjanc, V. (2006). Tracking lexical changes in the reference corpus of Slovene texts. In A. Wilson, D. Archer, & P. Rayson (Eds.), Corpus linguistics around the world (pp. 91–100). Amsterdam, New York: Rodopi.

  12. Grčar, M., Krek, S., & Dobrovoljc, K. (2012). Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik. In T. Erjavec & J. Žganec Gros (Eds.), Proceedings of the eighth language technologies conference (pp. 89–94). Ljubljana: Institut “Jožef Stefan”.

  13. Groves, M. R., Fowler, F. J., Jr., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2004). Survey methodology. Hoboken, NJ: Wiley.

  14. Hardie, A. (2012). CQPweb—Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.

  15. Hewson, C., Vogel, C., & Laurent, D. (2016). Internet research methods. Los Angeles: Sage.

  16. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., et al. (2014). The sketch engine: Ten years on. Lexicography, 1(1), 7–36.

  17. Kilgarriff, A., Rundell, M., & Dhonnchadha, E. U. (2006). Efficient corpus development for lexicography: Building the New Corpus for Ireland. Language Resources and Evaluation, 40(2), 127–152.

  18. Kosem, I. (2012). User-friendly interfaces for corpora of Slovene. Prace Filologiczne, 63, 167–180.

  19. Krek, S. (2012). The Slovene language in the digital age. Berlin, Heidelberg: Springer.

  20. Logar, N. (2017). Reference corpora revisited: Expansion of the Gigafida corpus. In V. Gorjanc, et al. (Eds.), Dictionary of modern Slovene: Problems and solutions (pp. 96–119). Ljubljana: Ljubljana University Press, Faculty of Arts.

  21. Logar Berginc, N., Grčar, M., Brakus, M., Erjavec, T., Arhar Holdt, Š., & Krek, S. (2012). Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: Gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko, Fakulteta za družbene vede.

  22. Logar Berginc, N., & Krek, S. (2012). New Slovene corpora within the communication in Slovene project. Prace Filologiczne, 63, 197–207.

  23. Pérez-Paredes, P., Sánchez-Tornel, M., & Calero, J. M. A. (2012). Learners’ search patterns during corpus-based focus-on-form activities: A study on hands-on concordancing. International Journal of Corpus Linguistics, 17(4), 482–515.

  24. Renouf, A., & Kehoe, A. (2013). Filling the gaps: Using the WebCorp Linguist’s Search Engine to supplement existing text resources. International Journal of Corpus Linguistics, 18(2), 167–198.

  25. Santos, D., & Frankenberg-Garcia, A. (2007). The corpus, its users and their needs: A user-oriented evaluation of COMPARA. International Journal of Corpus Linguistics, 12(3), 335–374.

  26. Soehn, J.-Ph., Zinsmeister, H., & Rehm, G. (2008). Requirements of a user-friendly, general-purpose corpus query interface. In A. Witt, et al. (Eds.), Proceedings of the LREC 2008 workshop ‘sustainability of language resources and tools for NLP’ (pp. 27–32). Paris: ELRA.

Acknowledgements

The resources described in this paper were funded within the national project ‘Communication in Slovene’ (2008–2013), financed by the European Social Fund and the Slovene Ministry of Education, Science and Sports (Grant No. 3311-08-986003). The evaluation was supported by the infrastructure programme (ARRS-I0-0051) at the Centre for Applied Linguistics (Trojina), and the reference corpus upgrade was funded by the Slovene Ministry of Culture (2015–2018) (Grant No. 33400-15-141007). The authors are also grateful to all reviewers for their very constructive input and comments.

Author information

Corresponding author

Correspondence to Nataša Logar.

About this article

Cite this article

Arhar Holdt, Š., Dobrovoljc, K. & Logar, N. Simplicity matters: user evaluation of the Slovene reference corpus. Lang Resources & Evaluation 53, 173–190 (2019). https://doi.org/10.1007/s10579-018-9429-8

Keywords

  • Reference corpus
  • Corpus concordancer
  • Gigafida
  • Usability assessment
  • User evaluation
  • User satisfaction