Managing, storing, and sharing long-form recordings and their annotations

Abstract

The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the peculiarities of the data, including their sensitivity and their volume. In this paper, we begin by outlining key problems related to the management, storage, and sharing of the corpora that emerge when using this technique. We continue by proposing a multi-component solution to these problems, specifically in the case of daylong recordings of children. As part of this solution, we release ChildProject, a Python package for performing the operations typically required by such datasets and for evaluating the reliability of annotations using a number of measures commonly used in speech processing and linguistics. This package builds upon an annotation management system, which allows the importation of annotations from a wide range of existing formats, as well as upon data validation procedures, which assert the conformity of the data, or, alternatively, produce detailed and explicit error reports. Our proposal could be generalized to populations other than children and beyond linguistics.

Data availability

All data used to produce the paper are provided along with the source.

Code availability

The present paper can be reproduced from its source, which is hosted on GIN at https://gin.g-node.org/LAAC-LSCP/managing-storing-sharing-paper. The ChildProject package is available on GitHub at https://github.com/LAAC-LSCP/ChildProject. A step-by-step tutorial to launch annotation campaigns on Zooniverse is published along with the source code at https://doi.gin.g-node.org/10.12751/g-node.k2h9az (Gautheron, 2021c). We provide scripts and templates for DataLad-managed datasets at http://doi.org/10.17605/OSF.IO/6VCXK (Gautheron, 2021b). We also provide a DataLad extension to extract metadata from corpora of long-form recordings (Gautheron, 2021a).

Notes

  1. In order to demonstrate how our proposal could foster reproducible research on long-form recordings of children, we have released the source code of the paper as well as the code used to build the figures in Sect. 4.

  2. https://github.com/homebankcode/.

  3. Replicability is typically defined as the effort to re-do a study with a new sample, whereas reproducibility relates to re-doing the exact same analyses with the exact same data. Reproducibility is addressed in another section.

  4. https://sites.google.com/view/aclewdid/home.

  5. For most ongoing research projects of which we are aware, there is no central annotation system; instead, annotators work in parallel on separate files. Some researchers may prefer to have the “final” version of the annotations in a merged format that represents the “current best guess”. For transparency and clarity, however, such merged formats will emerge at a secondary stage, with a first stage represented by independent files including information about the independent listeners’ judgments. Our package provides a solution that considers the current practice of working in parallel, but will adapt easily to alternative habits based on merged or collaborative formats.

  6. This grossly underestimates overall costs, because the best way to do any kind of field research is through maintaining strong bonds with the community and helping them in other ways throughout the year, not only during our visits [read more about ethical fieldwork in Broesch et al. (2020)]. A successful example of this is the UNM-UCSB Tsimane’ Project (http://tsimane.anth.ucsb.edu/), which has been collaborating with the Tsimane’ population since 2001. They are currently funded by a 5-year, 3-million US$ NIH grant (https://reporter.nih.gov/project-details/9538306).

  7. Many good features of Databrary have been highlighted in Soska et al. (2021).

  8. https://homebank.talkbank.org.

  9. https://databrary.org.

  10. https://osf.io.

  11. https://archive.mpi.nl/tla/, which holds a CLARIN certificate B.

  12. https://lscp.dec.ens.fr/en/research/teams-lscp/language-acquisition-across-cultures.

  13. We believe a reasonable unit of bundling is the collection effort, for instance a single field trip, a full bout of data collection for a cross-sectional sample, or a set of recordings done more or less at the same time in a longitudinal sample. Given the possibilities of versioning, some users may decide they want to keep all data from a longitudinal sample in the same dataset, adding to it progressively over months and years, to avoid having duplicate children.csv files. That said, given DataLad’s system of subdatasets (see Sect. 3.3), one can always define different datasets, each of which contains the recordings collected in subsequent time periods.

  14. https://childproject.readthedocs.io/en/paper/annotations.html.

  15. https://childproject.readthedocs.io/en/paper/samplers.html.

  16. https://childproject.readthedocs.io/en/paper/elan.html.

  17. https://childproject.readthedocs.io/en/paper/zooniverse.html.

  18. https://childproject.readthedocs.io/en/paper/processors.html.

  19. https://docs.babycloudlab.com/.

  20. https://childproject.readthedocs.io/en/paper/metrics.html.

  21. http://docs.datalad.org/en/stable/metadata.html.

  22. https://gin.g-node.org/LAAC-LSCP/managing-storing-sharing-paper.

  23. https://github.com/datalad-datasets/human-connectome-project-openaccess.

  24. https://git-annex.branchable.com/special_remotes/.

  25. https://childproject.readthedocs.io/en/paper/vandam.html.

  26. https://git-annex.branchable.com/special_remotes/.

  27. https://git-annex.branchable.com/encryption/.

  28. https://travis-ci.com/.

  29. https://docs.github.com/en/actions.

  30. https://github.com/LAAC-LSCP/datasets.

  31. https://gin.g-node.org/.

  32. https://gin.g-node.org/EL1000/EL1000.

  33. https://github.com/LAAC-LSCP/vandam-daylong-demo.

References

  • Bergelson, E., Warlaumont, A., Cristia, A., Casillas, M., Rosemberg, C., Soderstrom, M., Rowland, C., Durrant, S., & Bunce, J. (2017). Starter-aclew. https://doi.org/10.17910/B7.390, http://databrary.org/volume/390.

  • Bird, S. (2020). Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 3504–3519).

  • Boersma, P. (2006). Praat: Doing phonetics by computer. http://www.praat.org/.

  • Borne, K. D., & Zooniverse Team. (2011). The Zooniverse: A framework for knowledge discovery from citizen science data. In AGU Fall Meeting Abstracts.

  • Brase, J. (2010). Datacite—a global registration agency for research data. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.1639998.

  • Bredin, H. (2017). pyannote.metrics: A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association. http://pyannote.github.io/pyannote-metrics.

  • Broesch, T., Crittenden, A. N., Beheim, B. A., Blackwell, A. D., Bunce, J. A., Colleran, H., Hagel, K., Kline, M., McElreath, R., Nelson, R. G., et al. (2020). Navigating cross-cultural research: Methodological and ethical considerations. Proceedings of the Royal Society B, 287(1935), 20201245.

  • Casillas, M., Bergelson, E., Warlaumont, A. S., Cristia, A., Soderstrom, M., VanDam, M., & Sloetjes, H. (2017). A new workflow for semi-automatized annotations: Tests with long-form naturalistic recordings of children’s language environments. In Proc. Interspeech 2017 (pp. 2098–2102). https://doi.org/10.21437/Interspeech.2017-1418.

  • Casillas, M., & Cristia, A. (2019). A step-by-step guide to collecting and analyzing long-format speech environment (LFSE) recordings. Collabra: Psychology, 5(1), 24. https://doi.org/10.1525/collabra.209.

  • Christakis, D. A., Gilkerson, J., Richards, J. A., Zimmerman, F. J., Garrison, M. M., Xu, D., Gray, S., Yapanel, U., et al. (2009). Audible television and decreased adult words, infant vocalizations, and conversational turns: A population-based study. Archives of Pediatrics & Adolescent Medicine, 163(6), 554–558.

  • Cychosz, M., & Cristia, A. (2021). Using big data from long-form recordings to study development and optimize societal impact. OSF Preprints.

  • Cychosz, M., Romeo, R., Soderstrom, M., Scaff, C., Ganek, H., Cristia, A., Casillas, M., de Barbaro, K., Bang, J. Y., & Weisleder, A. (2020). Longform recordings of everyday life: Ethics for best practices. Behavior Research Methods, 52(5), 1951–1969. https://doi.org/10.3758/s13428-020-01365-9

  • The pandas development team. (2020). pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134.

  • Eglen, S. J., Marwick, B., Halchenko, Y. O., Hanke, M., Sufi, S., Gleeson, P., Silver, R. A., Davison, A. P., Lanyon, L., Abrams, M., Wachtler, T., Willshaw, D. J., Pouzat, C., & Poline, J. B. (2017). Toward standard practices for sharing computer code and programs in neuroscience. Nature Neuroscience, 20(6), 770–773. https://doi.org/10.1038/nn.4550

  • European Organization For Nuclear Research, OpenAIRE. (2013). Zenodo. https://doi.org/10.25495/7GXK-RD71, https://www.zenodo.org/.

  • ffmpeg Developers. (2021). ffmpeg tool. http://ffmpeg.org/.

  • Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619

  • Futaisi, N. A., Zhan, Z., Cristia, A., Warlaumont, A., & Schuller, B. (2019). VCMNet: Weakly supervised learning for automatic infant vocalisation maturity analysis. In 2019 International Conference on Multimodal Interaction, ACM. https://doi.org/10.1145/3340555.3353751.

  • Gautheron, L. (2021a). Datalad extension for child-centered in-situ recordings. https://doi.org/10.17605/OSF.IO/C2J5A, https://osf.io/c2j5a/.

  • Gautheron, L. (2021b). Datalad procedures for the management of long-form recordings. https://doi.org/10.17605/OSF.IO/6VCXK, https://osf.io/6vcxk/.

  • Gautheron, L. (2021c). Launching a campaign of annotations on zooniverse with childproject. https://doi.org/10.12751/g-node.k2h9az.

  • Gilkerson, J., & Richards, J. (2008). The power of talk (LENA Foundation technical report ltr-01-2).

  • Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., Flandin, G., Ghosh, S. S., Glatard, T., Halchenko, Y. O., Handwerker, D. A., Hanke, M., Keator, D., Li, X., Michael, Z., Maumet, C., Nichols, B. N., Nichols, T. E., Pellman, J., et al. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3(1), 1–9. https://doi.org/10.1038/sdata.2016.44.

  • Halchenko, Y., Meyer, K., Poldrack, B., Solanky, D., Wagner, A., Gors, J., MacFarlane, D., Pustina, D., Sochat, V., Ghosh, S., Mönch, C., Markiewicz, C., Waite, L., Shlyakhter, I., de la Vega, A., Hayashi, S., Häusler, C., Poline, J. B., Kadelka, T., Skytén, K., Jarecka, D., Kennedy, D., Strauss, T., Cieslak, M., Vavra, P., Ioanas, H. I., Schneider, R., Pflüger, M., Haxby, J., Eickhoff, S., Hanke, M., et al. (2021). DataLad: Distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262. https://doi.org/10.21105/joss.03262.

  • Hanke, M., Pestilli, F., Wagner, A. S., Markiewicz, C. J., Poline, J. B., & Halchenko, Y. O. (2021). In defense of decentralized research data management. Neuroforum. https://doi.org/10.1515/nf-2020-0037.

  • King, G. (2007). An introduction to the dataverse network as an infrastructure for data sharing. Sociological Methods and Research, 36, 173–199.

  • Krippendorff, K. (2013). Content analysis: An introduction to its methodology. Los Angeles London: SAGE.

  • Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E., & Cristia, A. (2020). An open-source voice type classifier for child-centered daylong recordings. Interspeech.

  • Levin, H. I., Egger, D., Andres, L., Johnson, M., Bearman, S. K., & de Barbaro, K. (2021). Sensing everyday activity: Parent perceptions and feasibility. Infant Behavior and Development, 62, 101511.

  • Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. CoRR cs.CL/0205028. http://dblp.uni-trier.de/db/journals/corr/corr0205.html#cs-CL-0205028.

  • Lubbers, M., & Torreira, F. (2013–2021). pympi-ling: A Python module for processing ELAN’s EAF and Praat’s TextGrid annotation files. https://pypi.python.org/pypi/pympi-ling, version 1.70.

  • MacEwan, S. (2019). Homebank its file anonymizer. https://github.com/HomeBankCode/ITS_annonymizer.

  • MacWhinney, B. (2000a). The CHILDES project: The database (Vol. 2). Psychology Press.

  • MacWhinney, B. (2000b). The CHILDES project: Tools for analyzing talk (third edition): Volume I: Transcription format and programs, Volume II: The database. Computational Linguistics, 26(4), 657. https://doi.org/10.1162/coli.2000.26.4.657.

  • Mathet, Y., Widlöcher, A., & Métivier, J. P. (2015). The unified and holistic method gamma (γ) for inter-annotator agreement measure and alignment. Computational Linguistics, 41(3), 437–479. https://doi.org/10.1162/coli_a_00227

  • McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal forced aligner: Trainable text-speech alignment using kaldi. In Proc. Interspeech 2017 (pp. 498–502). https://doi.org/10.21437/Interspeech.2017-1386.

  • McKinney, W. (2010). Data structures for statistical computing in Python. In van der Walt, S., & Millman, J. (Eds.) Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a.

  • Mehl, M. R., & Pennebaker, J. W. (2003). The sounds of social life: A psychometric analysis of students’ daily social environments and natural conversations. Journal of Personality and Social Psychology, 84(4), 857–870. https://doi.org/10.1037/0022-3514.84.4.857

  • Mehl, M. R., Pennebaker, J. W., Crow, D. M., Dabbs, J., & Price, J. H. (2001). The electronically activated recorder (EAR): A device for sampling naturalistic daily activities and conversations. Behavior Research Methods, Instruments, & Computers, 33(4), 517–523. https://doi.org/10.3758/bf03195410

  • Nee, J. (2021). Understanding the effects of language revitalization workshops using long-format speech environment recordings. Proceedings of the Linguistic Society of America, 6(1), 213. https://doi.org/10.3765/plsa.v6i1.4967

  • Perkel, J. M. (2019). 11 ways to avert a data-storage disaster. Nature, 568(7750), 131–132. https://doi.org/10.1038/d41586-019-01040-w

  • Pisani, S., Gautheron, L., & Cristia, A. (2021). Long-form recordings: From a to z. https://bookdown.org/alecristia/exelang-book/.

  • Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: Data sharing in neuroimaging. Nature Neuroscience, 17(11), 1510–1517. https://doi.org/10.1038/nn.3818.

  • Powell, K. (2021). The broken promise that undermines human genome research. Nature, 590(7845), 198–201. https://doi.org/10.1038/d41586-021-00331-5

  • Räsänen, O., Seshadri, S., Lavechin, M., Cristia, A., & Casillas, M. (2020). ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Behavior Research Methods.

  • Riad, R., Titeux, H., Lemoine, L., Montillot, J., Bagnou, J. H., Cao, X. N., Dupoux, E., & Bachoud-Lévi, A. C. (2020). Vocal markers from sustained phonation in Huntington’s disease. Interspeech.

  • Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., & Liberman, M. (2018). First dihard challenge evaluation plan. Tech Rep.

  • Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., & Liberman, M. (2019). The second dihard diarization challenge: Dataset, task, and baselines. arXiv preprint arXiv:1906.07839.

  • Ryant, N., Church, K., Cieri, C., Du, J., Ganapathy, S., & Liberman, M. (2020). Third DIHARD challenge evaluation plan. arXiv preprint arXiv:2006.05815.

  • Schuller, B., Steidl, S., Batliner, A., Bergelson, E., Krajewski, J., Janott, C., Amatuni, A., Casillas, M., Seidl, A., Soderstrom, M., et al. (2017). The interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Interspeech.

  • Semenzin, C., Hamrick, L., Seidl, A., Lynne Kelleher, B., & Cristia, A. (2020a). Describing vocalizations in young children: A big data approach through citizen science annotation. https://doi.org/10.31219/osf.io/z6exv.

  • Semenzin, C., Hamrick, L., Seidl, A., Lynne Kelleher, B., & Cristia, A. (2020b). Towards large-scale data annotation of audio from wearables: Validating zooniverse annotations of infant vocalization types. https://doi.org/10.31219/osf.io/gpxf5.

  • Soderstrom, M., Casillas, M., Bergelson, E., Rosemberg, C., Alam, F., Warlaumont, A. S., & Bunce, J. (2021). Developing a cross-cultural annotation system and metacorpus for studying infants’ real world language experience. Collabra: Psychology, 7(1), 23445.

  • Soska, K., Xu, M., Gonzalez, S., Hertzberg, O., Gilmore, R. O., Tamis-LeMonda, C., & Adolph, K. E. (2021). (hyper) active data curation: A video case study from behavioral science. PsyArXiv. https://psyarxiv.com/89rcb/download?format=pdf.

  • Titeux, H., & Riad, R. (2021). pygamma-agreement: Gamma γ measure for inter/intra-annotator agreement in Python. https://hal.archives-ouvertes.fr/hal-03144116, working paper or preprint.

  • VanDam, M. (2015). Homebank vandam public 5-minute corpus. https://doi.org/10.21415/T5388S, http://homebank.talkbank.org/access/Public/VanDam-5minute.html.

  • VanDam, M., Warlaumont, A. S., Bergelson, E., Cristia, A., Soderstrom, M., De Palma, P., & MacWhinney, B. (2016). Homebank: An online repository of daylong child-centered audio recordings. Seminars in Speech and Language, NIH Public Access, 37, 128.

  • VanDam, M., Warlaumont, A., MacWhinney, B., Soderstrom, M., & Bergelson, E. (2018). Vetting manual: Preparation of recordings for unrestricted publication in homebank (version 1.1).

  • Van Essen, D. C., Smith, S. M., Barch, D. M., Behrens, T. E., Yacoub, E., Ugurbil, K., for the WU-Minn HCP Consortium. (2013). The WU-Minn human connectome project: An overview. NeuroImage, 80, 62–79.

  • Wagner, A. (2020). datalad-handbook/repro-paper-sketch: A template to create a reproducible paper with latex, makefiles, python, and datalad. Retrieved April 30, 2021, from https://github.com/datalad-handbook/repro-paper-sketch/.

  • Wagner, A. S., Waite, L. K., Meyer, K., Heckner, M. K., Kadelka, T., Reuter, N., Waite, A. Q., Poldrack, B., Markiewicz, C. J., Halchenko, Y. O., Vavra, P., Chormai, P., Poline, J. B., Paas, L. K., Herholz, P., Mochalski, L. N., Kraljevic, N., Wiersch, L., Hutton, A., et al. (2020). The DataLad Handbook. Zenodo. https://doi.org/10.5281/ZENODO.3608612, https://zenodo.org/record/3608612.

  • Walker, S., Grosjean, P., & Cristia, A. (2019). Long-form, child-centered audio-recordings collected in the Solomon Islands in 2019, unpublished private dataset.

  • Warlaumont, A. S., Richards, J. A., Gilkerson, J., & Oller, D. K. (2014). A social feedback loop for speech development and its reduction in autism. Psychological Science, 25(7), 1314–1324.

  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J. W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J., Groth, P., Goble, C., Grethe, J. S., Heringa, J., t’ Hoen, P. A., Hooft, R., Kuhn, T., Kok, R., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9. https://doi.org/10.1038/sdata.2016.18.

  • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. In 5th International Conference on Language Resources and Evaluation (LREC 2006) (pp. 1556–1559).

  • Wu, R., Liaqat, D., de Lara, E., Son, T., Rudzicz, F., Alshaer, H., Abed-Esfahani, P., & Gershon, A. S. (2018). Feasibility of using a smartwatch to intensively monitor patients with chronic obstructive pulmonary disease: Prospective cohort study. JMIR mHealth and uHealth, 6(6), e10046. https://doi.org/10.2196/10046

  • Xu, D., Yapanel, U., Gray, S., & Baer, C. (2008). The LENA language environment analysis system: The interpretive time segments (its) file. LENA Research Foundation Technical Report LTR-04-2.

  • Zevin, M., Coughlin, S., Bahaadini, S., Besler, E., Rohani, N., Allen, S., Cabero, M., Crowston, K., Katsaggelos, A. K., Larson, S. L., Lee, T. K., Lintott, C., Littenberg, T. B., Lundgren, A., Østerlund, C., Smith, J. R., Trouille, L., Kalogera, V., et al. (2017). Gravity spy: Integrating advanced LIGO detector characterization, machine learning, and citizen science. Classical and Quantum Gravity, 34(6), 064003. https://doi.org/10.1088/1361-6382/aa5cea.

Acknowledgments

We would like to thank Elika Bergelson, Anne Warlaumont, Brian MacWhinney, Federica Bulgarelli, Rick Gilmore, Maria Cruz Blandon and Okko Räsänen for feedback on the paper and/or the project; Elika Bergelson's lab members and the pre-PI DARCLE group for feedback on presentations of these materials; Camila Scaff for feedback on the package, presentations, tutorial, and much more; and Martin Frébourg for the MFA.

Funding

This work has benefited from funding and/or institutional support from Agence Nationale de la Recherche (ANR-17-CE28-0007 LangAge, ANR-16-DATA-0004 ACLEW, ANR-14-CE30-0003 MechELex, ANR-17-EURE-0017); and the J. S. McDonnell Foundation Understanding Human Cognition Scholar Award. We also benefited from code developed in the Bridges system, which is supported by NSF Award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC), using the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant number OCI-1053575. Additionally, we benefited from processing in GENCI-IDRIS, France (Grant-A0071011046). Some capabilities of our software depend on the Zooniverse.org platform, the development of which is funded by generous support, including a Global Impact Award from Google, and by a Grant from the Alfred P. Sloan Foundation. The funders had no impact on this study.

Author information

Corresponding author

Correspondence to Lucas Gautheron.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Examples of storage strategies

Appendix 1: Example 1—sharing a dataset within the lab

In the first example, Alice is hosting large datasets of a few terabytes of recordings and annotations, and she wants to share them securely with Bob, a collaborator from her own institution. Alice and Bob are familiar with GitHub, and they like its user-friendly features such as issues and pull requests. However, GitHub cannot handle such amounts of data.

Alice decides to store the git repository itself on GitHub (a GitLab instance would work equally well), which lets them benefit from these features while the large files (the recordings and annotations) are hosted elsewhere. Alice’s laboratory has its own cluster, with a large storage capacity. Thus, she decides to host the files there for free rather than using a cloud provider.

Since Bob has been given SSH access to the cluster and belongs to the right UNIX group, he can download recordings and annotations from their institution’s cluster. Alice also configured the dataset with DataLad’s “publish-depends” option, so that every change published to GitHub is also published to the cluster.
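The effect of a publish dependency can be pictured with a small Python model. This is only a sketch of the idea, not DataLad code: the sibling names `github` and `cluster` and the `push_order` helper are hypothetical, and the model simply assumes that publishing to a sibling first publishes to everything it depends on.

```python
from typing import Dict, List

# Hypothetical sibling names. The mapping mirrors the idea behind a
# publish dependency; it is a toy model, not DataLad's implementation.
PUBLISH_DEPENDS: Dict[str, List[str]] = {
    "github": ["cluster"],  # publishing to GitHub first publishes to the cluster
    "cluster": [],
}


def push_order(target: str, depends: Dict[str, List[str]]) -> List[str]:
    """Return the siblings to publish to: dependencies first, target last."""
    order: List[str] = []

    def visit(name: str, stack: tuple) -> None:
        if name in stack:
            raise ValueError(f"circular publish dependency at {name!r}")
        if name in order:
            return
        for dependency in depends.get(name, []):
            visit(dependency, stack + (name,))
        order.append(name)

    visit(target, ())
    return order


print(push_order("github", PUBLISH_DEPENDS))
```

With this ordering, the large files reach the cluster before the git history on GitHub starts referring to them.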

For backup purposes, a third sibling is hosted on Amazon S3 Glacier, which is cheaper than S3 at the expense of higher retrieval costs and delays, as a git-annex special remote (see footnote 26). Special remotes do not store the git history and they cannot be used to clone the dataset. However, they can be used as storage for the recordings and other large files. In order to increase the security of the data, Alice uses encryption. Git-annex implements several encryption schemes (see footnote 27). The hybrid scheme allows public GPG keys to be added at any time without additional decryption/encryption steps. Each user can then later decrypt the data with their own private key. This way, as long as at least one private GPG key has not been lost, data are still recoverable. This is especially valuable in that it naturally ensures redundancy of the decryption keys, which is critical in the case of encrypted backups.

By default, file names are hashed with an HMAC algorithm, and their content is encrypted with AES-128—GPG’s default, although another algorithm could be selected.
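To give a feel for what hashing file names with an HMAC means, here is a minimal sketch using Python's standard library. It illustrates keyed hashing in general and is not git-annex's exact derivation: the secret, the file name, the `obscure_name` helper, and the choice of SHA-1 as the underlying hash are all assumptions made for the example.

```python
import hashlib
import hmac


def obscure_name(secret: bytes, file_key: str) -> str:
    """Return a keyed hash (HMAC) of a file identifier, as a hex string.

    Without the secret, the stored name reveals nothing readable about
    the original one, yet the same secret always maps the same input to
    the same output, so the file can be found again.
    """
    return hmac.new(secret, file_key.encode("utf-8"), hashlib.sha1).hexdigest()


# Hypothetical secret and file name, for illustration only.
stored_name = obscure_name(b"example-secret", "recordings/child42_day1.wav")
print(len(stored_name))  # a SHA-1 digest is 40 hex characters long
```

The storage provider therefore sees neither the content of the files nor names that could identify participants.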

This setup ensures redundancy of git files (hosted on both GitHub and the cluster) as well as large files (stored on both the cluster and Amazon Deep Glacier). It also allows Bob to signal and correct errors he finds, and/or to add annotations in a straightforward manner, benefiting Alice. By virtue of having siblings, they can make sure that their local dataset is organized in an identical manner, facilitating collaboration and reproducibility in their analyses.
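The redundancy requirement can be stated as a simple check: every large file should be present on at least two siblings. The sketch below assumes a hypothetical mapping from file paths to the siblings holding their content; in practice git-annex keeps track of this information itself, and the file and sibling names here are invented.

```python
from typing import Dict, List, Set


def under_replicated(locations: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return the files stored on fewer than `minimum` siblings."""
    return sorted(
        path for path, siblings in locations.items() if len(siblings) < minimum
    )


# Hypothetical inventory of where each file's content is stored.
locations = {
    "recordings/rec1.wav": {"cluster", "glacier"},
    "recordings/rec2.wav": {"cluster"},  # not backed up yet
    "annotations/rec1.eaf": {"cluster", "glacier"},
}
print(under_replicated(locations))
```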

Table 5 illustrates such a strategy. In this example, users install the dataset from a private GitHub repository. Continuous testing is configured with Travis CI (see footnote 28), in order to ensure the integrity of the dataset at every step. GitHub Actions could also be used for that purpose (see footnote 29).
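A continuous test of this kind could run a validation step that either passes or produces an explicit list of errors. The sketch below is not ChildProject's validator: the column names, the `validate_recordings` helper, and the sample data are hypothetical, and the check is limited to required columns, empty values, and missing recording files.

```python
import csv
import io
from typing import List, Set

# Hypothetical required columns, chosen for the example.
REQUIRED_COLUMNS = ["child_id", "recording_filename"]


def validate_recordings(csv_text: str, existing_files: Set[str]) -> List[str]:
    """Return a list of human-readable errors (empty if the table is valid)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    errors: List[str] = []
    # Data rows start on line 2, since line 1 holds the header.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                errors.append(f"line {line_number}: empty value for '{column}'")
        name = row["recording_filename"]
        if name and name not in existing_files:
            errors.append(f"line {line_number}: recording '{name}' not found")
    return errors


table = "child_id,recording_filename\nC1,a.wav\n,b.wav\nC3,c.wav\n"
for error in validate_recordings(table, existing_files={"a.wav", "b.wav"}):
    print(error)
```

A continuous integration job would then fail whenever the returned list is not empty.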

We used this strategy, minus the Glacier backups, to maintain and deliver 4 datasets with 8700 h of audio (see footnote 30) for several months. The associated scripts can be found in Gautheron (2021b). We have now transitioned to using GIN for our main site, with our cluster as the backup. The scripts associated with this set-up can be found at the same location.

Table 5 Example 1—Storage strategy example relying on GitHub and a cluster to deliver the data

Appendix 2: Example 2—sharing large datasets with outside collaborators (S3)

The previous strategy is not suitable when complex permissions are required, since SSH remotes only handle Unix-style permissions (user, group, all).
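The limitation can be made concrete with Python's `stat` module: a Unix mode only distinguishes three classes of users, so it cannot express rules such as "Bob may read confidential files but Carol may not" unless a dedicated group is created for each case. The `can_read` helper below is a simplified model written for this example, not an operating system call.

```python
import stat


def can_read(mode: int, is_owner: bool, in_group: bool) -> bool:
    """Simplified model of the Unix read check: owner, then group, then others."""
    if is_owner:
        return bool(mode & stat.S_IRUSR)
    if in_group:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# A regular file readable and writable by its owner, readable by the group.
mode = stat.S_IFREG | 0o640
print(stat.filemode(mode))                               # -rw-r-----
print(can_read(mode, is_owner=False, in_group=True))     # True
print(can_read(mode, is_owner=False, in_group=False))    # False
```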

Moreover, Alice may want to share the dataset with collaborators outside her lab, without giving them SSH access to its cluster. Or, she may not even own the infrastructure that would allow her to store and share such large amounts of data.

Instead, she decides to use Amazon S3 together with GitHub. Authorized users are provided their own Amazon S3 API key and secret, which are managed with Amazon’s Identity and Access Management (IAM) service. The GitHub repository is stripped of all confidential data, which are stored in the S3 annex only, so that access permissions can be managed entirely through IAM. This strategy is used by the Human Connectome Project (see footnote 23).
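As an illustration of what managing permissions through IAM can look like, the following sketch builds a read-only policy document for a single bucket. The bucket name and the `read_only_policy` helper are hypothetical, and this is not the policy used by the Human Connectome Project or by the authors; it only shows the general shape of such a document.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Downloading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "example-longform-dataset" is a made-up bucket name.
print(json.dumps(read_only_policy("example-longform-dataset"), indent=2))
```

A collaborator who should also contribute annotations would need a separate policy that additionally allows writing.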

Furthermore, Alice makes sure to encrypt GDPR-relevant data, using strong symmetric encryption (AES-128). This strategy is illustrated in Table 6.

Table 6 Example 2—Storage strategy example relying on GitHub and Amazon S3

Amazon is superior to most alternatives for a number of reasons, including that it is highly tested, developed by engineers with a high level of knowledge of the platform, and widely used. This means that the code is robust even before it is released, and it is widely tested once it is released. The fact that there are many users also entails that issues or questions can be looked up online. In addition, in the context of data durability, Amazon is a good choice because it is “too big to fail”, and thus probably available for the long term. Moreover, in sheer terms of flexibility and coverage, Amazon provides a whole suite of tools (for data sharing, backups, and processing), which may be useful for researchers with little access to high-capacity infrastructures. Additionally, it is not very costly (see the comparison table at https://childproject.readthedocs.io/en/paper/vandam.html?highlight=amazon#where-to-publish-my-dataset).

Appendix 3: Example 3—sharing large datasets with outside collaborators and multi-tier access (GIN)

Due to legislation in some countries, some researchers may not be authorized to store their data on Amazon. If they do not have access to a local cluster (see Example 1), or if they have one but need finer control over access permissions, alternatives are available.

Finding herself in this situation, Alice decides to use the G-Node Infrastructure (GIN; see footnote 31), which is dedicated to providing “Modern Research Data Management for Neuroscience”. GIN is similar to GitLab and GitHub in many aspects, except that it also supports git-annex and thus can directly host the large files, which require third-party providers when using those platforms.

Just like GitLab or GitHub, it can handle complex permissions, at the user or group-level, thus surpassing Unix-style permissions management.

In this case, Alice needs three permission tiers: (1) read-only access to anonymized data, (2) read-only access to confidential data, and (3) read and write access to the whole data. In order to achieve this, she creates two GIN siblings per dataset: origin and confidential. The dataset is configured to publish all the files whose path contains /confidential/ to the confidential repository, while the rest of the data is published to origin. Alice could then grant read-only access to origin to both Bob and Carol, while restricting the access to confidential to Bob only.
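The routing rule can be sketched as a small function that maps each file path to the sibling it should be published to. This is an illustration of the rule only, not the actual DataLad or git-annex configuration; the `sibling_for` helper and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def sibling_for(path: str) -> str:
    """Return the sibling a file is published to, based on its directory."""
    # A file is confidential if any of its parent directories is named so.
    directories = PurePosixPath(path).parent.parts
    return "confidential" if "confidential" in directories else "origin"


# Hypothetical paths inside a dataset.
for example in [
    "recordings/raw/rec1.wav",
    "metadata/confidential/children.csv",
    "annotations/its/confidential/raw/rec1.its",
]:
    print(example, "->", sibling_for(example))
```

Keeping the rule purely path-based means that contributors only need to place sensitive files in the right directory for them to end up in the restricted repository.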

Since Alice has not been allowed to use a cloud provider, and is lacking a local infrastructure, she needs an alternate solution for her backups. She may use external hard drives, since DataLad can push data to local storage just as with any other kind of storage.

Table 7 sums up this strategy, which is currently used to deliver the EL1000 dataset (see footnote 32), except for the backup, which is located on our cluster. EL1000 is a composite dataset, created from the contributions of 18 different teams that collected data independently but using comparable methods.

Table 7 Example 3—Storage strategy example relying solely on GIN to deliver the data

Appendix 4: Example 4—Sharing smaller datasets (OSF)

The Open Science Framework (OSF) is especially interesting because it supports DOI registration, providing permanent URLs to access the datasets. Moreover, an extension of DataLad has specifically been developed to work with OSF, which may host both the git repository and the large files (see Table 3). In addition, Shibboleth credentials can be used with OSF.

Low quotas are an important downside with OSF. Public projects are limited to 50 GB, and private projects cannot exceed 5 GB, which is too low for most long-form datasets. However, OSF could be used only to host the git repository, effectively providing a permanent URL from which the dataset can be installed, as long as the content of the large files remains available from a third-party provider, e.g. with Amazon S3. Table 8 illustrates such a strategy.
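A rough size estimate shows why these quotas are restrictive. The sketch below uses the limits quoted above (50 GB public, 5 GB private) and assumes uncompressed 16 kHz, 16-bit, mono audio; that audio format is an assumption made for the arithmetic, not a property of any particular dataset, and both helpers are hypothetical.

```python
# Quotas quoted in the text above, in gigabytes.
OSF_LIMITS_GB = {"public": 50, "private": 5}


def audio_size_gb(hours: float, sample_rate_hz: int = 16000,
                  bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Size of uncompressed PCM audio in GB (1 GB = 1e9 bytes)."""
    return hours * 3600 * sample_rate_hz * bytes_per_sample * channels / 1e9


def fits_on_osf(size_gb: float, visibility: str) -> bool:
    """Whether a dataset of the given size fits within the OSF quota."""
    return size_gb <= OSF_LIMITS_GB[visibility]


for hours in (40, 100, 8700):
    size = audio_size_gb(hours)
    print(f"{hours} h: {size:.2f} GB, "
          f"private: {fits_on_osf(size, 'private')}, "
          f"public: {fits_on_osf(size, 'public')}")
```

Under these assumptions, about 43 hours of audio already reach the 5 GB private quota, and a collection of 8700 h is on the order of a terabyte.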

Table 8 Example 4—Storage strategy example relying on OSF and Amazon S3 to deliver the data

We use the reverse approach for our demo dataset (see footnote 33), which is based on VanDam (2015): the git repository is hosted on GitHub, and the large files on OSF. This is possible only because of the small size of the dataset.

About this article

Cite this article

Gautheron, L., Rochat, N. & Cristia, A. Managing, storing, and sharing long-form recordings and their annotations. Lang Resources & Evaluation (2022). https://doi.org/10.1007/s10579-022-09579-3

  • DOI: https://doi.org/10.1007/s10579-022-09579-3

Keywords

  • Daylong recordings
  • Speech data management
  • Data distribution
  • Annotation evaluation
  • Inter-rater reliability
  • Reproducible research