Dataset for file fragment classification of audio file formats

Khodadadi, Atieh; Teimouri, Mehdi

doi:10.1186/s13104-019-4856-1

Data note
Open access
Published: 21 December 2019

Volume 12, article number 819, (2019)
BMC Research Notes

Abstract

Objectives

File fragment classification of audio file formats is a topic of interest in network forensics. A few datasets of complete files in audio formats are publicly available. However, there is no public dataset of file fragments of audio file formats. As a result, a major research challenge in file fragment classification of audio file formats is comparing the performance of different methods on the same dataset.

Data description

In this study, we present a dataset that contains file fragments of 20 audio file formats: AMR, AMR-WB, AAC, AIFF, CVSD, FLAC, GSM-FR, iLBC, Microsoft ADPCM, MP3, PCM, WMA, A-Law, µ-Law, G.726, G.729, Microsoft GSM, OGG Vorbis, OPUS, and SPEEX. For each format, the dataset contains file fragments of audio files with different compression settings. For each pair of file format and compression setting, 210 file fragments are provided. In total, the dataset contains 20,160 file fragments.

Objective

A considerable amount of Internet traffic is used for exchanging audio files. Because these files are usually much larger than the maximum network packet size, they are segmented into fragments. The fragments generated by various users are transmitted over the network. Some of these fragments can be received by the network surveillance unit. The network surveillance unit may wish to detect the file format of each fragment for network forensics purposes.

Some research has been carried out on file fragment classification of audio file formats [1,2,3,4]. A few datasets of files in different formats are publicly available [5,6,7]. However, there is no public dataset of file fragments of audio file formats. This makes it difficult for researchers to compare newly proposed methods with existing ones.

In this study, we present a dataset that contains file fragments of 20 audio file formats: Adaptive Multi-Rate (AMR), Adaptive Multi-Rate Wideband (AMR-WB), Advanced Audio Coding (AAC), Audio Interchange File Format (AIFF), Continuously Variable Slope Delta modulation (CVSD), Free Lossless Audio Codec (FLAC), Global System for Mobile Communications Full Rate (GSM-FR), Internet Low Bitrate Codec (iLBC), Microsoft Adaptive Differential Pulse Code Modulation (ADPCM), MPEG Audio Layer-3 (MP3), Pulse-Code Modulation (PCM), Windows Media Audio (WMA), A-Law, µ-Law, G.726, G.729, Microsoft GSM, OGG Vorbis, OPUS, and SPEEX. For each format, the dataset contains file fragments of audio files with different compression settings.

Data description

First, the complete set of uncoded (raw) speech files is taken from www.voxforge.org [8]. These raw files are then converted to obtain audio files in 20 different formats: AMR, AMR-WB, AAC, AIFF, CVSD, FLAC, GSM-FR, iLBC, Microsoft ADPCM, MP3, PCM, WMA, A-Law, µ-Law, G.726, G.729, Microsoft GSM, OGG Vorbis, OPUS, and SPEEX. For each audio file format, different compression settings are considered. The raw data is the same for all compression settings of a given format. However, there is no overlap between the raw data used for different formats.

In total, 96 pairs of file format and compression setting are considered. For each pair, we have 210 compressed audio files, giving 20,160 audio files in total. Each of these files is segmented into fragments of 1 Kbyte (i.e. 1024 bytes). Before a fragment is selected, the first 12.5% and the last 12.5% of the fragments of each file are discarded. This ensures that the fragments do not contain file headers or trailers. Then, one fragment is randomly selected from the remaining fragments of each file.
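The selection procedure described above can be sketched in Python as follows. This is only an illustration, not the authors' code: the function names are hypothetical, and two details that the data note does not specify are assumptions here, namely that an incomplete trailing fragment is dropped and that the number of discarded fragments at each end is rounded up.

```python
import math
import random

FRAGMENT_SIZE = 1024    # bytes per fragment, as stated in the data note
TRIM_FRACTION = 0.125   # share of fragments discarded at each end of a file


def split_into_fragments(data, size=FRAGMENT_SIZE):
    """Cut data into consecutive full-size fragments.

    Assumption: a short trailing remainder is dropped.
    """
    count = len(data) // size
    return [data[i * size:(i + 1) * size] for i in range(count)]


def pick_fragment(data, rng):
    """Discard 12.5% of the fragments at each end, then pick one at random."""
    fragments = split_into_fragments(data)
    # Assumption: round the number of discarded fragments up.
    skip = math.ceil(len(fragments) * TRIM_FRACTION)
    candidates = fragments[skip:len(fragments) - skip]
    if not candidates:
        raise ValueError("file is too short to yield an interior fragment")
    return rng.choice(candidates)


# Usage with synthetic data: 16 fragments, each filled with its own index.
data = b"".join(bytes([k]) * FRAGMENT_SIZE for k in range(16))
fragment = pick_fragment(data, random.Random(0))
```

With 16 fragments, two are discarded at each end, so the selected fragment always comes from positions 2 to 13 (zero-based).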

For each pair of file format and compression setting, we have 210 file fragments. Hence, the dataset contains 20,160 file fragments in total. The dataset is partitioned according to the 20 file formats. Each partition is represented by an individual data file shown in Table 1. For example, data file 1 (i.e. aac.zip) contains 7 sub data files: aac-8.dat, aac-16.dat, aac-32.dat, aac-48.dat, aac-64.dat, aac-80.dat, and aac-96.dat. Sub data files are provided in a generic binary data file format with the .dat file extension. Each sub data file contains 210 fragments.
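As a rough illustration of these counts, the Python sketch below splits the contents of one sub data file into fragments. The data note does not specify the byte layout of the .dat files, so the sketch assumes that the 210 fragments are stored back to back as raw 1024-byte blocks with no header. That assumption should be checked against ReadFragments.m before relying on it. The function names are hypothetical.

```python
from pathlib import Path

FRAGMENT_SIZE = 1024            # bytes per fragment (from the data note)
FRAGMENTS_PER_SUBFILE = 210     # fragments per sub data file (from the data note)
PAIRS = 96                      # pairs of file format and compression setting

# 96 pairs x 210 fragments per pair = 20,160 fragments in the whole dataset.
EXPECTED_TOTAL = PAIRS * FRAGMENTS_PER_SUBFILE


def split_subfile(raw, size=FRAGMENT_SIZE):
    """Split the bytes of one sub data file into fixed-size fragments.

    Assumption: fragments are concatenated with no header or padding.
    """
    if len(raw) % size != 0:
        raise ValueError("length is not a multiple of the fragment size")
    return [raw[i:i + size] for i in range(0, len(raw), size)]


def read_subfile(path):
    """Read a sub data file such as aac-8.dat from disk and split it."""
    return split_subfile(Path(path).read_bytes())
```

Under this assumed layout, a sub data file would be 210 × 1024 = 215,040 bytes long, which gives a quick sanity check on a downloaded file.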

Table 1 Overview of data files

Data file 21 (i.e. SettingsTable.pdf) contains a table that specifies the 96 pairs of file format and compression setting. This table also specifies the software program used to generate each file format. Data file 22 (i.e. ConversionSettings.zip) contains several screenshots of the software programs that display the compression settings used. Data file 23 (i.e. ReadFragments.m) is a MATLAB script that reads all the fragments from one or more sub data files. By running this script and selecting some sub data files, the fragments contained in these sub data files are read and stored in a variable named Dataset. The variable Dataset is a MATLAB cell array with two rows. Each column in this cell array corresponds to one of the selected sub data files. The first element of each column is a string value that specifies the sub data file name. The second element of each column is a structure array with only one field named fragments. Dataset{2, i}(j).fragments (j = 1,2,…,210) is a cell array of length one that contains one fragment of the jth file in the selected sub data file i.
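For readers working outside MATLAB, the layout of Dataset can be mirrored in Python as sketched below. This is an analogue built from the description above, not a port of ReadFragments.m; the helper names build_dataset and get_fragment are hypothetical, and the fragment contents in the usage lines are placeholder bytes rather than real data.

```python
def build_dataset(subfiles):
    """Mirror the two-row MATLAB cell array as a list of two rows.

    subfiles maps a sub data file name to its list of fragments (bytes).
    Row 0 holds the names; row 1 holds, per sub data file, a list of
    records with a single 'fragments' field of length one.
    """
    names = list(subfiles)
    records = [[{"fragments": [frag]} for frag in subfiles[name]]
               for name in names]
    return [names, records]


def get_fragment(dataset, i, j):
    """Return the fragment addressed in MATLAB as Dataset{2, i}(j).fragments{1}.

    i and j are 1-based, as in MATLAB; Python lists are 0-based.
    """
    return dataset[1][i - 1][j - 1]["fragments"][0]


# Usage with placeholder fragments (two sub data files, 210 fragments each).
dataset = build_dataset({
    "aac-8.dat": [bytes([1]) * 1024 for _ in range(210)],
    "aac-16.dat": [bytes([2]) * 1024 for _ in range(210)],
})
first_of_second = get_fragment(dataset, 2, 1)
```

The one-element list under 'fragments' is kept only to match the MATLAB description; in new Python code a plain list of bytes objects per sub data file would usually be simpler.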

Limitations

The fragment size is fixed at 1024 bytes.
Only a defined subset of file formats and compression settings is considered.

Availability of data materials

The data described in this Data note can be freely and openly accessed on OSF at https://doi.org/10.17605/OSF.IO/AHCYU [9]. Please see Table 1 and reference list for details and links to the data.

Abbreviations

AMR: Adaptive Multi-Rate
AMR-WB: Adaptive Multi-Rate Wideband
AAC: Advanced Audio Coding
AIFF: Audio Interchange File Format
CVSD: Continuously Variable Slope Delta modulation
FLAC: Free Lossless Audio Codec
GSM-FR: Global System for Mobile Communications Full Rate
iLBC: Internet Low Bitrate Codec
ADPCM: Adaptive Differential Pulse Code Modulation
MP3: MPEG Audio Layer-3
PCM: Pulse-Code Modulation
WMA: Windows Media Audio

References

1. Hicsonmez S, Sencar HT, Avcibas I. Audio codec identification from coded to transcoded audios. Digit Signal Process. 2013;23(5):1720–30.
2. Din M, Ratan R, Bhateja AK, Bhateja A. Multimedia classification using ANN approach. In Proceedings of the second International Conference on soft computing for problem solving (SocProS 2012), Dec 28–30, 2012, 2014 (pp. 905–910). Springer: New Delhi.
3. Asthana R, Verma N, Ratan R. Classification of distorted text and speech using projection pursuit features. In 2015 International Conference on Advances in computing, communications and informatics (ICACCI) 2015 Aug 10 (pp. 1408–1413). IEEE.
4. Maithani S, Din M. Speech systems classification based on frequency of binary word features. In 2004 International Conference on signal processing and communications, 2004. SPCOM’04. 2004 Dec 11 (pp. 193–197). IEEE.
5. Grajeda C, Breitinger F, Baggili I. Availability of datasets for digital forensics–And what is missing. Digit Invest. 2017;22:S94–105.
6. Fakouri R, Teimouri M. Dataset for file fragment classification of image file formats. BMC Res Notes. 2019;12:774. https://doi.org/10.1186/s13104-019-4812-0.
7. Mansouri Hanis F, Teimouri M. Dataset for file fragment classification of textual file formats. BMC Res Notes. 2019;12:801. https://doi.org/10.1186/s13104-019-4837-4.
8. VoxForge Speech Corpus [Internet]. http://www.voxforge.org/. Accessed 10 May 2019.
9. Khodadadi A, Teimouri M. Audio File Fragments Dataset and Code [Internet]. OSF; 2019. https://doi.org/10.17605/OSF.IO/AHCYU.

Acknowledgements

Not applicable.

Funding

The authors declare no source of funding.

Author information

Authors and Affiliations

Information Theory and Coding Laboratory, University of Tehran, Tehran, Iran
Atieh Khodadadi & Mehdi Teimouri

Contributions

MT designed the study. AK collected the data. MT and AK wrote the code. MT wrote the original draft of the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Mehdi Teimouri.

Ethics declarations

Ethics approval and consent to participate

No human subjects were part of this study and permission was thus not required according to the Institutional Review Board guidelines of author one.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Khodadadi, A., Teimouri, M. Dataset for file fragment classification of audio file formats. BMC Res Notes 12, 819 (2019). https://doi.org/10.1186/s13104-019-4856-1

Received: 31 October 2019
Accepted: 12 December 2019
Published: 21 December 2019
DOI: https://doi.org/10.1186/s13104-019-4856-1
