1 Introduction

Despite the unique characteristics and potential applications of terahertz radiation, its practical exploitation only began in the late 1980s with the groundbreaking development of subpicosecond photoconductive antennas by Smith et al. [1]. These antennas played a pivotal role in overcoming the challenges associated with generating and accurately detecting terahertz radiation, which were the primary obstacles to its practical use. Building upon these advancements, Hu and Nuss [2] further emphasised the exceptional opportunities provided by terahertz time-domain imaging, extending its scope beyond spectroscopy. This capability has accelerated the expansion of terahertz technology into non-destructive testing applications, including art conservation, industrial product quality testing, and concealed explosive detection. Today, terahertz time-domain spectroscopy (THz-TDS) is widely applied in various fields spanning fundamental science to industrial engineering applications [3,4,5]. Following its introduction as a highly specialised tool by a small group of research laboratories, terahertz time-domain spectroscopy has evolved into a large field of study with a user base that ranges from expert scientists with decades of experience in time-domain technology to general laboratory technicians running samples on commercial turnkey THz-TDS instruments.

A distinct advantage of THz-TDS is its ability to simultaneously measure the amplitude and the phase information of the electric field. This distinguishes it from most infrared spectroscopy techniques since it allows the direct extraction of the complex refractive index and the complex dielectric constant without relying on the Kramers-Kronig relation. The working principle of THz-TDS involves acquiring a time-domain waveform followed by data processing to transform the time-domain data into a frequency-domain spectrum. This spectral information is heavily affected by the parameter settings in data acquisition and processing. Therefore, an in-depth understanding of the signal processing routine and the parameters used is essential to achieve repeatable and reproducible spectral analysis. While commercially available THz-TDS systems often provide a bundled software package for analysing the measured data, it is not always transparent what steps are carried out precisely, what assumptions are made, and what parameters are used. This lack of transparency can result in unintended variations in the data analysis methodology, and hence in the resultant spectral data, when the same sample is measured on instruments from different vendors or processed using different software [6].
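As an illustration of how amplitude and phase together yield the refractive index directly, the following minimal Python sketch transforms a synthetic sample and reference waveform to a single frequency and evaluates the phase delay. It is not the routine used by any particular software package. The Gaussian pulse shape, the function names, and the sample parameters are all assumptions chosen for the example. The sample is taken to be thin enough that the phase delay stays below π (so no phase unwrapping is needed), and Fresnel losses and etalon reflections are ignored.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def pulse(t, t0, sigma=0.3e-12):
    """Gaussian stand-in for a terahertz pulse centred at t0 (assumed shape)."""
    return math.exp(-((t - t0) ** 2) / (2 * sigma ** 2))


def spectrum_at(times, field, freq):
    """Discrete Fourier sum of a sampled waveform at one frequency (Hz)."""
    return sum(e * cmath.exp(-2j * math.pi * freq * t)
               for t, e in zip(times, field))


def refractive_index(times, sample, reference, thickness, freq):
    """Return (n, |H|) from the complex ratio H = E_sample / E_reference.

    Assumes a phase delay below pi, so no unwrapping is performed.
    """
    h = spectrum_at(times, sample, freq) / spectrum_at(times, reference, freq)
    phase_delay = -cmath.phase(h)  # a delayed pulse gives a negative phase
    n = 1.0 + C * phase_delay / (2 * math.pi * freq * thickness)
    return n, abs(h)


# Synthetic measurement: 100 um thick sample with n = 1.5 and 80 % transmission.
dt = 0.05e-12                           # sampling interval, s
times = [k * dt for k in range(400)]    # 20 ps time window
thickness = 100e-6                      # m
delay = (1.5 - 1.0) * thickness / C     # extra transit time through the sample

reference = [pulse(t, 5e-12) for t in times]
sample = [0.8 * pulse(t, 5e-12 + delay) for t in times]

n_est, amplitude_ratio = refractive_index(times, sample, reference,
                                          thickness, freq=1e12)
print(f"n = {n_est:.4f}, |H| = {amplitude_ratio:.4f}")
```

Even in this reduced form, the result depends on choices such as the time window, the sampling interval, and the sign convention of the transform, which is why undocumented processing parameters make results difficult to compare.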

As a result, many research groups in the terahertz time-domain field develop their own analysis tools. However, the use of a multitude of incompatible data structures complicates the exchange and application of these tools. In the case of custom-built spectrometers, simple ASCII text files are commonly used to store data for individual measurements. This approach requires manual differentiation between sample and reference data for each measurement. Furthermore, essential metadata, such as sample thickness, temperature, or concentration, is typically logged manually in laboratory notebooks and is not captured in the digital file, making it challenging to re-analyse old data or share it with colleagues from different groups. It also makes it harder to comply with the increasingly common funders' mandates to make available all data associated with a publication. These open access requirements often stipulate that such data be provided in a machine-readable, accessible, described, and re-usable format that ideally contains unmodified and complete data [7]. For commercial systems, some instruments utilise binary file structures with varying degrees of complexity. However, the often proprietary nature of these file formats, combined with an undocumented file architecture that can change between software package releases, makes exchanging information difficult and can render it impossible to re-analyse archived data once the software package has been updated. A standardised data format is needed to facilitate collaboration, reproducibility, and the long-term accessibility of terahertz spectroscopy data.

Our research group has utilised a set of in-house developed MATLAB script tools that have gradually evolved over decades. While these tools have provided us with excellent flexibility in data analysis, they have also resulted in redundant code and posed challenges in properly documenting the code and maintaining a comprehensive understanding of the algorithms. Additionally, with the growth of the group and the availability of more instruments, we have faced the increasing burden of managing large volumes of data.

To address these issues, we recently decided to enhance the usability of our tools with a graphical user interface (GUI) for more intuitive, interactive, and efficient analysis. However, when we shared these newly developed tools with collaborators, compatibility issues arose due to the diverse data formats used by different commercial and home-built spectrometers. Similarly, collaborations among individuals and groups in the terahertz community are often limited to users of a specific TDS system or require laborious manual data conversion in order to utilise existing signal processing routines. Such barriers hinder progress within the scientific community.

To overcome these limitations, we propose a solution by introducing a standardised dotTHz format for terahertz time-domain data, the Cambridge THz Converter (CaTx, see Section B) to facilitate the adoption of this data format, and the Cambridge THz Spectrum Analyser (CaTSper, see Section C) as a simple GUI-based processing platform for THz-TDS data analysis. Both software tools have been released as open source under the MIT licence [8, 9]. We are also actively developing additional tools that will be shared in due course. Moreover, comprehensive information including processing methods, step-by-step user guides, and inline code annotations can be accessed through the online documentation [10].

2 The dotTHz Data Format

2.1 Format Structure

Terahertz time-domain waveforms comprise a series of numeric values representing the amplitude of the electric field as a function of time. To extract the optical constants from such data for a specific sample, it is necessary to record both the time-domain waveform of the sample and a reference waveform, along with essential information about the measurement settings and the sample. This implies the need to manage and store at least a pair of data files for each measurement. For the sake of simple and efficient data management, the dotTHz project adopts the hierarchical data format version 5 (HDF5) [11]. The HDF5 format was initially developed by a collaboration between the US National Center for Supercomputing Applications (NCSA) and the US Department of Energy’s Advanced Simulation and Computing Program (ASC) to deal with extensive and complex data. By embracing the same principle, the dotTHz data format delivers the following key advantages to users:

  1. Simple data structure for easy handling.
  2. Logical data organisation for efficient data retrieval and referencing.
  3. Direct attachment of essential metadata for convenient automated processing and analysis.
  4. Ability to process specific subsets of data from large files.
  5. Ability to store different types (e.g. time-domain waveforms, spatial coordinates, metadata) of data in a single dataset.
  6. High-speed performance with contiguous and uncompressed datasets.
  7. Wide platform support as an open-source format.
  8. Easy data sharing with all information stored in a single file.

The dotTHz file follows a specific structure: for each measurement, a group of datasets corresponding to sample and reference measurements is stored together with the attributes that contain the metadata, as illustrated in Fig. 1. The attributes can have various forms, such as numeric value, numeric vector, and string (Table 1), enabling efficient extraction and referencing of information during subsequent analysis and data processing.

Fig. 1 The hierarchical structure of the dotTHz data format: multiple measurements with associated metadata can be stored in a single dotTHz file

It is essential to emphasise that a single dotTHz file has the capability to accommodate multiple measurements. This enables the consolidation of data pertaining to a time series of measurements or variable temperature measurements of the same sample within a single file. Furthermore, this approach facilitates and simplifies the archiving and sharing of experimental data.
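The hierarchy described above (file, measurement group, datasets, attributes) can be pictured with a small in-memory model. The Python sketch below is only a stand-in for the HDF5 layout and uses the standard library alone. The class names, the dataset labels "sample" and "reference", the attribute keys, and the two-row waveform layout (time axis, field amplitude) are illustrative assumptions and are not taken from the dotTHz specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Attribute forms named in the text: numeric value, numeric vector, or string.
Attribute = Union[float, List[float], str]
# Assumed layout: row 0 holds the time axis, row 1 the field amplitude.
Waveform = List[List[float]]


@dataclass
class Measurement:
    """One group: sample and reference datasets plus their metadata."""
    datasets: Dict[str, Waveform] = field(default_factory=dict)
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class DotThzModel:
    """In-memory stand-in for a single file holding several measurements."""
    measurements: Dict[str, Measurement] = field(default_factory=dict)

    def add(self, name: str, sample: Waveform, reference: Waveform,
            **attributes: Attribute) -> None:
        self.measurements[name] = Measurement(
            datasets={"sample": sample, "reference": reference},
            attributes=dict(attributes),
        )

    def select(self, key: str, value: Attribute) -> List[str]:
        """Names of measurements whose attribute `key` equals `value`."""
        return [name for name, m in self.measurements.items()
                if m.attributes.get(key) == value]


# A variable-temperature series of the same pellet, kept in one container.
reference = [[0.0, 0.05, 0.10], [0.0, 1.0, 0.2]]
series = DotThzModel()
for temperature_k in (100.0, 200.0, 300.0):
    sample = [[0.0, 0.05, 0.10], [0.0, 0.7, 0.3]]
    series.add(f"pellet_{int(temperature_k)}K", sample, reference,
               thickness_mm=2.5, temperature_k=temperature_k, mode="TX")

room_temperature = series.select("temperature_k", 300.0)
print(room_temperature)
```

Because the metadata travels with each measurement, a subset such as the room-temperature scan can be selected programmatically rather than by consulting a laboratory notebook.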

Table 1 The dotTHz file datasets and attributes and the minimum requirement for a dataset as defined by this standard

2.2 Example Use Cases

The following representative scenarios illustrate how we envisage the dotTHz file format being used in the terahertz community going forward.

2.2.1 THz-TDS Measurement of Pellet in Transmission

For a typical THz-TDS experiment on a single sample, the file will contain the time-domain waveform of the sample and one reference. The minimum metadata required is the sample thickness. It is expected that the metadata also contains a suitable identifier such as ‘TX’ to denote that the measurement was carried out in transmission.

Optionally, a single dotTHz file can contain the measurements of multiple samples and references or multiple measurements of the same sample and reference under varying conditions, such as a function of time or temperature for dynamic observations, and the conditions can be conveniently stored as additional metadata to facilitate subsequent analysis.
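The minimum content described for a transmission measurement lends itself to an automated check before analysis. The following Python sketch shows one possible form of such a check. The function name, the dataset labels, and the attribute keys "thickness_mm" and "mode" are hypothetical, and the check itself is not part of the dotTHz standard or of the accompanying tools.

```python
from typing import Dict, List

REQUIRED_DATASETS = ("sample", "reference")  # hypothetical dataset labels


def check_transmission_measurement(datasets: Dict[str, list],
                                   attributes: Dict[str, object]) -> List[str]:
    """Return a list of issues; an empty list means the minimum is met."""
    issues = []
    # Both waveforms are needed to extract optical constants.
    for name in REQUIRED_DATASETS:
        if not datasets.get(name):
            issues.append(f"missing {name} waveform")
    # The sample thickness is the minimum required metadata.
    thickness = attributes.get("thickness_mm")
    if not isinstance(thickness, (int, float)) or thickness <= 0:
        issues.append("sample thickness must be a positive number")
    # The geometry identifier is expected rather than strictly required.
    if attributes.get("mode") != "TX":
        issues.append("expected mode identifier 'TX' for transmission")
    return issues


complete = check_transmission_measurement(
    {"sample": [[0.0, 0.05], [0.1, 0.7]],
     "reference": [[0.0, 0.05], [0.2, 1.0]]},
    {"thickness_mm": 2.5, "mode": "TX"},
)
incomplete = check_transmission_measurement(
    {"sample": [[0.0, 0.05], [0.1, 0.7]]},
    {"mode": "TX"},
)
print(complete)
print(incomplete)
```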

2.2.2 THz-TDS Measurement of Thin Film or Layered Structure

For measurements of thin films or multilayered materials, the metadata will contain information about each layer’s thickness, either as individual values within multiple slots or as a single numeric vector within one slot, to facilitate compatibility with subsequent analysis tools.
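An analysis tool that accepts both conventions needs to normalise them into a single list of layer thicknesses. The Python sketch below shows one way to do this under assumed naming: a vector stored under the hypothetical key "thickness_mm", or individual values stored under "thickness_mm_1", "thickness_mm_2", and so on. These key names are illustrative only and are not defined by the format.

```python
from typing import Dict, List


def layer_thicknesses(attributes: Dict[str, object],
                      key: str = "thickness_mm") -> List[float]:
    """Return layer thicknesses in stacking order from either convention."""
    value = attributes.get(key)
    # Convention 1: a single numeric vector within one slot.
    if isinstance(value, (list, tuple)):
        return [float(v) for v in value]
    # A plain number is treated as a single-layer sample.
    if isinstance(value, (int, float)):
        return [float(value)]
    # Convention 2: individual values in numbered slots (assumed naming).
    layers = []
    index = 1
    while f"{key}_{index}" in attributes:
        layers.append(float(attributes[f"{key}_{index}"]))
        index += 1
    return layers


as_vector = layer_thicknesses({"thickness_mm": [0.5, 0.02, 0.5]})
as_slots = layer_thicknesses({"thickness_mm_1": 0.5,
                              "thickness_mm_2": 0.02,
                              "thickness_mm_3": 0.5})
print(as_vector, as_slots)
```

Both calls return the same list, so downstream code can treat single-layer, vector, and slot-based metadata uniformly.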

2.2.3 THz Pump-Probe Measurement

Since terahertz pump-probe measurements require two references, three datasets can be used for each measurement as dataset 1, dataset 2, and dataset 3 for sample, reference, and pumped reference, respectively.

2.2.4 THz Time-Domain Imaging

Terahertz time-domain imaging (THz-TDI) datasets consist of terahertz measurement data, specifically sample, reference, and baseline measurements, along with associated coordinates and timestamps for location-dependent and time-dependent scanning, respectively. The coordinates and timestamps can be stored in the ‘Coordinates’ and ‘Date and Time’ attributes in Table 1, respectively. THz-TDI datasets are typically large because of the raster scanning involved, and their size can be reduced effectively by eliminating redundant data. CaTx offers an option that stores only the coordinate or time attributes that differ from scan to scan.
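The idea of storing only the attributes that differ between scans can be sketched as follows. This is an illustration of the principle and not the CaTx implementation. The function name and the attribute keys are assumptions, and each scan's metadata is represented as a plain Python dictionary.

```python
from typing import Dict, List, Tuple


def split_shared_attributes(
    scans: List[Dict[str, object]],
) -> Tuple[Dict[str, object], List[Dict[str, object]]]:
    """Separate attributes common to every scan from those that differ."""
    if not scans:
        return {}, []
    # An attribute is shared if every scan holds the same value for it.
    shared = {
        key: value
        for key, value in scans[0].items()
        if all(key in scan and scan[key] == value for scan in scans[1:])
    }
    # Each scan keeps only what distinguishes it from the others.
    per_scan = [
        {key: value for key, value in scan.items() if key not in shared}
        for scan in scans
    ]
    return shared, per_scan


# A 2 x 2 raster scan: only the coordinates change from pixel to pixel.
scans = [
    {"mode": "TX", "thickness_mm": 1.0, "coordinates": [x, y]}
    for y in (0.0, 0.5) for x in (0.0, 0.5)
]
shared, per_scan = split_shared_attributes(scans)
print(shared)
print(per_scan)
```

For a raster scan with many pixels, the shared attributes are then written once, and each pixel carries only its own coordinates or timestamp.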

2.2.5 Potential Use Cases With Non-Time-Domain Data

The dotTHz dataset space can be used for any dataset in matrix form, which makes it compatible with non-time-domain data. However, a minimum outline for the dataset allocation will need to be defined for each application domain to maintain consistency and compatibility with subsequent analysis tools. The following are two frequency-domain examples, which can be refined as the corresponding analysis tools are developed.

Vector Network Analyser (VNA) Applications. Four sets of S-parameter datasets can be stored in datasets 1 to 4, and each dataset will contain three rows for frequency, amplitude, and phase vectors.

Frequency-Modulated Continuous Wave (FMCW) Applications. Similar to VNA measurement datasets, frequency, in-phase, and quadrature signals can be grouped as a dataset. At present, up to four datasets can be stored; this limit is due to the display space of the current converter tool and can easily be extended with a minor modification of the tool.
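The three-row layout described for VNA data can be sketched in Python as follows. The function name, the use of linear amplitude and phase in radians, and the example values are assumptions made for illustration, as the text does not prescribe units. An FMCW dataset could be assembled in the same manner with in-phase and quadrature rows in place of amplitude and phase.

```python
import cmath
import math
from typing import List, Sequence


def s_parameter_dataset(freqs_hz: Sequence[float],
                        s_values: Sequence[complex]) -> List[List[float]]:
    """Pack one complex S-parameter trace into three rows.

    Row 0: frequency, row 1: amplitude (linear), row 2: phase (radians).
    """
    if len(freqs_hz) != len(s_values):
        raise ValueError("frequency and S-parameter vectors differ in length")
    return [
        [float(f) for f in freqs_hz],
        [abs(s) for s in s_values],
        [cmath.phase(s) for s in s_values],
    ]


# Hypothetical two-point S21 trace; a full two-port measurement would fill
# datasets 1 to 4 with S11, S21, S12, and S22 in the same way.
freqs = [100e9, 200e9]
s21 = [0.5 + 0.0j, 0.0 + 0.25j]
dataset = s_parameter_dataset(freqs, s21)
print(dataset)
```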

3 Conclusion

The dotTHz project was initiated to reduce the effort involved in terahertz data analysis and, at the same time, to foster collaborations in the terahertz community. We have taken the initiative in designing and introducing CaTx and CaTSper, which aim to standardise the processing and analysis of terahertz data obtained from different terahertz instruments. These tools have been deployed successfully as part of the data analysis routine in our group. We hope that the dotTHz format will facilitate the development of many other advanced data analysis tools within our community, building on the excellent work by many colleagues [12,13,14], and that it will support the establishment of databases and reference datasets as well as standardised testing approaches for novel devices and technologies [15] in the future.

The dotTHz project is an ongoing endeavour, and additional open-source standardised terahertz analysis tools for different applications and data manipulation methods will be developed in the future. We invite researchers from the terahertz community to join and contribute to this development. We also strongly encourage scientists, engineers, and developers to download the tools from the online repository, thoroughly test them, make necessary modifications, and contribute back to enrich the dotTHz project. Through the dotTHz project, we aim to bring the terahertz community closer together, foster collaborations, and facilitate further advancements in the terahertz field. We firmly believe that by standardising and simplifying data analysis and processing, we can attract and encourage more individuals to explore the vast potential of terahertz technology and its numerous applications.