PhysioScripts: An extensible, open source platform for the processing of physiological data
A commonality across research involving physiological measures is the need to process large amounts of data. Such data processing typically involves the use of software tools to achieve several methodological steps, including identifying and correcting artifacts and defining epochs of time for the reduction and analysis of one or more physiological measures. This article describes a new tool to aid in the processing of physiological data: PhysioScripts. Key elements of PhysioScripts include a graphical interface to view and edit the results of processing steps, as well as a flexible framework to automate the creation of uniform or variable length epochs. The software comprises freely available scripts implemented in the R computing environment. Consequently, PhysioScripts can be readily modified to process other data types through the addition of new subroutines that can be plugged into the existing data processing framework. For illustrative purposes, we describe the steps involved in two data processing examples: (1) heart rate variability from the electrocardiogram and (2) respiratory rate derived from a chest strain gauge. The software, accompanying documentation, and an example data set are available online at israelchristie.com/software.
Keywords: Software, R, Open source, Physiological data processing
Numerous disciplines within the psychological sciences and allied fields rely, in large part, upon the processing and analysis of continuous physiological time series as a major source of data. Whether the goal is the description of heart rate reactivity during affective picture processing (Bradley, Codispoti, Cuthbert, & Lang, 2001), the derivation of baroreflex function during stress and the underlying brain systems involved (Gianaros, Onyewuenyi, Sheu, Christie, & Critchley, 2012), or the quantification of heart rate variability as an index of cardiac autonomic control during hot flashes in midlife women (Thurston, Christie, & Matthews, 2012), a commonality across such research goals is the need to ensure that physiological time series are adequately cleaned (i.e., artifacted) if valid inferences are to be made (see, e.g., Berntson & Stowell, 1998). In addition, a typical data processing step is to break the continuous time series into time periods or epochs based on either fixed time points or events of interest. Both of these tasks, artifacting and epoching, can be laborious, particularly when recordings are of longer duration and/or the probability of artifacts is high (e.g., extended ambulatory recordings in human or nonhuman animal studies). Moreover, the analysis of physiological time series is nearly invariably performed with the assistance of software tools, either proprietary packages bundled with the data recording equipment or third-party applications.
The goal of this article is to describe a collection of freely available scripts¹ for processing and editing physiological data, called PhysioScripts, which consists primarily of (1) a graphical interface to facilitate the viewing and artifacting of data; (2) a highly flexible framework to automate the creation of uniform or variable-length epochs based on information provided by the user in simple text files called epoch lists; and (3) modules for specific types of physiological signals that perform either preliminary preprocessing steps [e.g., the conversion of electrocardiogram to interbeat interval (IBI) series] or the derivation of summary measures within epochs [e.g., the estimation of heart rate variability (HRV)]. The software is designed to be easily extended to handle other data types through the addition of new processing subroutines that can be plugged into the existing framework. Two such sets of subroutines, one for the analysis of heart rate and HRV and the other for the analysis of respiration, will be described here for illustrative purposes. The software, accompanying documentation, and an example data set are available online (israelchristie.com/software).
A brief note regarding what distinguishes this software from other options is warranted. While a number of similar software packages and routines are available, encompassing both single-purpose applications and more extensive software suites, the PhysioScripts package is arguably unique in that it is written using, and runs within, the open source and freely available R environment, as opposed to commercial environments such as MATLAB (The MathWorks Inc., Natick, MA) or LabVIEW (National Instruments, Austin, TX). This feature alone eliminates the cost of licensing fees. The readily accessible R syntax, as well as its open source nature, should lend itself to the rapid development of additional modules with which to expand the PhysioScripts collection. Finally, for those who choose to also use R for their data analysis needs, it is difficult to overstate the convenience that comes with using the same statistical package for both data processing and analysis.
The R computing environment (Ihaka & Gentleman, 1996; R Development Core Team, 2010) is a free, open source, cooperatively developed implementation of the S statistical programming language developed at Bell Laboratories (formerly AT&T, now Lucent Technologies; Chambers, 1998). The term “environment” is intentionally used when discussing R because it connotes a fully planned and coherent system, as opposed to a collection of specific and inflexible tools characteristic of other data analysis software. Hence, R is designed around a true computer language and affords users a high degree of extensibility through the creation of user-defined functions and the installation of contributed packages. The base R installation comes equipped with capabilities roughly comparable to, for example, a basic installation of SPSS or SAS, but it can be augmented by the addition of contributed packages, which currently number nearly 3,000. Since its introduction in the mid-1990s, R has rapidly become one of the most widely used platforms for statistical computing and is considered to have broader coverage of statistical methods than any other statistical software (Fox, 2006). Because of its open source nature, usage statistics are difficult to obtain, but recent estimates have placed user numbers in the range of 1–2 million (Vance, 2009), across both business and academic domains. Another benefit is that R is available for a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), as well as for Windows and MacOS. At present, we are unaware of any other integrative set of scripts for R that has been developed for the purposes of physiological data reduction and analysis, as described in this report.
About the PhysioScripts functions
Because PhysioScripts is not presently distributed as a formal package, there is no need to “install it,” per se. Rather, the PhysioScripts functions are distributed within a single binary RData file (e.g., “PhysioScripts.####.RData,” where #### is a unique version identifier). PhysioScripts is started by simply double-clicking the RData file, which initiates a new R session and loads the functions into memory. Most functions contain embedded help and usage instructions, which can be viewed by typing the name of the function into the R console, without parentheses, and pressing Enter. Also, as a general rule, paths to data files should not contain spaces.
Interactive versus batch processing modes
PhysioScripts functions are designed to operate in one of two processing modes: interactive or batch. Interactive mode, the default, involves manual selection of data files via a graphical interface and should be immediately familiar to any user. Batch mode provides a more efficient means of handling a large number of files by performing the processing task on all files contained in a given directory, which is identified either through the graphical interface or by specification of a path during the function call. In addition, the default behavior for PhysioScripts functions (which can be overridden in the function call) is not to overwrite existing files or processing steps. This not only protects prior work, but also makes the batch processing mode more useful, in that a single function call can perform a given processing step on all unprocessed files within a directory.
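The "do not overwrite" default is what makes a single batch call safe to repeat. The idea can be sketched as follows. This is a minimal illustration in Python, not PhysioScripts code (which is written in R); the function name `unprocessed_files`, its arguments, and the choice of "phys" and "ibi" as input and output types are assumptions made for the example.

```python
from pathlib import Path


def unprocessed_files(directory, in_type="phys", out_type="ibi", overwrite=False):
    """Return the input files in `directory` that still need processing.

    Hypothetical helper: a file counts as processed when an output file
    with the same identifier and the output type already exists.
    """
    names = sorted(p.name for p in Path(directory).iterdir() if p.is_file())
    fields = [name.split(".") for name in names]

    # Identifiers for which an output file is already on disk.
    done_ids = {f[0] for f in fields if len(f) == 3 and f[1] == out_type}

    todo = []
    for name, f in zip(names, fields):
        if len(f) != 3 or f[1] != in_type:
            continue  # not an input file of the requested type
        if overwrite or f[0] not in done_ids:
            todo.append(name)
    return todo
```

Calling such a helper twice in a row on the same directory returns an empty list the second time (once the outputs exist), which is the behavior that protects prior work.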
PhysioScripts data and resource files
The default data format used by PhysioScripts is a comma-delimited text file, with variable names (e.g., “time,” “ecg”) in the first row. By default, files are compressed using the gzip format, although uncompressed files can also be employed, and may be more useful if input data files are created by hand. Most modern physiological recording software will allow users to export raw data as ASCII text. The preparation of data files to use with the PhysioScripts functions should be a trivial, although perhaps inefficient, task using a text editor.² An example function³ is included that converts exported ASCII text to the PhysioScripts file format and can be modified to meet the data formatting characteristics of specific recording equipment. Presently, all columns in the input data files are presumed to represent independent channels or physiological signals, each sampled at identical (uniform) sampling rates. A time column should be included in all data files and should be expressed in seconds, with the time point “zero” referenced to either the beginning of the file or midnight on the day of recording onset, the latter being used when recordings of more than 12 h are being processed.
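To make the file layout concrete, the following sketch reads a file of this kind (a header row of variable names, one column per channel, optionally gzip-compressed) and recovers the uniform sampling rate from the time column. It is written in Python for illustration only and is not part of PhysioScripts; `read_physio_csv` and `sampling_rate` are hypothetical names, and the sketch assumes every column is numeric.

```python
import csv
import gzip


def read_physio_csv(path):
    """Read a comma-delimited file (gzip-compressed or plain) into columns.

    Returns a dict mapping each variable name in the header row to a
    list of floats. Assumes all columns are numeric.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        columns = {name: [] for name in reader.fieldnames}
        for row in reader:
            for name in reader.fieldnames:
                columns[name].append(float(row[name]))
    return columns


def sampling_rate(time):
    """Estimate the (uniform) sampling rate in Hz from a time column in seconds."""
    return (len(time) - 1) / (time[-1] - time[0])
```

Because every channel shares one time column, a single rate estimate applies to all columns in the file.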
name, the label used to identify epochs of a given block in the resulting output file;
time, the time of onset for the first epoch in a given block, which can be specified as seconds, clock time using the format “HH:MM:SS,” or date and time using the format “YYYY-MM-DD HH:MM:SS”;
length, the duration, specified in either seconds or the “HH:MM:SS” format, of all epochs in a given block; and
before and after, which specify the number of uniform-sized epochs to be created prior to and after the initial epoch created by the time and length variables, respectively.
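How these variables combine into a block of uniform epochs can be sketched as follows. The Python code is an illustration of the arithmetic only, not the PhysioScripts implementation: the function names, the integer epoch index, and the tuple layout are assumptions, and the “YYYY-MM-DD HH:MM:SS” form of the time variable is deliberately left out.

```python
def to_seconds(value):
    """Convert plain seconds or an "HH:MM:SS" string to seconds."""
    text = str(value).strip()
    if ":" in text:
        hours, minutes, seconds = text.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return float(text)


def expand_block(name, time, length, before=0, after=0):
    """Expand one epoch-list entry into (name, index, start, end) tuples.

    Index 0 is the epoch that begins at `time`; negative indices are the
    `before` epochs and positive indices are the `after` epochs.
    """
    onset = to_seconds(time)
    duration = to_seconds(length)
    epochs = []
    for k in range(-before, after + 1):
        start = onset + k * duration
        epochs.append((name, k, start, start + duration))
    return epochs


# One 60-s baseline epoch at 00:05:00, with one epoch before and two after.
baseline = expand_block("baseline", "00:05:00", 60, before=1, after=2)
```

Under these assumptions a single entry yields before + after + 1 contiguous epochs of equal length, all carrying the same name.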
A final point that should be addressed in reference to data and resource files is the adherence to a uniform naming convention. All files read into or output by the PhysioScripts functions should conform to the three-field format “###.type.ext,” where “###” is a unique file identifier. This identifier is typically a subject number and can include alphanumeric characters, as well as underscores and dashes, but should not include periods or commas. The second field, “type”, identifies the contents of the file with “phys” used for (possibly multichannel) raw input data files. The output data files created by some of the processing steps will create file names with the appropriate type identifiers (e.g., “ibi” for IBI data files, “hrv” for HRV results files, or “resp” for respiration results files). The third field, “ext”, identifies the format of the file and should be “gz” for compressed data files, “csv” for uncompressed data files, and “txt” for info files and epoch lists.
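The three-field convention is simple enough to check mechanically. The sketch below, in Python and purely illustrative, validates a file name against the rules stated above; the regular expression and the name `parse_name` are assumptions, and restricting the type field to letters and digits is a guess rather than a documented rule.

```python
import re

# identifier: letters, digits, underscores, dashes (no periods or commas)
# type:       assumed here to be letters and digits only
# ext:        one of the three extensions named in the text
NAME_PATTERN = re.compile(r"([A-Za-z0-9_-]+)\.([A-Za-z0-9]+)\.(gz|csv|txt)")


def parse_name(filename):
    """Split "###.type.ext" into its three fields, or raise ValueError."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a valid three-field file name: {filename!r}")
    identifier, file_type, ext = match.groups()
    return {"id": identifier, "type": file_type, "ext": ext}
```

A name such as "S-101.phys.gz" passes, whereas "S1.01.phys.gz" is rejected because the extra period produces a fourth field.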
PhysioScripts data viewer
Example data processing: Heart rate variability
Heart rate variability, particularly the portion of variability linked to respiration (respiratory sinus arrhythmia; RSA), has become a widely used index of autonomic control of the heart and has been related to both physical health (Thayer & Lane, 2007) and a variety of psychological phenomena, ranging from attentional capacity (Porges, 1992) and emotion regulation (Calkins & Johnson, 1998) to depression (Rottenberg, 2007) and anxiety (Friedman, 2007). The highly readable review by Allen and colleagues (Allen, Chambers, & Towers, 2007) serves as an excellent introduction to the theory- and measurement-related issues surrounding HRV and also describes, in greater detail than is suitable for presentation here, the band-limited variance method employed by PhysioScripts to obtain HRV estimates.
The following example describes the processing steps in a typical study involving heart rate variability, from the raw electrocardiogram (ECG) data file to the finished HRV data file. All function calls are initiated using the interactive mode (i.e., input data files are selected using a graphical interface), and default arguments for functions are not printed, though they are readily available in both accompanying documents and source code.
The ECG is one of the most ubiquitous biomedical signals in psychophysiological research and serves as the most accurate measure of chronotropic cardiac function. As such, it serves as a basis for many studies involving heart rate and nearly all investigations of HRV. As a first processing step, the ECG is passed through a detection algorithm identifying QRS waves, the electrical signature of ventricular depolarization and the fiducial point for beat detection. The QRS detection algorithm employed by PhysioScripts uses both the amplitude of the digitally filtered ECG waveform and its first derivative, and is based on filtering and detection methods shown to be resistant to sources of noise typically encountered in ECG recordings (Friesen et al., 1990).
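The general idea of combining an amplitude criterion with a first-derivative (slope) criterion can be sketched as follows. This Python example is a deliberately simplified illustration, not the PhysioScripts detector and not the methods evaluated by Friesen et al. (1990): it omits digital filtering entirely, and the function names, thresholds, and refractory period are all assumptions.

```python
def detect_beats(ecg, rate, amp_threshold, slope_threshold, refractory=0.25):
    """Return beat times (s) at local maxima that are both tall and steep.

    ecg: list of samples; rate: sampling rate in Hz.
    slope_threshold is in signal units per second.
    """
    beats = []
    for i in range(1, len(ecg) - 1):
        rising = ecg[i] - ecg[i - 1]      # first difference into the sample
        falling = ecg[i + 1] - ecg[i]     # first difference out of the sample
        is_peak = rising > 0 and falling <= 0
        is_steep = rising * rate >= slope_threshold
        if is_peak and is_steep and ecg[i] >= amp_threshold:
            t = i / rate
            # Ignore candidates that follow the previous beat too closely.
            if not beats or t - beats[-1] >= refractory:
                beats.append(t)
    return beats


def interbeat_intervals(beats):
    """Convert successive beat times to an interbeat interval (IBI) series."""
    return [later - earlier for earlier, later in zip(beats, beats[1:])]
```

Requiring both criteria is what rejects slow, large baseline drifts (tall but not steep) and small sharp noise spikes (steep but not tall) in this toy version.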
IBI extraction and artifacting
The resulting merged data are written to disk in a comma-delimited ASCII text file with variable names in the first row. This output can be easily imported into any spreadsheet program or statistical analysis software.
Example data processing: Respiration
One consideration in the study of RSA, the component of HRV intrinsically linked to respiration, is respiration itself. Specifically at issue is whether within- and between-subjects differences in respiratory parameters may confound the interpretation of RSA. Opinions vary considerably as to the degree and nature of the experimental and/or statistical control of respiration necessary to validly interpret RSA, and the issue remains a point of contention among many methodologists (see Allen et al., 2007, for a review). Respiration data are routinely collected alongside the ECG so that respiratory parameters can be used to confirm that subjects are breathing within the expected frequency range and, possibly, to serve as covariates in subsequent analyses.
The PhysioScripts package includes basic functions to derive several indices of respiratory rate using a custom algorithm. Briefly, the respiratory waveform is bandpass filtered, and local maxima and minima, labeled as inspirations and expirations, respectively, are identified within a specified time window based on the shortest expected respiratory period (i.e., the fastest expected respiratory rate). Unbalanced inspirations and expirations—that is, two inspirations with no intervening expiration, or vice versa—are then corrected by removing the member of the paired values with the lesser absolute magnitude (i.e., the smaller inspiration or the larger expiration).
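The correction of unbalanced inspirations and expirations can be sketched as follows. The Python code illustrates only that balancing rule as described above and is not the PhysioScripts algorithm: bandpass filtering and extremum detection are assumed to have been done already, the tuple layout and function names are hypothetical, and deriving rate from the mean inspiration-to-inspiration period is one plausible index rather than the package's documented output.

```python
def balance_extrema(extrema):
    """Enforce strict alternation of inspirations and expirations.

    extrema: time-ordered list of (time, value, kind) tuples, where kind
    is "insp" (local maximum) or "exp" (local minimum). When two
    consecutive points share a kind, keep the larger inspiration or the
    smaller (deeper) expiration and drop the other.
    """
    balanced = []
    for point in extrema:
        if balanced and balanced[-1][2] == point[2]:
            previous = balanced[-1]
            if point[2] == "insp":
                keep = point if point[1] > previous[1] else previous
            else:
                keep = point if point[1] < previous[1] else previous
            balanced[-1] = keep
        else:
            balanced.append(point)
    return balanced


def breaths_per_minute(balanced):
    """Mean respiratory rate from inspiration-to-inspiration periods."""
    times = [t for t, _, kind in balanced if kind == "insp"]
    if len(times) < 2:
        return None  # not enough breaths to define a period
    mean_period = (times[-1] - times[0]) / (len(times) - 1)
    return 60.0 / mean_period
```

Because each repeated extremum is compared against the point currently kept, runs of three or more same-kind points also collapse to the single most extreme member.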
Limitations and future directions
The functionality of PhysioScripts is presently limited to the cardiorespiratory variables discussed in this report, and when presented with a different data type (e.g., pupillometry), the end user’s needs may be more readily met by other existing software. The software repository maintained by the Society for Psychophysiological Research (sprweb.org/repository) provides a list of available software for working with physiological data of varying types and may aid in the identification of suitable tools, although it should be noted that nearly all of these require MATLAB, and many do not provide cross-platform support (e.g., they run only in Windows). Lacking suitable existing software options, users may be required to develop their own applications, and can, of course, choose among many programming languages. It is our hope that the open source nature of PhysioScripts, particularly the core functionality (e.g., data visualization and file import/export), can facilitate the development of additional modules that will not only meet the user’s needs, but also extend the functionality of the PhysioScripts package. It is in this regard that we view PhysioScripts as an extensible platform for physiological data processing. Furthermore, the expense of using PhysioScripts is effectively nil, so there is no cost impediment to testing the suitability of the software.
In view of the above facts, future directions will include both developing additional modules for other physiological data types and measures (e.g., tonic and phasic electrodermal activity and estimates of baroreflex function from continuous blood pressure data) and further automating the generation of epoch-based summary data by incorporating automated or manual event marks. Notwithstanding these future directions, we believe that this new collection of freely available and uniquely R-coded scripts for processing and editing physiological data provides an efficient and modifiable framework for executing critical data processing and analysis routines for a broad range of time series data.
1. Text-based files containing R code are typically referred to as scripts. In this case, the scripts define the functions that perform data processing or editing tasks. Thus, the terms “scripts” and “functions” are used interchangeably.
2. Microsoft Word is not a text editor. Notepad on a Windows machine or TextEdit on a Mac will serve this purpose. A Web search for “text editor” will reveal numerous free options for editing text files on any computing platform.
3. The function, vernier.to.gz(), converts ASCII data files exported from Vernier physiological recording software (Vernier Software & Technology, Beaverton, OR) to the default PhysioScripts file format.
4. Although comma-delimited text can be read into nearly any spreadsheet software, the automatic formatting performed by most spreadsheet software (e.g., Microsoft Excel) can be problematic for even advanced users. Creating or editing such text files is most easily done in a text editor, hence the use of the “.txt” file extension.
This research was supported by Grant No. R01-HL089850 from the National Institutes of Health. The authors thank Elizabeth Mezick and Kristen Stedenfeld for testing early versions of the software.
- Calkins, S. D., & Johnson, M. C. (1998). Toddler regulation of distress to frustrating events: Temperamental and maternal correlates. Infant Behavior and Development, 21(3), 379–395.
- Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299–314.
- Porges, S. W. (1992). Autonomic regulation and attention. In B. A. Campbell & H. Hayne (Eds.), Attention and information processing in infants and adults: Perspectives from human and animal research (pp. 201–223). Hillsdale, NJ: Lawrence Erlbaum Associates.
- R Development Core Team. (2010). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from www.R-project.org.
- Vance, A. (2009, January 8). R you ready for R? Retrieved May 31, 2011, from http://bits.blogs.nytimes.com/2009/01/08/r-you-ready-for-r/