QUAliFiER: An automated pipeline for quality assessment of gated flow cytometry data
- Cite this article as:
- Finak, G., Jiang, W., Pardo, J. et al. BMC Bioinformatics (2012) 13: 252. doi:10.1186/1471-2105-13-252
Effective quality assessment is an important part of any high-throughput flow cytometry data analysis pipeline, especially when considering the complex designs of the typical flow experiments applied in clinical trials. Technical issues like instrument variation, problematic antibody staining, or reagent lot changes can lead to biases in the extracted cell subpopulation statistics. These biases can manifest themselves in non–obvious ways that can be difficult to detect without leveraging information about the study design or other experimental metadata. Consequently, a systematic and integrated approach to quality assessment of flow cytometry data is necessary to effectively identify technical errors that impact multiple samples over time. Gated cell populations and their statistics must be monitored within the context of the experimental run, assay, and the overall study.
We have developed two new packages, flowWorkspace and QUAliFiER, to construct a pipeline for quality assessment of gated flow cytometry data. flowWorkspace makes manually gated data accessible to BioConductor’s computational flow tools by importing pre–processed and gated data from the widely used manual gating tool, FlowJo (Tree Star Inc, Ashland OR). The QUAliFiER package takes advantage of the manual gates to perform an extensive series of statistical quality assessment checks on the gated cell sub–populations while taking into account the structure of the data and the study design to monitor the consistency of population statistics across staining panels, subjects, aliquots, channels, or other experimental variables. QUAliFiER implements SVG–based interactive visualization methods, allowing investigators to examine quality assessment results across different views of the data, and it has a flexible interface allowing users to tailor quality checks and outlier detection routines to suit their data analysis needs.
We present a pipeline constructed from two new R packages for importing manually gated flow cytometry data and performing flexible and robust quality assessment checks. The pipeline addresses the increasing demand for tools capable of performing quality checks on large flow data sets generated in typical clinical trials. The QUAliFiER tool objectively, efficiently, and reproducibly identifies outlier samples in an automated manner by monitoring cell population statistics from gated or ungated flow data conditioned on experiment–level metadata.
Keywords: Flow cytometry; Quality assessment; BioConductor package
Abbreviations: SVG (scalable vector graphics)
Flow cytometry (FCM) is a high-throughput technology that offers rapid quantification of a set of physical and chemical characteristics for a large number of cells in a sample. The technology is widely used in health research and treatment, including for monitoring of infection, diagnosis of cancers like lymphoma and leukaemia, and auto–immune diseases[1–9]. It is also used for cross-matching organs for transplantation and in research involving stem cells, vaccine development, apoptosis, phagocytosis, and a wide range of cellular properties including phenotype, cytokine expression, and cell-cycle status[10–15]. Importantly, clinical trials in these fields often use flow cytometry to monitor the immune system or the progression of a disease over time, generating large amounts of data in the process.
Variation in instrumentation, antibody staining, reagent lots, and other technical problems can crop up over time and manifest themselves as biases in the extracted cell subpopulation statistics or fluorescence intensities. Such errors are neither obvious nor easy to detect by examining the dot plot outputs from individual FCS files, as is done in the regular, daily quality control procedures of a flow cytometry core. Careful and systematic examination of gated populations over time and in the context of the larger study design, together with follow–up analysis of experimental metadata, is often necessary to identify the problematic samples as well as the underlying cause of the bias (e.g. a reagent change). There is currently a paucity of tools to help investigators effectively and systematically perform quality assessment on large and complex flow cytometry data sets[16–18].
BioConductor provides a suite of open-source tools and software infrastructure to analyze FCM and other high–throughput data[18, 19]. The core of this tool set includes flowCore, flowViz, flowQ, and flowStats, which together provide functionality for basic data manipulation, visualization, automated gating, and some basic quality control[18, 20, 21]. The flowQ package provides high–level quality control procedures for ungated FCM data using statistical approaches to detect disturbances or unusual patterns in the signals of each channel during acquisition. However, the package is restricted to global measures of quality, as it can only handle ungated data and cannot leverage the complex metadata associated with the larger structure of an FCM study (e.g. monitoring the stability of common fluorescence markers in different staining panels of a longitudinal study, or monitoring the variability of gated cell populations across aliquots of a sample).
In order to perform quality assessment of manually gated data, the manual gates and gated data must be accessible to the computational framework for quality assessment. One of the most popular software packages for performing manual FCM gating is FlowJo (Tree Star Inc, Ashland, OR). This tool generates "workspace" files in XML format that define the preprocessing and gating applied to a set of FCS files. Currently, the flowFlowJo package provides some limited support for importing manually gated data into R from workspace files generated by older Windows–only versions of the software (i.e. FlowJo for Windows version 7.x). However, it does not support workspaces generated by newer versions of FlowJo (> version 7), or workspaces generated by FlowJo for Mac OS X. Importantly, flowFlowJo does not correctly handle FlowJo’s specific biexponential data transformation and it is limited to manipulating small data sets that can fit in the available physical memory of the computer. Thus large, real–world FCM data sets generated in clinical studies and data sets analyzed using recent versions of FlowJo or other manual gating tools remain inaccessible to users of BioConductor’s flow tools.
To address these issues, we have developed two new BioConductor packages: flowWorkspace and QUAliFiER (QUality Assessment for Flow ExpeRiments). flowWorkspace makes manually gated data from large, arbitrarily complex FCM studies accessible in the R environment. It imports compensation matrices, data transformations, manual gates, and FCS files from analyses described in FlowJo workspaces (supports workspaces generated by FlowJo for Mac OS X and Windows), and reproduces them using the BioConductor flow toolset, thus making manually gated data accessible to the computational flow community. The tool has methods implemented for visualizing, summarizing, extracting and exporting population statistics for gated cell populations. Importantly, the tool can handle large FCM data sets through support of NetCDF via the ncdfFlow package[23, 24]. flowWorkspace can also be used to export data to the LabKey (Seattle, WA) tool, allowing one to use R as the engine for flow data analysis with a LabKey front end and data repository[16, 17]. The package is closely integrated with other BioConductor flow tools, including normalization via the flowStats package and quality control using QUAliFiER. flowWorkspace makes manually gated flow cytometry data of arbitrarily large size (provided enough disk space is available) accessible for analysis using BioConductor’s flow tools, so that new or automated data analysis strategies can be rapidly compared against current best–practices manual gating methods.
Comparison of Quality Assurance Functions Available in FlowJo, flowQ, and QUAliFiER
Features compared:
- QA Across Multiple Experiments
- QA Ungated Data
- QA Gated Data
- Interactive HTML QA Report
- Use Study Metadata as Grouping Variables for QA
- Customizable QA tasks
- Customizable outlier detection
- Automated for pipelined analysis
In the remainder of the paper we use bold face type to refer to software packages and teletype font to refer to objects, classes, and functions in the packages.
Both packages make use of R’s S4 programming system to define classes and methods, adopting a formal object-oriented paradigm in their implementations. The packages are integrated with the larger flow cytometry package infrastructure available through BioConductor. flowWorkspace is integrated with the BioConductor core flow packages: flowCore for the full range of operations on flow data (compensation, transformation, and gating), ncdfFlow for large data set support, and flowViz for visualization and plotting. The QUAliFiER package takes advantage of manual gates available through flowWorkspace to perform quality assessment of both the gated and ungated FCM data, and produces visualizations of samples flagged as outliers for further investigation through the flowViz package.
flowWorkspace makes use of R’s XML package and the xpath query language to parse and import FlowJo XML workspaces (FlowJo for Mac OS X, versions 7.0 and greater)[26, 27]. The package reads in the list of samples, data transformations, compensation matrices, and gates associated with each sample in a workspace and constructs associated flowCore objects. The package implements two new data structures to represent this information: the GatingHierarchy and the GatingSet. As the name implies, the GatingHierarchy represents the set of hierarchical gates applied to an individual sample. The GatingSet represents a collection of gated samples from the workspace, analogous to grouped samples in FlowJo. However, the design is sufficiently flexible to represent manually gated data coming from any external tool. Each GatingHierarchy is formally a tree data structure associated with a single FCS file, a set of data transformations applied to the channels of the FCS file (these can differ between samples), a compensation matrix (another flow cytometry specific transformation), and a set of gates (boundaries defining distinct cell populations). Each node of the GatingHierarchy tree represents a cell subpopulation in the sample associated with a flowCore gate stored at that node. To save space, flowWorkspace stores only one copy of the data together with a bit mask representing the events in the sample that are included in each gate.
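The storage idea described above (a single copy of the event data plus one mask per gate) can be sketched in base R. This is only an illustration under stated assumptions: the names `events`, `gates`, and `pop_stats` are hypothetical, the gates are simple thresholds on simulated data, and none of this is the flowWorkspace API or its internal representation.

```r
# Toy event matrix: one row per event, one column per channel (simulated data).
set.seed(1)
events <- cbind(FSC = runif(1000, 0, 1000), CD3 = runif(1000, 0, 1000))

# Each node keeps a logical mask over ALL events, so the event data
# itself is stored once and never copied per population.
root   <- rep(TRUE, nrow(events))
lymph  <- root & events[, "FSC"] > 200 & events[, "FSC"] < 800
cd3pos <- lymph & events[, "CD3"] > 500   # child gate: a subset of its parent

# Hypothetical tree: every node records its parent and its mask.
gates <- list(
  root   = list(parent = NA,      mask = root),
  lymph  = list(parent = "root",  mask = lymph),
  cd3pos = list(parent = "lymph", mask = cd3pos)
)

# Event count and proportion-of-parent for one node of the tree.
pop_stats <- function(gates, node) {
  g <- gates[[node]]
  count <- sum(g$mask)
  parent_count <- if (is.na(g$parent)) count else sum(gates[[g$parent]]$mask)
  c(count = count, proportion = count / parent_count)
}

stats <- pop_stats(gates, "cd3pos")
```

Because a child mask is always computed from its parent's mask, the hierarchy is implicit in the masks themselves, and the events belonging to any population can be recovered as `events[gates$cd3pos$mask, ]` without duplicating the data.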
The data import and gating steps are logically separated, allowing the user to import the workspace without necessarily performing the gating of the data. The package implements parallel import of workspaces using the parallel R package, and parallel gating over samples in a workspace using MPI (message passing interface) functionality from the Rmpi package[28–30]. The package is available through BioConductor (http://www.bioconductor.org/packages/2.10/bioc/html/flowWorkspace.html).
The QUAliFiER tool makes use of a local database to store and access extracted cell population statistics from multiple experimental runs (i.e. imported FlowJo workspaces), as well as study metadata and resulting outlier calls, allowing QA tasks to span multiple experiments performed in the course of a larger study. The getQAStats function extracts cell population statistics from each sample and gated population defined in a GatingSet and stores them in the local database. The basic quality assessment functionality is defined by the qaTask class, which is a general container that groups all the information essential to perform a particular quality assessment task. The user pre–defines qaTasks via an external configuration file or directly in an R script that runs the quality assessment procedure, and these are evaluated using the core qaCheck method. This method performs the actual quality assessment for each QA task. Methods for outlier detection and the specific details of each quality assessment task are all contained in the qaTask object; they can be defined by the user or can use any of the pre–defined outlier detection functions or qaTask objects. To visualize the quality assessment results, several plot methods have been implemented, including methods for generating dot plots or density plots of gates across samples, and scatterplots or boxplots of population statistics grouped by user–defined or experimental metadata grouping variables. Finally, qa.report collates the generated qaTasks and generates HTML reports with interactive SVG (scalable vector graphics) plots for all quality assessment tasks. The package is implemented entirely in R and is publicly available on GitHub as well as BioConductor (http://mikejiang.github.com/QUALIFIER/, http://bioconductor.org/packages/2.10/bioc/html/QUALIFIER.html).
Results and Discussion
We present an application of our pipeline to a subset (3000 FCS files) of a large study from the ITN (Immune Tolerance Network) monitoring immunosuppression withdrawal in paediatric recipients of living donor liver transplants.
The QUAliFiER Workflow
The workflow involved in using QUAliFiER is relatively straightforward. It involves importing the data, extracting cell population statistics, defining QA tasks, performing outlier calling, and then generating a quality assessment report. The first three steps are handled by the QUALIFIER function, which essentially combines the different pieces of information necessary to perform QA on a dataset. A more detailed description of these steps follows.
Importing the QA gating template with flowWorkspace
The code for running the following and other examples can be found in the package source at http://mikejiang.github.com/QUALIFIER.
The gating hierarchy for any sample can be inspected via plot (Figure 1B) and the success of the import procedure verified via the concordance of the imported cell counts against FlowJo’s cell counts using plotPopCV (Figure 1C). Slight discrepancies (a few fractions of a percent in the coefficient of variation) are due to FlowJo’s quantization of the data transformation function, which must be interpolated by flowWorkspace. Larger CVs may either indicate errors in the import process or small (containing few cells) populations where differences of two or three cells between the computed and imported counts result in a large coefficient of variation. Individual gates and samples can be visualized with the plotGate function (Figure 1D) to inspect populations flagged with a large coefficient of variation. Importantly, these statistics and plots can be exported (via ExportTSVAnalysis) to the LabKey tool, which provides a web–based front–end for visualizing gated flow cytometry data[16, 17].
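The point about small populations can be made concrete with a short calculation. The sketch below assumes the coefficient of variation is the standard deviation of the two counts divided by their mean; the text does not spell out the exact formula used by plotPopCV, so `count_cv` is a hypothetical helper and the counts are invented for illustration.

```r
# Coefficient of variation between two counts of the same population.
# ASSUMPTION: CV = sd / mean of the pair of counts; the formula used by
# plotPopCV is not given in the text.
count_cv <- function(imported, computed) {
  x <- c(imported, computed)
  sd(x) / mean(x)
}

# The same three-cell discrepancy in a large and in a small population.
cv_large <- count_cv(imported = 50000, computed = 50003)
cv_small <- count_cv(imported = 10,    computed = 13)
```

Under this definition the three-cell difference is negligible for the large population and very large for the small one, which is why a high CV on a sparsely populated gate does not by itself indicate an import error.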
Extracting population statistics
After importing the data from FlowJo, QUAliFiER extracts population statistics from the GatingSet (internally via the getQAStats function), and stores them in a local database. Subsequent quality assessment makes use of this database to rapidly query and manipulate the data. QUAliFiER can apply filters to the population statistics and perform outlier calls based on grouping and conditioning variables defined in the associated study metadata. Each quality assessment task is defined in a qaTask object. The details for all the qaTasks are provided in a qaTask definition file (described below), whereas the study metadata is supplied as an associated comma separated value file. This file associates each FCS data file with study metadata (e.g. subject, date, dose, aliquot, and so forth). The GatingSet, qaTask definition file, study metadata file, and database connection are passed to the qaPreprocess() function, which does the work of extracting and combining the relevant information from each source into a coherent data structure. Importantly, the QUAliFiER package could be used to QA any manually gated data file format supported by flowWorkspace and is not limited to the template gating QA process highlighted here. Additionally, QUAliFiER could be used in a stand-alone fashion to perform QA on a set of extracted cell-population statistics and study metadata. flowWorkspace acts to simplify access to extracted statistics, but is not strictly required for use with QUAliFiER.
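As a rough illustration of how per-sample statistics and a metadata file keyed on FCS file name fit together, the base-R sketch below joins the two tables. It does not call qaPreprocess() or reproduce its internals; the column names (`name`, `Tube`, `RecdDt`), file names, dates, and values are all hypothetical.

```r
# Hypothetical per-sample statistics, as might be extracted from gated data.
stats <- data.frame(
  name       = c("s1.fcs", "s2.fcs", "s3.fcs"),
  population = "WBC_perct",
  proportion = c(0.91, 0.88, 0.42),
  stringsAsFactors = FALSE
)

# Study metadata, normally read from a CSV file on disk; it is inlined
# here through the text= argument so the example is self-contained.
metadata <- read.csv(text = "name,Tube,RecdDt
s1.fcs,CD8/CD25/CD4/CD3/CD62,2007-05-01
s2.fcs,CD8/CD25/CD4/CD3/CD62,2007-05-08
s3.fcs,CD8/CD25/CD4/CD3/CD62,2007-05-15", stringsAsFactors = FALSE)
metadata$RecdDt <- as.Date(metadata$RecdDt)

# One row per sample and population, annotated with its metadata.
annotated <- merge(stats, metadata, by = "name")
```

Once the statistics carry the metadata columns, any of them can serve as a plotting axis or a grouping variable for outlier detection.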
The specific cell population or gate for QA.
The cell population statistic (i.e. counts or proportion) to QA.
The metadata variables for stratification and outlier calling.
How to present the data to the user (i.e. plot type).
The qaTask class is a general container that allows users to define different quality assessment tasks using the information above. The class uses R’s familiar formula interface as a compact and flexible description of the QA task. Briefly, it is generally of the form y ~ x | z, where y is the population statistic to monitor and takes four possible values:
MFI: Median or Mean Fluorescence Intensity of the cell population (the mean or median is user–defined).
proportion: The percent of the parent population represented by the population being QC’d.
count: The number of events in the cell population.
spike: Applicable to each channel of an FCS file measured over the acquisition time. A windowed, cumulative Z–score that quantifies spikes in the MFI of a channel over the acquisition time of the sample. In the absence of spikes, this is approximately zero.
In the right hand side of the formula, x specifies the x–axis variable for plotting. It can be any variable defined in the associated study metadata such as date or sample id. Variables on the right of the vertical bar represent conditioning variables used to stratify the population statistics for outlier detection. These also must appear in the study metadata. Outlier detection is performed within each level of the cross product of the grouping variables. If these are omitted, then outlier detection is performed on the entire set of samples.
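To make the roles of the formula terms concrete, here is a minimal base-R sketch that splits a formula into its y, x, and conditioning variables and then calls outliers separately within each group. It is not QUAliFiER's implementation: `parse_qa_formula` and `flag_by_group` are hypothetical helpers, the data are invented, and the robust Z-score (median and MAD) with a cutoff of 3 is an assumption chosen for the illustration.

```r
# Split a QA formula of the form y ~ x | g1 * g2 into its parts.
parse_qa_formula <- function(f) {
  rhs <- f[[3]]
  has_groups <- is.call(rhs) && identical(rhs[[1]], as.name("|"))
  list(
    y      = all.vars(f[[2]]),
    x      = all.vars(if (has_groups) rhs[[2]] else rhs),
    groups = if (has_groups) all.vars(rhs[[3]]) else character(0)
  )
}

# Flag outliers within each combination of the grouping variables.
# ASSUMPTION: robust Z-score (median / MAD), cutoff chosen for illustration.
flag_by_group <- function(df, parts, z_cut = 3) {
  if (length(parts$groups) > 0) {
    grp <- interaction(df[parts$groups], drop = TRUE)
  } else {
    grp <- factor(rep("all", nrow(df)))   # no conditioning: one global group
  }
  robust_z <- function(v) (v - median(v)) / mad(v)
  df$outlier <- abs(ave(df[[parts$y]], grp, FUN = robust_z)) > z_cut
  df
}

parts <- parse_qa_formula(proportion ~ RecdDt | Tube * Aliquot)

# Invented data: a proportion of 0.40 is unusual in tube A, typical in tube B.
df <- data.frame(
  Tube       = rep(c("A", "B"), each = 6),
  RecdDt     = rep(1:6, 2),
  proportion = c(0.90, 0.91, 0.89, 0.92, 0.90, 0.40,
                 0.40, 0.41, 0.39, 0.42, 0.40, 0.41)
)
flagged <- flag_by_group(df, parse_qa_formula(proportion ~ RecdDt | Tube))
```

In this toy data only the sixth sample of tube A is flagged; the same value in tube B is left alone because it is judged against its own group.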
The cell population to be monitored by the qaTask is passed as a name to the pop argument of the qaTask constructor. All of this information (the formula, population name, plot type, and other details) can be provided for all the qaTasks to be performed on a data set via an external csv file passed to the qaPreprocess function. Internally, the makeQaTask function can read a set of these task definitions from the csv file and construct all the qaTask objects simultaneously. Users may also create individual qaTasks directly via the new method.
Aggregate QA populations
The population name defining a qaTask generally refers to a unique gated cell population, either via the terminal gate name (e.g. "WBC_perct"), or via a unique gating path (e.g. "/MNC/FITC-A MFI") (Figure 1B). QUAliFiER also supports aggregating populations using common portions of gate names (e.g. "MFI" or "margin") (Figure 1B). The tool supports regular expression and substring matching to select multiple, non–unique cell populations for quality assessment. In this way, the population "MFI" selects all five terminal populations matching the string "MFI", which can then be visualized simultaneously in separate plot panels, with each panel representing a different channel, as defined in the formula (see Figure 2). Aggregating multiple cell populations in this way for quality assessment provides further flexibility to tailor the quality checks to the needs of the user. This aggregate approach is used throughout the template gates applied to the sample data set in this paper.
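The selection behaviour described above can be approximated with base-R string matching. In this sketch the gating paths are made up (only three MFI populations rather than the five in the study), `match_pops` is a hypothetical helper, and the rule of trying an exact match before falling back to a regular expression is an assumption, not a description of how QUAliFiER resolves names.

```r
# Hypothetical gating paths, loosely modelled on the names quoted above.
paths <- c("/WBC_perct", "/MNC", "/MNC/FITC-A MFI", "/MNC/PE-A MFI",
           "/MNC/APC-A MFI", "/margin/FSC-A margin")

# Select populations for a QA task: a full path or terminal gate name
# selects one population; otherwise the name is treated as a regular
# expression that may match several populations.
match_pops <- function(paths, pop) {
  exact <- paths == pop | basename(paths) == pop
  if (any(exact)) paths[exact] else grep(pop, paths, value = TRUE)
}

single    <- match_pops(paths, "WBC_perct")        # terminal gate name
by_path   <- match_pops(paths, "/MNC/FITC-A MFI")  # unique gating path
aggregate <- match_pops(paths, "MFI")              # all MFI populations
```

Each element of `aggregate` would then correspond to one plot panel, with the channel recovered from the matched path.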
Outlier Detection and Visualization
Once data is imported and quality assessment tasks are defined, the qaCheck and plot methods perform the quality assessment and visualization based on the definitions stored in each qaTask object.
The actual outlier calls are performed by the qaCheck method. The method reads the population statistics from the database and performs outlier detection within the groups defined in the formula. The qaCheck method can accept a default or user–defined outlier detection function.
Summary of outlier detection methods in the QUAliFiER package
1. % of cells in WBC gate for RBC Lysis
2. % of total events as boundary events
3. Minimum total event count
1. Stability of MFI of a population vs time
2. Consistency of gating (%) of a population
3. High variability groups when measuring between–group variation (i.e. log(IQR) for boxplots)
4. Individual outliers from residuals of robust regression (i.e. in xyplot)
1.5 × IQR
1. Outliers within groups for boxplots
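The 1.5 × IQR rule listed above is the familiar boxplot whisker criterion. A minimal base-R sketch of applying it within groups follows; `iqr_outlier` is a hypothetical helper, the counts and group labels are invented, and the quartiles are computed with R's default `quantile` method, which may differ from what the package uses.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (boxplot whisker rule).
iqr_outlier <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

# Invented event counts for two staining panels.
counts <- c(102, 98, 105, 99, 101, 100, 180, 97)
group  <- rep(c("panel1", "panel2"), each = 4)

# Apply the rule separately within each group, keeping the original order.
flag <- unsplit(lapply(split(counts, group), iqr_outlier), group)
```

Here only the count of 180 in the second panel falls outside its group's fences.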
The qaCheck method will record the outlier calls in the database. Plots can be generated without outlier detection by simply omitting the call to qaCheck. In some applications it may be desirable to simply examine trends rather than make explicit outlier calls (e.g., for monitoring MFI stability over time, Figure 2).
We show an example for monitoring the efficiency of red blood cell lysis (Figure 4) from the ITN data set. Efficiency of lysis is measured as the fraction of total cells collected in the WBC_perct (white blood cell) gate (Figure 1B). The qaTask definition used to monitor this population statistic over time, conditioning on all staining panels (tubes), is:
Description="Sufficient RBC lysis",
plotType="xyplot", qaName="RBCLysis", qaID=1L, db=db)
Level : Tube
Description : Sufficient RBC lysis
Plot type: xyplot
Gated node: WBC_perct
The call to data loads the study data that has already been parsed and combined with metadata and quality assessment tasks as defined in the previous section. When constructing a qaTask via new it is also necessary to supply a unique qaID, and the database (an environment) holding the extracted statistics and metadata (this is initially passed to the qaPreprocess() function, where it is populated).
To perform the outlier detection, the qaCheck function is called on the rbc.lysis task and the results are stored in the database. A call to the plot method will generate the summary plot in Figure 4, passing additional plotting parameters via the par argument.
The plot method is used to generate figures summarizing the outlier detection and quality assessment checks. This function takes the qaTask as an argument, as well as options similar to the lattice package, such as subset, which allows a subset of the levels in the grouping variables to be plotted. For example, samples can be subset based on a range of dates, or the plot of the quality assessment task defined above could be restricted to samples within a single staining panel (Tube) by passing subset=Tube%in%'CD8/CD25/CD4/CD3/CD62' to the plot method. This allows for flexibility in visualizing or analyzing subsets of the data.
Adding robust regression lines to scatterplots
As data accumulates over the course of a study (e.g. a longitudinal study), QUAliFiER stores this data in the QA database, and it becomes trivial to monitor trends in data collected over longer periods of time. As an example, the QA task for monitoring fluorescence stability in the FITC channel over time benefits from the addition of a robust regression line to the output plots within each panel in order to identify groups of samples where there are either non–linear effects or where the MFI is not stable over time (i.e. the slope of the regression is significantly different from zero). The outlier detection task for this procedure is defined in the following way:
"Fluorescence stability vs time",
> plot(MFIvsTime,y=MFI ~ RecdDt|stain
Note the rFunc argument to the qaCheck and plot functions. It allows us to fit a robust linear regression within each group in order to help visualize the changes in MFI over time. Outliers within each level of the stain grouping factor are detected based on the deviation of the residuals from the regression line. By default, outliers are called when the absolute Z–score of the standardized residuals exceeds 3 (Figure 2 and Table 2). If the qaCheck call is omitted, but rFunc is passed to the plot function, the resulting plots will be generated without outlier detection, which may be desirable in some circumstances. Importantly, all the qaTask definitions can be pre–defined in a csv file read in by qaPreprocess(), with column names for each argument to the qaTask constructor.
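The residual-based outlier call can be sketched for a single group as follows. This is an illustration under stated assumptions rather than the package's code: it uses `rlm` from the MASS package (a recommended package shipped with standard R installations) as the robust fitter, which the text above does not confirm as the function passed to rFunc; it standardizes residuals by the robust scale estimate returned by `rlm`; and the MFI series is synthetic.

```r
library(MASS)  # provides rlm(): robust linear regression by M-estimation

# Synthetic MFI series for one group: slow drift plus bounded fluctuation.
day <- 1:30
mfi <- 500 + 0.2 * day + 5 * sin(day)
mfi[17] <- 700                      # one aberrant sample

# ASSUMPTION: rlm stands in for the robust regression function (rFunc).
fit <- rlm(mfi ~ day)

# Standardize residuals by the robust scale estimate and apply the
# absolute Z-score threshold of 3 described in the text.
z <- residuals(fit) / fit$s
outlier <- abs(z) > 3

slope <- unname(coef(fit)["day"])   # drift in MFI per day
```

Because the fit downweights the aberrant sample, its residual stays large and it is the only point flagged, while the estimated slope remains close to the simulated drift. In a grouped analysis the same steps would be repeated within each level of the conditioning variable.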
Creating a Quality Assessment Report
Summary of the Quality Assessment Report for an ITN Clinical Trial Dataset
One of the key advantages of QUAliFiER is that it provides an integrated environment for review of quality assurance data by flow domain experts. In the past, the flow analyst would either spot check and manually review plots within flow gating software tools or have data exported from such tools into spreadsheets for sorting, plotting, and viewing of trends over time or across tubes. Should specific anomalies be found, the analyst would have to shuffle between applications, sort through files to review plots within the flow gating tool and return back to summary statistics or plots of trends for confirmation. The disjointed process was cumbersome.
QUAliFiER takes a lot of this frustration out of the process so domain scientists can focus on the scientific questions of interest. It should be noted that the intended use of QUAliFiER, whether in a research or clinical trial setting, is to have the flow cytometry domain expert always review trends and patterns rather than simply rely on automatic exclusion of flagged samples. There may be instances where a trend is due to administration of therapy or another clinical event of interest. In those instances, having the system within the R/BioConductor framework allows us to easily overlay QA concerns with potentially biological events in an integrated, seamless fashion, further demonstrating the ease and utility of the tool. To our knowledge it is the first tool to integrate this level of extensive quality assessment for large scale gated FCM data in a cohesive pipeline.
Ongoing improvements to the software include complete FlowJo support, as well as FACSDiva (BD Biosciences, San Jose, CA) experiment files, improvements to the HTML report formatting, and generation of PDF output for quality assessment reports. The tool will also be integrated into LabKey (Seattle, WA).
The features and description of the software herein refer to flowWorkspace version 1.2 and QUAliFiER version 1.0.1 found at the BioConductor website (see Availability and Requirements). The development version of flowWorkspace supports Windows and Mac versions of FlowJo, including the latest version (version X, Chimera) which is Gating–ML compliant. Support for BD’s (Franklin Lakes, New Jersey) FACS DiVA is actively being developed and the next release of flowWorkspace will support some of the most frequently used manual gating tools (DiVA and FlowJo reach approximately 50% of users). flowWorkspace data import and gating has also been reimplemented in C++ in the development release, for a 100–fold speed up over the current R–only version of the package.
flowWorkspace is a BioConductor package that allows FCM data that has been preprocessed and manually gated using the FlowJo tool (and other tools in future releases) to be imported in the R statistical computing environment where the BioConductor suite of advanced FCM data analysis tools can be leveraged to further analyze the data. A good example of the utility of flowWorkspace is its integration with the QUAliFiER tool, performing flexible and robust quality assessment of gated and ungated FCM data. Together, the flowWorkspace and QUAliFiER tools address the increasing demand for tools capable of performing QA on large FCM data sets generated in typical clinical trials. flowWorkspace deals with the issue of working with more data than can be loaded into memory at once through its integration with ncdfFlow that stores data in netCDF files on disk. QUAliFiER provides an infrastructure for identifying outliers amongst the large number of samples collected in an experiment or clinical trial while taking into account the structure imposed by the trial metadata. It simplifies and summarizes the data and presents the results in an interactive way.
The QUAliFiER tool automates what has been, for the most part, a manual QA process. Within the ITN, the template gates and subsequent QA are applied manually within FlowJo, the resulting statistics extracted, and plots are generated and visualized by an analyst to identify possible problems. In addition, the use of SAS, Excel, and other graphing tools made the process time consuming and disjointed. QUAliFiER automates this entire process. The ease of use and customizable nature of the analysis output mark the advantage of the QUAliFiER platform over the manual processes. Additionally, QUAliFiER brings all the steps of the QA procedure into one software tool. Importantly, QUAliFiER is not limited to the template gate-based QA process presented here, but can QA any set of manually gated data (either imported via flowWorkspace or otherwise), provided that the data set identifies common cell populations across multiple samples.
QUAliFiER objectively, efficiently, and reproducibly identifies outlier samples in an automated manner by monitoring cell population statistics from FCM data conditioned on study–level and experiment–level metadata for outlier detection. The tool has a flexible interface allowing users to define new QA checks and outlier detection routines that suit their data analysis needs. Importantly, interactive quality assessment reports are generated automatically by the tool to facilitate exploration of the data by domain scientists and help identify the underlying causes of potential QA issues flagged by the tool. QUAliFiER has uses beyond simple quality assessment. It can be used for exploratory data analysis, to look for correlations between gated populations and clinical covariates for biomarker discovery, and has been applied to evaluate datasets for the flowCAP projects (http://flowcap.flowsite.org/).
Availability and requirements
Project name: QUAliFiER
Operating systems: Platform independent
Programming language: R
Other requirements: R, Bioconductor
License: Artistic 2.0
Project name: flowWorkspace
Operating systems: Platform independent
Programming language: R and C++
Other requirements: R, Bioconductor
License: Artistic 2.0
RG, GF, and WJ developed the methodology and designed the software. WJ and GF developed the software and performed the analyses. AA and JP participated in its design and coordination. WJ and GF drafted the manuscript. All authors read and approved the final manuscript.
This research was performed as a project of the Immune Tolerance Network (NIH Contract #N01 AI15416), an international clinical research consortium headquartered at the University of California San Francisco and supported by the National Institute of Allergy and Infectious Diseases and the Juvenile Diabetes Research Foundation. Development of these packages was also funded by NIH grant #R01 EB008400.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.