HAFNI-enabled largescale platform for neuroimaging informatics (HELPNI)

Tremendous efforts have been devoted to the establishment of functional MRI (fMRI) informatics systems that recruit a comprehensive collection of statistical/computational approaches for fMRI data analysis. However, state-of-the-art fMRI informatics systems are designed for specific fMRI sessions or studies in which the data size is not really big, and thus have difficulty handling fMRI 'big data.' Given that the size of fMRI data is growing explosively due to the advancement of neuroimaging technologies, an effective and efficient fMRI informatics system that can process and analyze fMRI big data is much needed. To address this challenge, we introduce our newly developed informatics platform, the 'HAFNI-enabled largescale platform for neuroimaging informatics (HELPNI).' HELPNI implements our recently developed computational framework of sparse representation of whole-brain fMRI signals, called holistic atlases of functional networks and interactions (HAFNI), for fMRI data analysis. HELPNI provides integrated solutions to archive and process large-scale fMRI data automatically and structurally, to extract and visualize meaningful results from raw fMRI data, and to share open-access processed and raw data with collaborators through the web. We tested the proposed HELPNI platform using the publicly available 1000 Functional Connectomes dataset, which includes over 1200 subjects. We identified consistent and meaningful functional brain networks across individuals and populations based on resting state fMRI (rsfMRI) big data. The experimental results demonstrate that, with an efficient sampling module, HELPNI processes and stores large-scale fMRI data and the associated results much faster than other systems.


Introduction
Understanding the organization of brain function has received significant interest since the establishment of neuroscience. During the past two decades, functional magnetic resonance imaging (fMRI), an in vivo neuroimaging technique, has revolutionized the functional mapping of the brain [1-8]. Specifically, task-based fMRI (tfMRI) has been widely used to record functional brain activities during the performance of a specific task and to identify brain regions that are functionally involved in that task [2,4,5]. Meanwhile, resting state fMRI (rsfMRI), which records brain activity while participants are in a task-free state, has received intense interest more recently. The spontaneous signal changes during resting state reflect the coherence of functional brain organization free from the constraint of task performance [1,3-8].
Given the importance of fMRI (including both tfMRI and rsfMRI) data for functional brain mapping, tremendous efforts have been devoted to the establishment of fMRI informatics systems that recruit a comprehensive collection of statistical/computational approaches for fMRI data analysis [9-14]. For example, MEDx, produced in 1993, is one of the earliest tools to incorporate advances in neuroimaging methods [9]. Later, the FSL (FMRIB's Software Library) toolbox was developed to bring more insight to neuroscience analysis tools, and since June 2000 it has helped researchers worldwide apply the FEAT, MELODIC, FABBER, BASIL, and VERBENA tools for fMRI data processing and analysis [10,11]. Moreover, statistical methods have become one of the main means to study brain networks and connectivity. For example, statistical parametric mapping (SPM) is one of the most influential tools, designed for the analysis of brain imaging data sequences from different cohorts or time-series [12]. The analysis of functional neuroimages (AFNI) package is another tool to visualize and statistically analyze fMRI datasets [13]. Furthermore, considering the multidisciplinary nature of fMRI research and the thousands of investigators around the globe, some have dedicated their resources to creating a centralized database that indexes the context and content of the fMRI literature in a searchable fashion. Fox and Lancaster discussed the demands on such a system and proposed BrainMap to address the required applications [14,15]. Although these fMRI informatics systems have achieved significant successes [16,17], a considerable limitation is that all of those state-of-the-art systems are designed for specific fMRI sessions or studies in which the data size is not really big. As a consequence, it is difficult for those systems to preprocess, analyze, and visualize fMRI 'big data' simultaneously.
Fig. 1 (I) The decomposed dictionary components from the fMRI data during one single task and (II) the 14 corresponding reference weight maps obtained by applying the HAFNI method to the whole-brain fMRI signals. This figure visualizes 14 selected dictionary components, which are either motor task-evoked networks (M1-M5) or resting state networks (RSN1-RSN9). The green bars in (I) show 400 dictionary network components (indexed along the x-axis) and the number of spatial non-zero voxels that each component's reference weight map contains (represented by the height of each bar). The panels in (II) visualize the temporal time series (white curve) and spatial distribution map (eight representative volume images) of each network. The red curves represent the task contrast designs of the motor tfMRI data. (Color figure online)
Fig. 2 HELPNI structure and connected components. (a) Web builder through which the web application is built. (b) HELPNI platform big picture. (c) File infrastructure workflow, consisting of pre-archive and archive: data are stored temporarily and, after user inspection and running the required processes, are moved to their permanent destination, where pipeline processes are run. (d) Client application and user transactions. Local and global users connect to the web interface after logging into the system and passing the firewall, using their preferred client application. They are then able to process, share, download, and upload data interactively. (e) Pipeline processing unit(s) that dynamically receive parameters and executables from the pipeline manager and, after running predefined steps, generate a user-friendly report along with the required notifications and then store the results in file storage
With the advancement of neuroimaging technologies, the size of fMRI data is growing explosively. Given the lack of a uniform resource center for fMRI data providers, researchers, and developers, the neuroimaging informatics tools and resources clearinghouse (NITRC) was established in 2006 as a common place to share tools and data and to facilitate finding and computing neuroimaging resources for functional and structural neuroimaging analyses [18]. It was not the first government-funded project to become an international neuroscience resource provider; earlier examples include the neuroscience information framework (NIF) in 2004 [19] and the biomedical informatics research network (BIRN) in 2001 [20]. Nevertheless, NITRC has been successful and popular, hosting one of the biggest fMRI databases, the 1000 functional connectomes (1000FC) resting state fMRI project [https://www.nitrc.org/projects/fcon_1000/]. Moreover, other fMRI big datasets are publicly available to researchers, such as OpenfMRI [21] and the human connectome project (HCP) [22]. HCP is a recent NIH-funded project devised to map the brain's communication network, called the connectome. This project provides a collection of neural data along with an interface to navigate the data graphically. OpenfMRI is a National Science Foundation funded project established in 2010 to provide resources for researchers to upload their own fMRI data and make them publicly available.
In short, the availability of fMRI big data has attracted increasing attention worldwide from researchers in the neuroimaging field, who test various methods and algorithms based on a 'big data' strategy. At the same time, the velocity of studies as well as the variety and volume of neuroimages are growing exponentially, which is among the biggest challenges nowadays [23]. As Van Horn reported [24], the amount of neuroimaging data calculated from articles in representative issues of NeuroImage has increased drastically and is expected to grow exponentially. The average size of raw data per study is expected to be 15 GB in 2015 and 20 GB in 2020. Therefore, effective and efficient fMRI informatics systems that can process and analyze fMRI big data are much needed.
To deal with the abovementioned limitation of previous fMRI informatics systems and to address the need for an effective fMRI informatics system that can process and analyze fMRI big data for research, in this paper we have developed a HAFNI-enabled largescale platform for neuroimaging informatics (HELPNI) (http://bd.hafni.cs.uga.edu/helpni). This system is built on the extensible neuroimaging archive toolkit (XNAT) web application and storage solutions [25], a widely used open source system for storing, managing, and analyzing medical images and related meta-data [26]. Its RESTful application programming interface makes it especially useful for data sharing, since the entire database's contents are reachable programmatically through the web application [26]. Specifically, the proposed HELPNI system implements our latest computational framework of sparse representation of whole-brain fMRI signals, called 'holistic atlases of functional networks and interactions' (HAFNI) [27]. The main idea of HAFNI is to aggregate all of the hundreds of thousands of tfMRI or rsfMRI signals within the whole brain of one subject into a big data matrix, which is subsequently factorized into an over-complete dictionary basis matrix (represented by panel (I) of Fig. 1) and a reference weight matrix (represented by panel (II) of Fig. 1) via an effective online dictionary learning algorithm [28,29]. The time series of each over-complete basis dictionary atom represents the functional BOLD (blood-oxygen-level dependent) activities of a brain network (the white curves in panel (II) of Fig. 1), and its corresponding reference weight vector stands for the spatial map of this brain network (the volume images in panel (II) of Fig. 1).
Fig. 3 An overview of HAFNI implementation through HELPNI and its workflow
The HAFNI framework has been found to be effective and efficient in inferring a comprehensive collection of concurrent functional networks in the whole brain [27]. HELPNI addresses fMRI big data in two respects: the big data matrix of each subject and the high volume of subjects. It does so first by employing the HAFNI framework to handle the big data matrix of each subject, and second by using a database to store large-scale datasets and a scheduling engine to distribute analysis tasks to multiple machines and process multiple subjects simultaneously. HELPNI, as an advanced informatics system, provided us with the resources to analyze large-scale (over 1200) functional connectomes subjects automatically via an automated computational pipeline based on our HAFNI framework, to store the results in an organized data structure, and to generate detailed reports for data analysis (containing registration, online dictionary learning, and identified functional brain network results) that are publicly accessible through our web interface. The HELPNI system significantly expands the previous neuroimaging archive toolkit by adding HAFNI capabilities (that is, it is HAFNI-enabled), while significantly enhancing HAFNI by integrating it with the advanced informatics system. The rest of this paper is organized as follows. Section 2 describes the methods of development and the HAFNI implementation. We also discuss the significance of this system in comparison with previous methods of fMRI analysis. Results are provided in Sect. 3, and the discussion and conclusion are in Sect. 4.

Method
In this section, we first provide a technical overview of the HELPNI system and then discuss the HAFNI implementation details and its workflow in our system. Subsequently, we discuss the 1000FC database that we used as the test bed in this paper.

Overview of HELPNI system
The main purpose of HELPNI is to store and manage large, diverse imaging datasets to facilitate neuroimaging research involving complicated processes and large amounts of data. A notable feature of this platform is its extensibility, through which developers can customize their desired analytical and visualization tools. The platform uses XML schema to generate custom components, modules, and workflows for different tiers. As Fig. 2 elaborates, the standardized workflow helps users to (a) capture imaging/non-imaging data and meta-data (either from neuroimaging machines or from other databases manually); (b) inspect data by means of the pre-archiving feature; (c) analyze data remotely or locally on demand; (d) collaborate more easily using predefined filters (in this way, collaborators can be notified when a related dataset is added to the system); and (e) control access and share data, where datasets and linked results can be shared publicly through the web interface to facilitate collaboration. In the HELPNI system, we implemented our recently developed HAFNI framework for fMRI data analysis using the extensible pipelines.
Fig. 4 The computational pipeline of sparse representation of whole-brain fMRI signals using an online dictionary learning approach. (a) The whole-brain fMRI signals are aggregated into a big data matrix, in which each row represents the whole-brain fMRI BOLD data at one time point and each column contains the time series of one single voxel. (b) The target optimization function of dictionary learning and sparse coding. (c) Illustration of the learned dictionary atoms; each atom represents one functional network component. (d) The coefficient matrix; each row in the matrix measures the weight coefficient of the corresponding dictionary atom over the whole brain. That is, each row defines the contribution of one dictionary atom to the composition of all voxel-wise fMRI signals
A pipeline is a workflow described in an XML document. Parameters can be specified within the XML document or sourced from another XML document. So far we have implemented a few pipelines, each of which contains a different set of scripts for our HAFNI framework. These pipelines can either extract input parameters from subjects automatically or ask users to provide them manually. The pipeline engine is based on a Java framework that parses parameters out of the XML document, links the sequence of activities into a defined process flow, and manages the data flow at each step. It can be configured to send a notification at the desired step(s) for quality control or to allow parameters to be modified, and then to restart the pipeline from where it stopped. We used pipelines to automate the whole process of fMRI data registration and online dictionary learning (ODL) and to reduce the processing time. They also made it possible to process a very large dataset in much less time, as we did with the 1000FC data. Pipelines can take advantage of distributed computing, so that a large number of processes can be completed in much less computation time.
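To make the pipeline idea concrete, the following minimal Python sketch parses a small XML workflow description, runs its steps in order, records a notification where one is requested, and can restart from a named step. The XML layout, the element and attribute names, and the function names are hypothetical and chosen for illustration only; they are not the XNAT pipeline schema, and the actual engine is Java-based.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline document: names and layout are illustrative assumptions.
PIPELINE_XML = """
<pipeline name="hafni_odl">
  <parameters>
    <parameter name="dictionary_size">400</parameter>
    <parameter name="subject_id">sub001</parameter>
  </parameters>
  <steps>
    <step id="preprocess" notify="false"/>
    <step id="register" notify="true"/>
    <step id="odl" notify="false"/>
  </steps>
</pipeline>
"""


def load_pipeline(xml_text):
    """Return (name, parameters, steps) parsed from the XML description."""
    root = ET.fromstring(xml_text.strip())
    params = {p.get("name"): p.text.strip() for p in root.iter("parameter")}
    steps = [(s.get("id"), s.get("notify") == "true") for s in root.iter("step")]
    return root.get("name"), params, steps


def run_pipeline(steps, start_at=None):
    """Run steps in order; if start_at is given, skip everything before it."""
    log = []
    started = start_at is None
    for step_id, notify in steps:
        if not started:
            if step_id != start_at:
                continue
            started = True
        log.append("ran " + step_id)  # a real engine would launch a script here
        if notify:
            log.append("notify after " + step_id)  # quality-control checkpoint
    return log


name, params, steps = load_pipeline(PIPELINE_XML)
full_log = run_pipeline(steps)
restart_log = run_pipeline(steps, start_at="odl")
```

Keeping the step list and parameters in data rather than in code is what allows a user to change an input and resume from the step that stopped, as described above.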
In this work, we used the 1000FC project datasets as the test bed for developing and testing the HELPNI system. The 1000FC project contains over 1200 resting state functional MRI (rsfMRI) images collected from 33 locations. We defined a workflow to obtain the results, as discussed here. Figure 3 shows the implemented pipelines and the workflow of our process, from obtaining fMRI data from NITRC through the data processing steps to the final result reporting. The three main steps of this workflow are (a) data preparation and modification; (b) data process and workflow; and (c) user interface and data access.
Figure caption fragment: Each row represents the networks from one dataset; the last row shows the RSN templates for comparison. Only the most informative slice, which has been overlaid on the MNI152 template, is shown here

Data preparation and modification
In the very first step, users need to prepare data for import into the system. We first obtained data from the 1000FC database and restructured them according to our own predefined structure. After modifying the hierarchy and trimming the data, images with the corresponding meta-data are uploaded to the pre-archive for primary tests and analysis. The data must be laid out in the file system in the required format, including the ID and sequence type as well as any special data type that needs to be defined in the system. To do so, we prepared the required meta-data, including TR value, field strength, gender, and handedness, for each subject and experiment. Then, the data were transferred to the pre-archive, a temporary cache destination, for further tests and quality review (Fig. 2c). The pre-archiving step preserves data integrity and protects the data from loss or corruption. We also tested our workflow to fix any possible flaws in the implemented algorithms. Once the data were ready and the analytical methods were mature enough to be modeled in XML schema, we imported the data into the archive, the final destination, for viewing and/or running standard processes on the prepared data. We used curl to upload fMRI data through the REST API [30] from the command line.
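As an illustration of a command-line upload, the sketch below assembles a curl invocation for one file. It only builds the argument list and does not contact any server. The helper name, the base URL, and the resource path in the example are hypothetical placeholders; the real endpoint layout and any required query options must be taken from the REST API documentation of the archive being used. The curl options shown (-u for the user name, -T to upload a file) are standard curl flags.

```python
from urllib.parse import quote


def build_curl_upload(base_url, resource_path, local_file, user):
    """Return the curl argument list that uploads local_file to the given resource.

    resource_path is supplied by the caller; this helper does not assume any
    particular server-side hierarchy.
    """
    # Percent-encode unsafe characters but keep the path separators.
    url = base_url.rstrip("/") + "/" + quote(resource_path.strip("/"))
    # -u <user> makes curl prompt for the password; -T uploads the file (HTTP PUT).
    return ["curl", "-u", user, "-T", local_file, url]


# Hypothetical example values, for illustration only.
command = build_curl_upload(
    "http://example.org/archive/",
    "data/projects/demo/subjects/sub 001/files/rest.nii.gz",
    "rest.nii.gz",
    "alice",
)
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with unusual file names.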

Data process and workflow
The next step in the HELPNI platform is data processing. Raw fMRI data need to be preprocessed before data analysis, so we implemented rsfMRI and tfMRI preprocessing pipelines in HELPNI. Our preprocessing includes skull removal, motion correction, slice time correction, and spatial smoothing, as well as global drift removal [8]. We used the predefined XNAT tools Build and ArcBuild [26] for image session scan selection and for running processing steps, respectively. The next step is applying the major processing pipeline, for which we integrated our HAFNI computational framework into HELPNI. The basic idea of the HAFNI framework [27] is to aggregate all of the thousands of fMRI signals within the whole brain of one subject into a big data matrix and then decompose it into an over-complete dictionary matrix and a reference coefficient matrix. Specifically, each column of the dictionary matrix represents a typical brain activity pattern, and the corresponding row of the coefficient matrix reveals the spatial distribution of that activity pattern. Typically, the brain signals of each subject form an m × n matrix S, where m is the number of fMRI time points (observations) and n is the number of voxels. To sparsely represent the signal matrix S using D, we aim to learn a meaningful and over-complete dictionary matrix D ∈ R^(m×k) (k > m, k ≪ n), with k being the number of dictionary atoms (i.e., components). The loss function is defined in Eq. (1) with an ℓ1 regularization that yields a sparse solution for α_i. Here α_i is the coefficient vector of the i-th voxel signal (the i-th column of the coefficient matrix α), and λ is a sparsity regularization parameter. To prevent D from taking arbitrarily large values, its columns d_1, d_2, ..., d_k are constrained by Eq. (2).
Briefly, the problem can be rewritten as a matrix factorization problem in Eq. (3), and we adopted the state-of-the-art online dictionary learning algorithm [29] for the sparse representation of the whole-brain fMRI signals.
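The displayed forms of Eqs. (1)-(3) are not reproduced in this text. Assuming the standard formulation of online dictionary learning in [29], which matches the description above, they would read as follows, writing s_i for the i-th column of S, α_i for its coefficient vector, d_j for the j-th column of D, and λ for the sparsity regularization parameter. This is a reconstruction under that assumption rather than a verbatim copy.

```latex
% Eq. (1): per-signal loss with an l1 penalty, averaged over the n voxel signals
\ell(s_i, D) \triangleq \min_{\alpha_i \in \mathbb{R}^{k}}
  \frac{1}{2}\,\lVert s_i - D\alpha_i \rVert_2^2 + \lambda \lVert \alpha_i \rVert_1,
\qquad
f_n(D) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell(s_i, D)
\tag{1}

% Eq. (2): constraint set that bounds the norm of every dictionary column
C \triangleq \left\{ D \in \mathbb{R}^{m \times k} :
  d_j^{\top} d_j \le 1,\ j = 1, \dots, k \right\}
\tag{2}

% Eq. (3): equivalent matrix factorization problem over D and alpha
\min_{D \in C,\ \alpha \in \mathbb{R}^{k \times n}}
  \frac{1}{2}\,\lVert S - D\alpha \rVert_F^2 + \lambda \lVert \alpha \rVert_{1,1}
\tag{3}
```

In Eq. (3) the Frobenius norm sums the squared residuals of all voxel signals and the (1,1) norm sums the ℓ1 norms of all columns α_i, so minimizing Eq. (3) is the same as minimizing the sum of the per-signal losses in Eq. (1) under the constraint of Eq. (2).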
Once we obtained the learned dictionary matrix D and coefficient matrix α, we mapped each row of α back to the brain volume and examined its spatial distribution pattern, through which functional network components are characterized on brain volumes [27]. At the conceptual level, the sparse representation framework in Fig. 4 achieves both a compact, high-fidelity representation of the whole-brain fMRI signals (Fig. 4c) and effective extraction of meaningful patterns (Fig. 4d) [28,29,31-34]. For more details, please refer to our recent report [27].
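To illustrate how an ℓ1 penalty produces sparse coefficients, the sketch below solves the sparse coding step for a single signal and a fixed dictionary using the iterative shrinkage-thresholding algorithm (ISTA). This is a simple stand-in chosen for readability; it is not the online dictionary learning algorithm of [29], it does not update the dictionary, and the function names and the toy data are assumptions made for the example.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly zero."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_code(signal, atoms, lam, step, n_iter=500):
    """ISTA for min_a 0.5 * ||s - D a||^2 + lam * ||a||_1 with a fixed dictionary.

    signal: list of m values (one voxel time series).
    atoms:  list of k dictionary atoms, each a list of m values (columns of D).
    step:   step size, at most 1 / (largest eigenvalue of D^T D).
    """
    m, k = len(signal), len(atoms)
    alpha = [0.0] * k
    for _ in range(n_iter):
        # Current reconstruction D @ alpha and residual s - D @ alpha.
        recon = [sum(atoms[j][t] * alpha[j] for j in range(k)) for t in range(m)]
        resid = [signal[t] - recon[t] for t in range(m)]
        # Correlation of each atom with the residual (negative gradient).
        corr = [sum(atoms[j][t] * resid[t] for t in range(m)) for j in range(k)]
        # Gradient step followed by shrinkage, applied to all coefficients at once.
        alpha = [soft_threshold(alpha[j] + step * corr[j], step * lam) for j in range(k)]
    return alpha


# Toy over-complete dictionary: three unit-norm atoms in two dimensions.
toy_atoms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_signal = [3.0, 4.0]  # a multiple of the third atom
toy_alpha = sparse_code(toy_signal, toy_atoms, lam=1.0, step=0.5)
```

In this toy case the signal is a multiple of the third atom, so the penalty drives the first two coefficients to zero and keeps a single slightly shrunken weight on the third atom. In the framework described above, each row of the full coefficient matrix plays this role across all voxels and is then mapped back onto the brain volume.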
The system is designed to feed the preprocessing output into the online dictionary learning pipeline, either automatically or manually after filtering the preprocessed data. For visualization purposes and to make the generated results easy to explore, both the preprocessing and ODL pipelines generate a PDF report at the end, which is automatically uploaded to the web interface. These reports contain the results generated by the executed pipelines and are identified by the experiment ID appended to the pipeline name. For example, the ODL report contains 400 png files sorted sequentially. Pipelines can also be set to send notifications at different steps of the workflow. For example, a user can be notified when a specific step is done in order to evaluate the result and, if it meets the quality requirements, let the pipeline continue. Otherwise, the user can modify the input variables and restart the pipeline. At the end of the workflow, assigned users are notified of a successful run.

User interface and data access
Large-scale fMRI data usually need group-wise analysis, and collaborators need to work together. In HELPNI, users can connect to the system remotely and choose their desired subset of the archive through the bundle feature of the system. Users are also able to email other collaborators a link to the selected subset of the archive.
The standard user interface features useful tools, including a search box for searching through all archived subjects and sessions, and menus that users can access according to their permissions. Users need to log in to the system to modify or upload new data, but viewing and downloading the 1000FC data as well as the preprocessing and ODL results are publicly available (http://bd.hafni.cs.uga.edu/helpni). Users can browse experiments and data via three methods: by selecting a project and then a subject; by searching for a subject name in the search box; or by selecting a listing, where the user enters information about the project/subject or image modality and then queries a list of the corresponding filtered data.

Results
We tested the proposed HELPNI platform by applying the implemented HAFNI computational framework to one of the largest open source resting state fMRI (rsfMRI) databases, the 1000 Functional Connectomes project (known as 1000FC) [6]. This database has gathered more than 1200 rsfMRI datasets, independently collected from all over the world and containing over 130 GB of data. Table 1 (see http://fcon-1000.projects.nitrc.org) summarizes the rsfMRI datasets. Age, sex, and imaging center information are provided for each dataset, and all subjects have been uploaded to HELPNI. As detailed in Sect. 2, HELPNI [...]
Table 3 Spatial overlap between identified individual RSNs and templates in different datasets

Group-wise consistent functional brain networks identification using HELPNI
With the help of the HELPNI system and the implemented HAFNI computational framework, we successfully identified 10 meaningful and consistent resting state networks (RSNs) across all individuals and datasets in the 1000FC database, in agreement with previous studies. Figure 5 shows the identified 10 group-wise consistent networks in five randomly selected datasets (Baltimore, Beijing, Berlin, Cambridge, and Cleveland) in 1000FC. Networks #1, #2, and #3 are all located in visual areas and are closely related to visual behavior. Network #4 includes the ventromedial frontal cortex, bilateral inferior-lateral parietal areas, and medial parietal areas, and is often referred to as the default mode network (DMN). Network #5 covers the cerebellum and corresponds to action-execution function. Networks #6, #7, and #8 are related to sensorimotor, auditory, and executive control function, respectively. Networks #9 and #10 cover several frontoparietal areas and are closely related to cognition/language paradigms [35]. Figure 6 illustrates the identified 10 consistent networks in five randomly selected individual subjects from the same five datasets.
We can see that the identified 10 functional networks are quite consistent across different datasets and subjects, and consistent with the templates from previous studies [35]. Quantitatively, we calculated the spatial overlap between the identified networks and the templates, as detailed in Tables 2 and 3. The spatial overlap is calculated as the percentage of the overlapping area between our identified networks and the templates [27]. Based on these results, we can see that our HELPNI system is effective and efficient in reconstructing meaningful functional brain networks from rsfMRI data.
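A spatial overlap of this kind can be computed from two binary voxel masks, as in the minimal sketch below. The text states only that the overlap is the percentage of the overlapping area, so normalizing by the number of template voxels is an assumption made here, and the function name and the flattened-list mask representation are illustrative.

```python
def spatial_overlap(network_mask, template_mask):
    """Fraction of template voxels that are also active in the identified network.

    Both masks are flattened sequences of 0/1 values over the same voxel grid.
    Normalizing by the template size is an assumption of this sketch.
    """
    if len(network_mask) != len(template_mask):
        raise ValueError("masks must cover the same voxels")
    template_count = sum(1 for t in template_mask if t)
    if template_count == 0:
        raise ValueError("template mask is empty")
    shared = sum(1 for a, t in zip(network_mask, template_mask) if a and t)
    return shared / template_count


# Toy masks over five voxels: two of the three template voxels are recovered.
overlap = spatial_overlap([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Multiplying the returned fraction by 100 gives the percentage form used in the tables.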

Integrating sampling module in HELPNI
One important characteristic of our HELPNI system is its plug-and-play capability. Since the implemented pipelines are modular, we can easily develop and test new modules to enhance the established computational framework. For example, to speed up the current HAFNI framework in the HELPNI system, we developed and integrated an efficient signal sampling module [36] that improves the computation speed while obtaining comparable results. The average computation time for training a dictionary for one individual brain is about 30 s with the sampling module, whereas it is 340 s without sampling; the sampling module thus speeds up the HAFNI training procedure by more than 10 times. At the same time, the returned results identify similar consistent and meaningful functional brain networks across datasets and individuals, as discussed in Sect. 3.1. Figure 7 shows the same 10 group-wise consistent networks identified with the sampling module in the same five datasets (Baltimore, Beijing, Berlin, Cambridge, and Cleveland) in 1000FC. Figure 8 illustrates the 10 consistent networks identified in the same five individual subjects as in Sect. 3.1. As with the original HAFNI computational framework without the sampling module, the identified 10 functional networks are quite consistent with each other across different datasets and populations and consistent with the templates from previous studies [35]. Quantitatively, we calculated the spatial overlap between the identified networks and the templates, as detailed in Tables 4 and 5. From these results, we can see that integrating the sampling module into the HAFNI framework via the HELPNI system significantly decreased the computing time while achieving comparable results for functional brain network identification. It also demonstrates the plug-and-play capability of the HELPNI system for effectively detecting meaningful functional brain networks from raw neuroimaging data.
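The idea of training on fewer signals can be sketched as follows. Uniform random selection of voxel columns is used here purely as a stand-in, since the sampling strategy of the module in [36] is not described in this text; the function name, the list-of-rows matrix layout, and the sampling fraction are assumptions of the example. The last line reproduces the speedup implied by the reported timings.

```python
import random


def sample_voxel_columns(signal_matrix, fraction, seed=0):
    """Keep a random subset of voxel columns from an m x n matrix (list of rows).

    Returns the reduced matrix and the sorted indices of the kept columns, so
    that results can later be related back to the original voxels.
    """
    n = len(signal_matrix[0])
    keep = max(1, round(n * fraction))
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    cols = sorted(rng.sample(range(n), keep))
    reduced = [[row[c] for c in cols] for row in signal_matrix]
    return reduced, cols


# Toy matrix with 2 time points and 10 voxels.
toy_matrix = [list(range(10)), list(range(10, 20))]
reduced, kept = sample_voxel_columns(toy_matrix, fraction=0.3, seed=1)

# Speedup from the reported averages: 340 s without sampling, 30 s with it.
speedup = 340 / 30
```

The ratio 340/30 is about 11.3, consistent with the statement that sampling accelerates dictionary training by more than 10 times.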

Discussion and conclusion
In this work, we have designed and developed a neuroimaging informatics platform, HELPNI, to archive large-scale fMRI datasets, to automate sequences of complex processes for fMRI data analysis, and finally to use distributed and parallel computing resources to reduce big data analysis time. HELPNI leverages the extensible neuroimaging archive toolkit to power the web application and storage part of the system and is composed of three main parts: web application and storage, the pipeline analysis framework, and the big data analytic tools. This novel platform integrates our recently developed HAFNI computational framework for fMRI data analysis in an accelerated way. As demonstrated in this work, we used the open access 1000 functional connectomes datasets as a basic example to import over 1200 rsfMRI datasets into the HELPNI system, to run the HAFNI framework on the rsfMRI data, and to identify consistent and meaningful functional brain networks across individuals and populations. Our experimental results demonstrated that an efficient sampling module can be implemented together with the HAFNI framework to speed up the dictionary learning and the identification of meaningful functional brain networks. The HELPNI platform is publicly accessible at http://bd.hafni.cs.uga.edu/helpni, where users can view all of the archived fMRI data as well as the processed results. Authorized users can also upload new data and run pipelines over their desired fMRI images.
Considering the characteristics explained in Sect. 2 as well as the task scheduling feature of HELPNI (Fig. 3e), in which tasks can be run in a distributed or parallel fashion, HELPNI, with its plug-and-play capability and modularity, can significantly speed up fMRI data processing. Users can easily feed their workflow to HELPNI, which will schedule, distribute, and run all tasks using all available resources and notify users of the final results. We are also implementing big data analytic tools based on Hadoop and Spark to strengthen the processing part. A parallel optimization procedure has shown significant improvement in the computation time of sparse dictionary learning [37].
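The scheduling idea of running independent per-subject tasks concurrently can be sketched with Python's standard thread pool. This is only an analogy on a single machine: the scheduling engine described above distributes tasks across multiple machines, and the function names, the placeholder task body, and the worker count are assumptions of the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_subject(subject_id):
    """Placeholder for the per-subject chain (preprocessing, then dictionary learning)."""
    return subject_id, "done"


def run_all(subject_ids, max_workers=4):
    """Run the per-subject task for every subject, keeping the workers busy."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map hands out subjects to free workers and preserves the input order.
        return dict(pool.map(process_subject, subject_ids))


status = run_all(["sub001", "sub002", "sub003"])
```

Because each subject is processed independently, a worker that finishes one subject can start the next immediately, which is how idle time between processes is avoided.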
Large-scale datasets can be imported into the HELPNI system, and various computational pipelines and analyses can then be run over the big data without corrupting the original archived images. For example, in this paper we ran the HAFNI pipeline over all subjects in the 1000FC project, and users can examine the results in a well-structured report in addition to the original image data. We also ran the sampling pipeline on a subset of the dataset and stored the results in the same fashion. In this way, users can evaluate and compare the results with and without sampling side by side. The HELPNI system saved much computing time because, with the task scheduling feature, there was no idle time between processes. In the future, we plan to use distributed scheduling and big data analytics tools to save more time by means of the distributed system available at the University of Georgia. This will enable the fMRI community to use the HELPNI system, integrated with other analytical tools, on large-scale fMRI datasets and to collaborate with other laboratories and research centers.
Our planned improvements to HELPNI include automatically classifying the stored images based on the analysis results, fully implementing the parallel algorithm for HAFNI, and improving the current user interface. Future applications of HELPNI include testing other big datasets such as HCP and OpenfMRI, implementing new modules such as population clustering of learned HAFNI dictionary spatial maps, and eventually discovering disease-specific biomarkers.