Machine learning-driven automatic storage space recommendation for object-based cloud storage system

An object-based cloud storage system is a storage platform where big data is managed through the internet and data is treated as an object. A smart storage system should be able to handle the big data variety property by recommending the storage space for each data type automatically. Machine learning can help make a storage system automatic. This article proposes a classification engine framework for this purpose by utilizing a machine learning strategy. A feature selection approach wrapped with a classifier is proposed to automatically predict the proper storage space for the incoming big data. It helps build an automatic storage space recommendation system for an object-based cloud storage platform. To find a suitable combination of feature selection algorithms and classifiers for the proposed classification engine, a comparative study of different supervised feature selection algorithms (i.e., Fisher score, F-score, Lll21) from three categories (similarity, statistical, sparse learning) combined with various classifiers (i.e., SVM, K-NN, neural network) is performed. We illustrate our study using the RSoS system, which provides a cloud storage platform for healthcare data, used here as the experimental big data because of its variety property. The experiments confirm that Lll21 feature selection combined with the K-NN classifier provides better performance than the other combinations.


Introduction
Nowadays, the application areas of a cloud storage system are not limited to digital data deposits. Many machine learning approaches [2,12,16] are associated with renowned cloud storage systems to make them computationally advanced. Storage location prediction is one of the crucial applications of cloud storage systems. In this article, we have utilized health-related data to illustrate this task.
Health data comes from various sources, like sensor providers, hospital managers, account managers, patients' relatives, etc., and the structure of the generated data varies with the source. Eventually, more information is added to the existing one. With time, these sources may need to store additional information on the currently stored objects [11]. Therefore, the data follow an unstable structure format and move towards an unstable structure schema. All of this makes the storage space prediction job more complex.
Different cloud storage systems are available in the literature to support health data. These include Amazon S3 [28], which provides a bucket-oriented object storage architecture, OpenStack Swift [27], which supports an account-container based object storage architecture, and the object based schema oriented data storage system (RSoS system) [24], which follows a schema-based, account-container oriented object storage architecture. Some time-series databases have also been used to handle health sensor datasets. For example, PhilDB [20] uses a metadata tracking architecture to identify every time-series data type, and SciDB [38] follows a native array data model to handle time-series datasets. The challenge is that health data is not only composed of time-series data or big chunks of data files; it is a combination of different types of datasets (time-series, graph-based, file, etc.). Among all these database model variants, only the schema-based architecture can provide a variant storage location for the different health data formats. In this way, the characteristics of this schema-based storage architecture not only support the big data variety property but also reduce the database operation execution time, as shown in [24]. For this reason, we have considered the RSoS system as the prototype design in this article.
RSoS may use different database models for different types of data to be stored. It is required to separate the storage space for different types of data. Identifying the types of data manually takes time and constant human intervention. Therefore, it is required to automate the process of identifying the type of the data before sending it to its designated storage space. Machine learning can help us do this by predicting the storage space corresponding to each type of incoming data automatically.
Hybrid technologies are used in the healthcare domain to deal with different dataset formats, like the Wiki-Health platform [18], the PolyglotHIS framework [15] and the RSoS storage system [24]. All of them use more than one database model under a single storage layer. Cassandra [4], MongoDB [25] and HDFS [3] are widely used database models for big data. They differ in data type, storage capacity, storage architecture and data storage process.
The RSoS system is a schema oriented, object-based cloud storage system. Assigning the object storage space requires sufficient database knowledge. In most cases, users do not have enough knowledge or time to define this object storage space. Also, if the user makes any mistake in defining the contents of the object storage space, the data will not be stored in the proper place. To remove such complexity, we move towards designing an automated system that identifies the corresponding storage space of the incoming dataset. Machine learning is one way to predict the storage space for such an automated system and make it convenient for any user.
In articles [23,24], the authors discussed the detailed architecture, working procedure and performance comparisons of the object based schema oriented data storage system (RSoS system). To communicate with the RSoS system, users use three queries in JSON format, viz., READ, WRITE and DELETE. Before storing any data in RSoS, a user needs to provide the related information, like account, attribute name, attribute status (key, non-key) and storage location (database name). The hypergraph data model maintains this information in a graphml.xml file, as presented in Listing 1. According to this listing, the account values are U101, U102, U103 and U104; the attribute names are Time, Temperature, PatientID, PatientDetails, fileextension, remotefiledata and id. The storage devices are named after the database model used in each device. Here, three database models are used, viz., Cassandra, MongoDB, and HadoopDFS, and the corresponding storage devices carry the same names. These storage locations are accessed using unique, individual URIs, as mentioned in Listing 1.
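For illustration, the kind of mapping that the hypergraph data model keeps in graphml.xml can be sketched in Python as below; the attribute-to-device assignments and the URIs are hypothetical placeholders, not the exact contents of Listing 1.

# Hypothetical sketch of the account/attribute-to-storage-device mapping maintained by the
# hypergraph data model (graphml.xml); device URIs and attribute assignments are placeholders.
hypergraph_model = {
    "accounts": ["U101", "U102", "U103", "U104"],
    "storage_devices": {
        # device name -> URI of the RESTful endpoint wrapping that database model
        "Cassandra": "http://cassandra-node.example.org:8080/",
        "MongoDB":   "http://mongodb-node.example.org:8080/",
        "HadoopDFS": "http://hadoopdfs-node.example.org:8080/",
    },
    "attributes": {
        # attribute name -> (status, assumed storage device holding it)
        "Time":           ("key",    "Cassandra"),
        "PatientID":      ("key",    "Cassandra"),
        "Temperature":    ("nonkey", "Cassandra"),
        "PatientDetails": ("nonkey", "MongoDB"),
        "id":             ("key",    "HadoopDFS"),
        "remotefiledata": ("nonkey", "HadoopDFS"),
        "fileextension":  ("nonkey", "HadoopDFS"),
    },
}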
This hypergraph data model, i.e., graphml.xml, must be updated on every small change in the data schema and on every new dataset entry. This task is done manually, and it is a laborious and time-consuming job for human beings. Hence, in this article, our primary goal is to design a classification engine for the RSoS system that can reduce human effort by predicting the storage space automatically.
Here, the purpose of the classification engine is different from the existing ones. This motivates us to build a new classification engine framework for the RSoS system. The RSoS system is a cloud storage system where data arrives in its own query format. Because every query for each type of dataset has a unique structure, the dataset generated from each and every query acts as the input to the designed classification engine. This generated dataset holds the status values of the system sub-parts of the server corresponding to the query. The input data looks like a two-dimensional matrix, M = [m_ij] ∈ R^{m×n}, where m and n are the number of queries and the number of system sub-parts treated as features, respectively. The training dataset contains one more column, named class, which holds the data type value of the query corresponding to the input data. The test dataset includes only the feature values of the input data.
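A small sketch of what this input looks like is given below; the feature names and the random values are placeholders standing in for the monitored system sub-part readings.

import numpy as np
import pandas as pd

# m queries observed so far, each described by n monitored system sub-parts (features).
m, n = 6, 18
feature_names = [f"feature_{j}" for j in range(n)]   # e.g. load, memory, I/O, CPU metrics

# M = [m_ij] in R^{m x n}: one row per query, one column per monitored sub-part.
M = np.random.rand(m, n)

# Training data additionally carries a 'class' column with the data type of each query.
train = pd.DataFrame(M, columns=feature_names)
train["class"] = ["sensor", "file", "document", "sensor", "file", "document"]

# Test data holds only the feature values; the class is what the engine must predict.
test = pd.DataFrame(np.random.rand(2, n), columns=feature_names)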
A comparative study of feature selection approaches wrapped with classifiers is conducted to find suitable and efficient machine learning technologies for this framework. First, the feature selection algorithms applied to the training dataset are responsible for selecting the relevant features. Next, the trained classifier determines the class value of the test dataset by considering the selected features. For comparison purposes, three feature selection algorithms, viz., Lll21, Fisher score and F-score, and three classifiers, viz., K-NN, neural network and SVM with three kernel functions (linear, polynomial and radial basis function (RBF)), are considered here.
The ultimate aim of these machine learning technologies is to devise a feature selection algorithm along with a classifier to find the proper object storage space for the incoming data presented in the write query. The experiments in this article are limited to predicting the storage device URI as the part of the object storage space; the other part, viz., the attribute list, is constrained by the supporting storage device.
The major contributions of this article are listed below.
- A classification engine framework is proposed to predict the probable object storage space architecture from input data characteristics.
- The detailed architecture of the RSoS system associated with the proposed classification engine framework is presented.
- The brief workflow model of the classification engine framework is described, along with a detailed description of the involved sub-units and their internal communication.
- A comparative analysis is performed to find out the proper combination of feature selection algorithm and classifier for the designed classification engine framework.
In the next section, we describe the existing classification engines. After that, the overall information of the RSoS system is explained. The following section describes the proposed classification engine framework with its workflow model. The main focus of the subsequent section is to determine the components of the proposed classification engine, i.e., a feature selection algorithm and a classifier; an experimental comparative study is made there. Finally, the last section concludes the article with a discussion on future work.

Related work
In this section, we discuss some well-known classification engines and point out how they perform their tasks based on the associated application's demands.
Development of classification engines as machine learning technology for cloud storage systems can broadly be categorized into two research areas, viz., the services provided by existing classification engines and the machine learning approaches associated with different cloud storage systems.
The main aim of the Varonis classification engine [40] is to classify, manage and protect sensitive datasets from cyber-attacks. For this purpose, it monitors user behaviour and decides who will have access to which type of data. Moreover, it helps prevent unwanted data access. ACE (autonomous classification engine) provides a framework in which, by giving feature vectors, the user can optimize the classifier and its parameters and reconfigure the problem [22]. Using this framework, the authors in [35] classify beatboxing sounds and find that an AdaBoost classifier with C4.5 decision trees provides more accurate results than the others. Here, a genetic algorithm is used for feature vector generation.
The advanced classification engine (ACE) is designed and maintained by the Websense security lab [security overview websense ACE (advanced classification engine)]. It provides real-time Websense gateway and cloud security services. It works by classifying web page contents, URLs and protocols inline based on the presence or absence of hidden malicious messages in page data contents [8,42]. PSIGEN released the ACE (accelerated classification engine) [29], which can perform custom classification of documents based on user requirements.
ACE is also a framework, consisting of three classification engines, viz., a data classification engine, a storage classification engine and a data placement engine, for ILM (information lifecycle management) [34]. Here, the storage location of a data item is shifted based on the business value of the data. The authors in [10] proposed a classification engine that identifies the social network platform from which an image originated. This engine is built using K-nearest neighbor (K-NN) classification and a decision tree mechanism.
An integrated classification engine has been proposed by Veritas in [41]. Veritas is dedicated to multi-cloud data management, where the integrated classification engine is used to meet data protection requirements by classifying risky and sensitive data worldwide.
The Google Cloud machine learning engine is a computing platform that provides training and prediction services [2]. Using this platform, developers can build advanced, complex machine learning models. One can use the Google Cloud machine learning engine with the Google Cloud storage system for data processing purposes. Here, it auto-scales the number of server clusters to generate the result within the access time limit.
Amazon machine learning provides services to developers for performing predictive analysis by building machine learning models [1,12]. For this purpose, through an AWS account, a user can select the Amazon machine learning standard setup, and the necessary data can be selected from Amazon S3 and Redshift. After performing predictive analysis on the selected dataset, Amazon machine learning generates predictive or machine learning models.
Microsoft provides easy data access to clients without hard-coded connection information (like subscription ID and token authorization) using Azure machine learning datastores [6,16]. When a client registers Azure storage solutions as a datastore, Azure machine learning is responsible for the creation and registration of the necessary datastores. The datastore stores all the related connection information of the corresponding storage account.
Table 1 compares some popular smart storage solutions which use machine learning approaches to make them intelligent. Archivist [30] is a mechanism to select the storage device unit from solid-state drives (SSD), conventional hard disk drives (HDD) and non-volatile RAM (NVM) for file placement. The Adaptive Resource Management (ARM) system developed in [26] is another machine learning based technology which enables an object based storage cluster to be a self-managing and self-adaptive system. On the other hand, AIOps (artificial intelligence for IT operations) platforms, as defined by Gartner [9,17], combine big data and machine learning. AIOps can be used to detect the failure parameters of the cloud object storage service at IBM (IBM COS [13]). The smart object storage system (SOSS) [45] is a machine-level storage system which considers machine-level technologies to make it intelligent.
Our proposed approach in this article differs in two ways from the existing approaches shown in Table 1. First, we develop a machine learning unit (called a classification engine) that works as a part of the cloud storage system, unlike the discussed ones, which provide a platform where clients can build their own machine learning models using the facilities of that storage system to make its services more advanced. Second, the objective of the proposed classification engine unit is different from the discussed ones, and its working process and architecture vary with the demands of the associated application.
In this article, our main aim is to predict the storage space for the dataset presented in a client's write query request where multiple database models are involved. There are some popular recent research works which use traditional strategies rather than machine learning for handling multiple databases. For example, PolyglotHIS [15] uses a query mapper to divide a query into subqueries for accessing data from multiple data storage systems that come from different database models. A data mapper [39] is used to map a dataset to the corresponding database system by considering the data object's relationship with the database system. The polyglot persistence mediator (PPM) [32] selects the database at run time on the basis of tenant-defined information, which consists of the database schema with node values and SLA information with functional (e.g., ACID transactions, joins, updates), continuous (e.g., availability, latency, throughput) and non-functional (e.g., scalability, elasticity, durability, replication) parameters as inputs. The Wiki-Health platform [18] uses an ontology engine to describe the storage systems by referring to their semantic information. These four research works focus on different directions for working with multiple database models. However, none of the above works is focused on automatic storage space prediction using machine learning. The approach proposed in this work is thus novel to the best of our knowledge.

RSoS system overview
For the experimental analysis, the "object based schema oriented data storage system" (RSoS system) platform is chosen because of its data storage procedure and architecture. Before designing the desired unit for calculating the storage space automatically, we discuss here the related information of the RSoS system. The detailed information of the RSoS system is distributed over two articles. The first article [23] presents a conceptual view of the RSoS system, and the second article [24] discusses the detailed architecture, a functional prototype, and a comparative analysis. The abbreviation of this storage system is introduced in [24].
Object storage procedure: The RSoS system treats data as an object. An object encapsulates data with three overlays, viz., account, container and object. The account is used to identify each user uniquely. The container is a user-defined identifier. The information of these overlays is passed to the RSoS system by the query.
The main task of the data storage procedure is to generate the "object storage space". To create an object storage space, the account, container, local schema and database must be known. This information is object-dependent. After receiving the account-container information with an object, the next task is to find out the database and local schema for this object received as a data query. At this point the system takes the help of the "hypergraph data model", whose contents are written manually.
The hypergraph data model is made up of three parts, viz., node, hyperedge and hyperhyperedge. A node represents an attribute, an account or a status. The attribute is present in the data passed by the query. The status indicates whether the attribute is a key or a non-key. The link between nodes is a hyperedge, and a combination of different hyperedges is a hyperhyperedge. A hyperhyperedge holds the information of the database, i.e., the database name and the device URI where this database is present. All of this information is stored in the graphml.xml of the RSoS system.

Architecture: This storage system has a three-layer architecture, viz., client API, object manipulation unit and storage device unit. Through the client API, users can communicate with the RSoS system. The RSoS system receives three queries, viz., read, write and delete, in JSON format.
The storage device unit consists of several workstations with huge disk sizes. These workstations are run by database management systems. To support the health data variety property, more than one DBMS is involved. The RSoS system uses three database management systems, viz., the Hadoop distributed file system, Cassandra and MongoDB. This unit is enclosed within a RESTful web service. Each individual database management system has a separate, unique URI.
The sub-components of the RSoS system placed between the client API and the storage device unit receive the instruction from the client API as a query and decide which service needs to be provided. To provide the instructed service, they need to know the corresponding object storage space. Here, they take help from graphml.xml and the respective query. Finally, they generate the corresponding query and URI to hit the particular storage device unit.

Classification engine framework
The data classification engine uses machine learning technologies on a set of training data to acquire knowledge about this dataset. This knowledge helps predict the property of a similar type of dataset. To do such operations, a set of units works in parallel, so we call it an engine. Its ultimate goal and working process vary with the demands of the system to which it is associated.

Overview
Because of this application dependency, the RSoS cloud storage platform is used to describe the framework of the desired classification engine. The overall architecture of the designed classification engine unit associated with the RSoS system is shown in Fig. 1. This classification engine framework is built up of four major components: (1) server monitoring unit, (2) artificer, (3) decision maker and (4) computation unit.
The server monitoring unit collects the information of the cloud storage server (e.g., the RSoS system) sub-parts (viz., load, memory, input/output, processor) when a write query hits the system. The artificer is responsible for running the data type prediction jobs. If it finds that the storage space information of the incoming data is already present in the hypergraph data model, then the prediction task stops there.

This classification engine is connected with three components of the RSoS system, viz., the client API, the storage device API generator and the hypergraph data model. The query received from consumers is passed to this classification engine unit through the client API, which also acts as a communication medium between RSoS and consumers. The classification engine informs the storage device API generator about the storage space of the dataset placed within the arrived query. The storage device API generator then converts this query to a device-reachable query and generates the corresponding storage device URI, which varies with the storage space value and the query type (viz., read, write, delete). Finally, the classification engine is responsible for updating the hypergraph data model with the generated storage space information, which is used later by the classification engine to find the storage space for similar types of datasets.

Workflow
To predict the object storage space of the incoming write query, the classification engine follows the workflow model presented in Fig. 2, considering the RSoS cloud storage platform. The proposed classification engine workflow model is designed for predicting the storage space of an arriving client's write request at the moment it hits the storage system. For the experimental purpose, a healthcare dataset is considered that consists mainly of three types of data, viz., document, file, and sensor. We find a few differences between sensor, file, and document data requests. Sensor data is a tuple-structured dataset consisting of data values which are very small in size compared to the others, and the time gap between two data requests is a few seconds or a few minutes. On the other hand, document type data is smaller in size than file type data, consists of a set of key-value pairs, and the time gap may vary from a few minutes to hours or even days. File type data is larger than or the same size as a document type data request. Also, between two sensor type data requests, the database servers can receive document type or file type data requests. No doubt these characteristics affect the system property values (i.e., input/output, network, CPU, and memory). The server monitoring unit observes the server status values related to the resource performance metrics (e.g., memory, processor, input/output and load) [31] when a JSON query hits the RSoS server. The generated server-status value set corresponding to the same query is not the same every time, but the fluctuation across different queries is negligible. These characteristics help us decide the class of an unknown query.
The artificer acts as an investigator that looks for the object storage space of the incoming write query in the graphml.xml file. If it achieves the desired aim, then the classification engine stops its tasks. Otherwise, it starts the decision maker activity to move forward. The decision maker is the combination of two activities, a learning technique and a classifier.
The classifier communicates with the learning technique to decide the class value (i.e., data type) of the dataset present in this unknown write query. This process goes through certain steps. First, the learning technique accomplishes its job offline. It trains on a set of server monitoring datasets whose class values are known and finds a set of features for doing the classification. Here, we consider three class values: (1) file, (2) sensor, (3) document. Second, using the selected features, the classifier decides the class value of the received monitoring dataset in real time.
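As a concrete illustration of this two-step process, the following Python sketch trains a feature selector and a K-NN classifier offline on labelled monitoring data and then classifies a new query online. The data here are synthetic placeholders, and scikit-learn's ANOVA F-test selector is used only as a stand-in for the compared selectors (Fisher score, F-score, Lll21).

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labelled server-monitoring data (18 features per query).
rng = np.random.default_rng(0)
X_train = rng.random((60, 18))
y_train = np.repeat(["sensor", "file", "document"], 20)

# Offline step: select the most discriminative features and train the classifier on them.
decision_maker = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),   # keep the top 5 of the 18 monitored features
    KNeighborsClassifier(n_neighbors=5),
)
decision_maker.fit(X_train, y_train)

# Online step: predict the data type of an unseen query from its monitored feature values.
x_new = rng.random((1, 18))
print(decision_maker.predict(x_new))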
The computation unit generates the object storage space for the unknown arriving write query. For generating the object storage space, the object key value, the attribute names and the database name need to be known. The "received time" value varies with every incoming query, so, by default, time is considered as the object key. After receiving the write query, the computation unit knows the attribute names. From the class value, collected from the decision maker, the database name is known. Each class value corresponds to a database name, such as "HadoopDFS" for "File", "Cassandra" for "Sensor" and "MongoDB" for "Document". The computation unit updates the graphml.xml file with the corresponding predicted object storage space information and, at the same time, sends this information to the artificer.
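A minimal sketch of this mapping step is shown below; the function name and the in-memory representation of the object storage space are illustrative assumptions, and the actual graphml.xml update is omitted.

# Sketch of the computation unit's class-to-database mapping; the returned dictionary is an
# illustrative stand-in for the object storage space written into graphml.xml.
CLASS_TO_DATABASE = {
    "file": "HadoopDFS",
    "sensor": "Cassandra",
    "document": "MongoDB",
}

def build_object_storage_space(account, attribute_names, predicted_class, received_time):
    """Assemble the object storage space for a write query from the predicted class value."""
    return {
        "account": account,
        "object_key": received_time,      # the received time is taken as the object key by default
        "attributes": attribute_names,    # taken directly from the write query
        "database": CLASS_TO_DATABASE[predicted_class],
    }

space = build_object_storage_space("U104", ["id", "remotefiledata", "fileextension"],
                                   "file", "2021-06-01T10:15:30")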

Experiments
When a dataset hits the RSoS server, it comes as a query request. These queries are of three types, viz., write, read and delete. Here, for the experimental purpose, the write query is considered.
During the experiments, three types of data are used, viz., sensor, document and file, as shown in Table 3. Real health sensors send tuple-structured datasets within write queries, presented as sensor type data. These data arrive continuously at short intervals. Here, we consider five types of sensors: Temperature, ECG, Pulse, Airflow and Oxygen. Consumers are mainly health workers. Consumers send document or file type datasets to the RSoS server in write query format. For practical simulation, three consumers, who send medical images (as file type), patients' photos (as file type) and patients' detail information (as document type), are considered individually.
Table 2 shows a comparison of different technical parameters between our RSoS system and four other machine learning based storage solutions. It can be seen that none of the other solutions handles more than one kind of data format and three different types of database models under a single platform, as RSoS does. Also, two recent technologies, viz., Ganglia and REST API, have been used to set up the experiments for RSoS, while the others use classical technologies like RAM, LRU, DLR, etc. Some storage solutions use benchmarks to set up their experimental parameters and measure their task performance. In our case, the parameters are chosen according to the components selected for the classification engine unit. These elements make us rethink the design of the experimental components.

RSoS server monitoring
To observe the experimental scenario, the Ganglia 3.6.0 [21] plug-ins are used. As an example, we run the scenario for 1 h, which generates 50 KB of data as sample data for the experimental execution. Shorter or longer runs of the scenario will generate a smaller or larger amount of sample data, but the ratio (1:1:1) of the three involved types of data (file, document, and sensor) stays the same. Also, the training model is fixed here. The effect of training data size variation on the performance of the classification engine (accuracy, precision, and recall) is found to be very small, provided we have a good representation of all data types in the training set. The RSoS system runs on our local workstation in this experiment. Ganglia starts monitoring the server status of the RSoS system in four parts, viz., processor, memory, input/output and load [31], from the first query request and stops at the last query request. Each time Ganglia finds a query request, it generates the corresponding monitored value matrix. The dimension of this matrix is (query number × 18). Here, 18 is the number of properties showing the status values of the system sub-parts of the RSoS server node (memory, input/output, load, CPU) for that query. Each query has a different set of property values, but the property names are the same for every query. This acts as the set of properties or features and helps the classification engine distinguish the datasets according to their types. If we analyze the property value set from query to query, we can see minor differences.
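A rough sketch of how such a monitored feature row could be collected per query is given below; it assumes a local gmond daemon exposing its XML metric dump on the default TCP port 8649, and the 18-item feature list is our assumption, since the article names only some of the monitored properties.

import socket
import xml.etree.ElementTree as ET

# Assumed list of 18 Ganglia properties used as features; only some are named in the article.
FEATURES = ["load_one", "load_five", "load_fifteen", "proc_run", "proc_total",
            "mem_free", "mem_cached", "mem_buffers", "mem_shared", "swap_free",
            "bytes_in", "bytes_out", "pkts_in", "pkts_out", "disk_free",
            "cpu_user", "cpu_system", "cpu_idle"]

def snapshot(host="localhost", port=8649):
    """Read one metric snapshot from gmond and return a single feature row for a query."""
    with socket.create_connection((host, port)) as sock:
        data = b"".join(iter(lambda: sock.recv(4096), b""))
    tree = ET.fromstring(data)
    values = {m.get("NAME"): float(m.get("VAL"))
              for m in tree.iter("METRIC") if m.get("NAME") in FEATURES}
    return [values.get(name, 0.0) for name in FEATURES]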

Monitored dataset processing
After collecting the RSoS server monitoring dataset, we need to make the classification engine learn so that it can predict the object storage space of an incoming data query. For experimental reasons, we divide the collected dataset into two parts, a training dataset and a test dataset. Both the training and test datasets are in .csv format, and one more column, representing the class, is added.
Three types of write queries are used, one for each type of data, viz., file data, sensor data and document data. After the monitored dataset processing task, a set of rows with 18 columns is generated corresponding to each type of write query. One more column, named 'class', is added to this data. The corresponding data type of the write query is written under the class attribute. For example, for file data the class value is file, for sensor data the class value is sensor, and for document data it is document. For the training dataset, the class value is known, but for the test dataset it is to be predicted.
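A minimal sketch of this processing step is given below; the file name rsos_monitoring.csv and its column layout are assumptions used only for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file: one row of 18 monitored properties per query plus the query's data type.
monitored = pd.read_csv("rsos_monitoring.csv")
monitored = monitored.rename(columns={"query_data_type": "class"})   # label column for training

train, test = train_test_split(monitored, test_size=0.3, stratify=monitored["class"])
train.to_csv("train.csv", index=False)
test.drop(columns="class").to_csv("test.csv", index=False)   # test data keeps only feature values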

As an example, Consumer2 sends a patient medical image using a write query of the following form:

{"account":"U104", "operation":"Store", "container":"ecg", "datagroups":{"id":"BenchmarkGain1", "remotefiledata":"xyz....", "fileextension":"png"}}

In this experiment, we apply feature selection algorithms which select features (or properties) based on label information. Here, three feature selection algorithms from three different sub-categories are used: (1) Fisher score, (2) F-score and (3) Lll21. Fisher score [44] is a supervised similarity-based feature selection algorithm which is fast and inexpensive. Each feature holds an individual score value computed using the Fisher criterion, and a higher score value indicates a more relevant feature compared to the others. At the same time, it selects those properties whose values are similar within the same class and dissimilar across different classes. F-score [5] is a supervised statistical feature selection criterion which is based on the correlation between the input variables. It is fast and efficient because this process selects those properties which have a strong statistical relationship with the target variable. Lll21 [19] is a supervised sparse learning-based feature selection algorithm. It selects properties using sparsity-inducing regularization terms which make feature coefficients small or zero. This feature selection algorithm considers the concept of multiple classes or multiple targets at the time of feature selection. This model is cheaper and can work on smaller sets of input variables.
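For illustration, a generic Fisher score can be computed as the ratio of between-class to within-class variance per feature, as in the following sketch; this is a plain textbook variant and not necessarily the exact formulation of [44].

import numpy as np

def fisher_score(X, y):
    """Generic Fisher score per feature: between-class variance over within-class variance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        numerator += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        denominator += len(Xc) * Xc.var(axis=0)
    return numerator / np.maximum(denominator, 1e-12)   # higher score = more relevant feature

# Rank the 18 monitored properties and keep the top 5, as done in the experiments:
# top5 = np.argsort(fisher_score(X_train, y_train))[::-1][:5]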

Dataset classification
In the next step, our target is to divide the collected dataset on the basis of their data types or class values. For the training dataset, the class values are known. Three data types, viz., sensor, file and document, represent three class labels. The classifier learns from the training dataset, which is divided into three classes. This knowledge is applied to predict the class value of the test dataset.
We consider three supervised classification algorithms for the experiments: K-nearest neighbors (K-NN), support vector machine (SVM), and neural network (NN). K-NN [7] follows the "closest point search" mechanism, which assigns unknown data points to the class in which the majority of their K nearest neighbors fall. It is a popular supervised non-parametric classifier used for prediction. In our experiment, 5 is chosen as the value of K because an odd value of K gives a clear majority in the prediction output. SVM [33], also known as a "discriminative classifier", calculates the optimal hyperplane separating the classes in the training dataset and, using this hyperplane, finds the class of unknown data points. It performs efficiently on multi-dimensional datasets. Here, three types of SVM are used: linear SVM, radial basis function (RBF) SVM and polynomial (Poly) SVM. In this experiment, linear SVM, RBF SVM and Poly SVM use 1.0 as the value of C (the regularization parameter), RBF SVM and Poly SVM use 2.0 as the gamma value, and Poly SVM is assigned 2.0 as the degree value.
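Under these settings, the K-NN and SVM classifiers could be instantiated in scikit-learn roughly as follows; this is a sketch of the stated hyperparameters, not the exact implementation used in the experiments.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Classifier settings as stated in the text.
knn = KNeighborsClassifier(n_neighbors=5)
svm_linear = SVC(kernel="linear", C=1.0)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma=2.0)
svm_poly = SVC(kernel="poly", C=1.0, gamma=2.0, degree=2)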
A neural network [37] mimics the mechanism of the human mind. From the training dataset it generates rules or functionality in a multi-layer way and applies these rules to find the class of unknown data points. It is able to work with incomplete information. In our experiment, the neural network uses three hidden layers with thirty neurons in each layer. The activation function used is ReLU, and the weight optimization algorithm for the hidden layers is Adam. Two hundred epochs are used here.
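In scikit-learn terms, such a network corresponds roughly to the configuration below, with max_iter standing in for the number of training epochs; this is a sketch, not the exact code used in the experiments.

from sklearn.neural_network import MLPClassifier

# Three hidden layers of thirty neurons, ReLU activation, Adam optimizer, 200 epochs.
mlp = MLPClassifier(hidden_layer_sizes=(30, 30, 30), activation="relu",
                    solver="adam", max_iter=200)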

Dataset prediction
After acquiring the classification knowledge, the next target of the classification engine is to predict the data type of the incoming dataset. For this purpose, the SVM, neural network and K-NN classifiers are used. The dataset prediction procedure goes through certain steps.
In the first step, the classifiers identify the class value (sensor, file or document) of the dataset present in the incoming query request. After that, the classification engine generates the corresponding object storage space of this query and writes the related information into the hypergraph data model if it is not already present there.
The task of deciding the probable object storage space structure corresponding to the predicted class value is done by the computation unit of the classification engine shown in Fig. 2. To generate the object storage space, the computation unit needs a little information, such as the account value, the list of attributes of the dataset, the name of the storage device and the corresponding storage device URI, according to the structure of the hypergraph data model shown in Listing 1. Three storage devices are used in this article, viz., Cassandra, MongoDB, and HadoopDFS. These storage devices run their own database models. According to their supporting characteristics, Cassandra is able to handle large-scale structured data, MongoDB document data, and HadoopDFS file data. In this experiment, some predefined constraints are followed. Sensor data has three attributes: patientid, time, and data value; document data has time and document attributes; and file data has id and file attributes. As mentioned in Listing 1, each storage device has a unique URI. The task of the computation unit is to collect the above information against the predicted class values and write it into the graphml.xml file in the form of an object storage space value.

Table 4 shows that different storage technologies use individual machine learning approaches and parameters to improve the performance of their storage solutions. It can be seen that mainly classification based technologies have received much attention for this kind of prediction task. The sharpness of the RSoS system also depends upon the performance of the designed classification engine unit. Hence, three performance parameters of the classification engine are considered; they provide a better analysis compared with the other storage technologies, which do not bother about the performance of the building elements. The input parameters, or the selection of the feature set, are important for any kind of classification task. It can be seen from the table that the RSoS system uses more important features as input parameters than any other storage solution. This helps in designing the classification engine for the RSoS system. For this important task, we need to consider a suitable feature selection algorithm and the corresponding classifier. As per the results shown in the next section, the Lll21 feature selection algorithm and the K-NN classifier are the most suitable components for this classification engine.

Results and discussion
Two main components are needed to design the classification engine: one is a feature selection algorithm and the other is a classifier. So, we make a comparative study of feature selection algorithms and classifiers to find a suitable combination for this classification engine. The results are shown in Table 6.
According to the experimental setup, the input to the feature selection algorithm is the output generated by Ganglia. When the RSoS server receives any query request, the request passes through Ganglia, and the feature selection algorithm selects some features from this output. Here, the top 5 features are selected from the 18 features. The feature sets selected by the three mentioned feature selection algorithms are presented separately in Listing 9.
A selected feature set contains the attributes of the dataset that are chosen as important by a feature selection algorithm for distinguishing the different classes of data types. Among the features selected by the Fisher score and F-score algorithms, proc_total and proc_run signify the process status, i.e., the total number of processes and the number of running processes, respectively. The features load_five and load_one give the load status during the last 5 minutes and 1 minute, respectively. Also, the feature disk_free denotes the total free disk size. On the other hand, the feature set selected by the Lll21 algorithm includes mem_cached, mem_buffers and mem_free, which describe the memory status, namely the cache, buffer and free memory sizes, respectively, along with bytes_in and bytes_out, which represent the number of input and output bytes, respectively. These features are used by the classifiers to identify the class value of the test dataset.
Fisher score: { proc_total, proc_run, load_five, load_one, disk_free }
F-Score: { proc_total, proc_run, load_five, load_one, disk_free }
Lll21: { mem_cached, mem_free, mem_buffers, bytes_in, bytes_out }

Listing 9 Selected feature sets by the specified feature selection algorithms

The results are shown in two separate tables. Table 5 is for the test dataset, where it holds only the feature values. Table 6 consists of the original class values together with the predicted class values. The pointer columns of Tables 5 and 6 are used to build a bridge between these two tables. If we analyse Tables 5 and 6, we find that some rows are marked and some are unmarked. The input feature values of the marked data rows are also present in the training dataset.
The class column of Table 6 represents the original class values of the input dataset. In Table 6, we also present the outputs of the three feature selection algorithms, viz., Fisher score, F-score and Lll21, after applying the three different SVMs (linear SVM, Poly SVM, RBF SVM), the neural network and the K-NN classifier, denoted as Pfisher_Linear, Pfisher_Poly, Pfisher_RBF, Pfisher_Neural, and so on. Accuracy, precision and recall are used to judge the performance of the designed classification engine [14,36], as shown in Table 7. These metrics are measured based on the test data prediction. The values of these metrics vary with the applied algorithms (the combination of feature selection algorithm and classifier). Accuracy measures the correctness of the prediction. Precision is used to measure the positive predictive value, and recall calculates the true positive rate. These two are computed based on the class items.
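The three metrics can be computed per class item as sketched below; the y_true and y_pred vectors are toy placeholders, not the actual experimental outputs.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_true holds the original class values, y_pred the classifier's predictions (toy example).
y_true = ["sensor", "file", "document", "sensor", "file", "document"]
y_pred = ["sensor", "file", "document", "sensor", "document", "document"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average=None, labels=["document", "file", "sensor"])
recall = recall_score(y_true, y_pred, average=None, labels=["document", "file", "sensor"])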
The decision maker is the main backbone of the classification engine unit. The whole system may break down due to a wrong component selection for the decision maker. Figure 3 presents the visualization of the performance metric values, which helps make a fair decision. Here, three sub-plots present the comparison of the precision (Fig. 3a), recall (Fig. 3b), and accuracy (Fig. 3c) values between the feature selection algorithms and classifier algorithms. Precision and recall are calculated based on the three different data types (document, file, and sensor).
The pair of the Lll21 feature selection algorithm and the K-NN classifier gives a 100% accuracy rate. The Fisher score feature selection algorithm gives a very low accuracy rate (13.33%) with the neural network classifier. The combination of the Lll21 feature selection algorithm and the K-NN classifier has the highest precision and recall values, i.e., 100%, with respect to the three class items.

Conclusions and future work
In this article, we have proposed an idea to convert any object based storage system like RSoS into an automated system by automatically identifying the storage spaces for the incoming queries. For this purpose, we have built a classification engine which is able to categorize data based on their structure format. The RSoS system stores data based on its structure by implementing the object storage space, a combination of the database name, the data attribute names and the object key. The storage system considered here can accommodate three types of data (i.e., sensor, document, file) under a single storage platform. The association of this classification engine with the RSoS system enables it to predict the corresponding object storage space for the incoming queries without manual intervention.
The efficacy of the classification engine depends upon the performance of its two main building blocks, which are a feature selection approach and a classifier. Three supervised feature selection algorithms from three categories, viz., Fisher score from the similarity-based, F-score from the statistical-based, and Lll21 from the sparse learning-based approaches, have been compared. Side by side, three classifiers, viz., SVM, K-NN, and neural network, are used with the mentioned feature selection algorithms. After analyzing the results, the Lll21 feature selector combined with the K-NN classifier is found to provide the best service.

In future, our next aim is to extend the functionality of the classification engine in such a way that it can provide more prominent results in less time. Also, in the current setup, the training part of the engine is done offline, which creates a deficiency in some situations: if the system is trained with wrong information or the system predicts wrong data at test time, there is no chance to rectify it later. We plan to explore the possibilities of alleviating these problems through online training when needed.

Fig. 1
Fig. 1 Architecture of RSoS system for automatic storage space prediction

Fig. 2
Fig. 2 Workflow model of the proposed classification engine

Fig. 3
Fig. 3 Comparative study of the performance values of the components of the decision maker

Table 1
Comparative analysis on systems overview of ML based storage solutions

Table 2
Comparative study on technical parameters of ML based storage solutions

Table 4
Comparative study on experimental parameters of ML based storage solutions

Table 6
Original class values of the test dataset together with the predicted class values

Table 5
Example of feature values used for testing purposes
The rows consisting of bold values are also present in the training dataset.