# A Compression Technology for Effective Data on Cloud Platform


## Abstract

More and more application systems are running on cloud platforms, producing large amounts of effective data every day. In order to preserve these data and make full use of the storage space, the effective data must be compressed, and the compressed data, when necessary, must be recovered correctly. Meanwhile, the original data contain many equivalent data item values (or values equivalent within the system error), so it is not appropriate to compress the effective data directly. In order to make full use of the storage space and correctly recover the original data, a new method is proposed: before compression, the effective data are preprocessed, and the processed data are then compressed with Huffman coding; recovery reverses the compression process. Experiments show that this method has the advantages of fast compression speed, high compression ratio and lossless recovery.

## Keywords

Cloud platform, Block, Classification, Reorganization, Huffman coding

## 1 Introduction

With the further application of cloud computing technology, more and more application systems are running on cloud platforms, producing all kinds of effective data every day. So far, there is no perfect way to solve the problem that data on the cloud platform require a large amount of storage space [1]. Data compression can be used to reduce storage space and improve the efficiency of transmission and storage [2, 3]. Data compression is divided into lossy compression and lossless compression. Lossy compression exploits the fact that human beings are not sensitive to certain frequency components in images or sound [4, 5], allowing some loss of information in the compression process. It is widely used in the fields of speech, image and video; methods widely used in this sphere include data fitting and line interpolation [8]. In order to keep the data correct and complete, lossy compression is not suitable for the effective data on the cloud platform. In contrast, lossless compression is based on statistical redundancy, and no loss of information is allowed during compression. It is widely used to deal with accurate data in the fields of text [6, 7], programs and images. Thus, traditional lossless compression methods can be used in this article. But it is well known that the compression ratio achieved by traditional compression methods alone is low. In order to get a better compression ratio, we make some improvements on the foundation of these methods [8, 9].

## 2 Related Work

All kinds of data are produced by applications on the cloud platform. Here, the data sheet is chosen as the object of study. There are many identical or similar data items within the same column of a data sheet, and they occupy a large amount of memory. To solve this problem, a new idea is proposed. First, the valid data on the cloud platform are divided into blocks, from which some data blocks are selected; then the records in these data blocks are sorted and reorganized. Finally, the reorganized data are compressed with Huffman coding [10, 11].
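Since the final step of the pipeline is standard Huffman coding [10, 11], a minimal encoder can be sketched for reference. This is an illustrative sketch, not the paper's implementation; the names `build_codes` and `encode` are our own.

```python
import heapq
from collections import Counter

def build_codes(data: bytes) -> dict:
    """Build a Huffman code table (symbol -> bit string) from byte frequencies."""
    freq = Counter(data)
    # Heap entries are (frequency, tie_breaker, node); a node is either a
    # symbol (int) at a leaf or a (left, right) pair at an internal node.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Degenerate input with a single distinct symbol
        return {heap[0][2]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: assign the code
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

def encode(data: bytes) -> str:
    """Encode data as a string of '0'/'1' characters."""
    codes = build_codes(data)
    return "".join(codes[b] for b in data)
```

A full codec would also serialize the code table alongside the bit string so that the decoder can rebuild the tree.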

In general, the number of data items shared by different records is not less than one. In this paper, if the number of data items shared by two records is one, those records are considered to belong to the first category. The purpose of this classification is to reduce the storage space required in the next step.

In Fig. 1, 1 stands for the first class, 2 for the second class and 3 for the third class.

Before recombination:

Block | \(f_{1}\) | \(f_{2}\) | \(f_{3}\) | \(f_{4}\) | \(f_{5}\)
---|---|---|---|---|---
A | \(y_{11}\) | \(y_{12}\) | \(y_{13}\) | \(y_{14}\) | \(y_{15}\)
B | \(y_{21}\) | \(y_{22}\) | \(y_{23}\) | \(y_{24}\) | \(y_{25}\)
C | \(y_{31}\) | \(y_{32}\) | \(y_{33}\) | \(y_{34}\) | \(y_{35}\)

After recombination:

Block | \(f_{1}\) | \(f_{2}\) | \(f_{3}\) | \(f_{4}\) | \(f_{5}\)
---|---|---|---|---|---
D | \(y_{11}\), \(y_{21}\), \(y_{31}\) | \(y_{12}\) | \(y_{23}\) | \(y_{14}\), \(y_{24}\), \(y_{34}\) | \(y_{15}\), \(y_{25}\), \(y_{35}\)

D is the new record. It is easy to see that four data items have been eliminated, so the storage space of D is less than the sum of the storage space of A, B and C.
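The recombination illustrated above can be sketched as follows (a simplified reading of the paper's scheme with our own function name): attributes whose values agree across all selected records are stored once, and a mask records where the sharing occurred.

```python
def merge_records(records):
    """Merge equal-length records: attributes whose value agrees across
    all records are stored once; other attributes keep every value."""
    n_attrs = len(records[0])
    merged, shared_mask = [], []
    for j in range(n_attrs):
        column = [r[j] for r in records]
        if len(set(column)) == 1:       # value shared by every record
            merged.append(column[0])
            shared_mask.append(1)
        else:                           # keep all values, in record order
            merged.extend(column)
            shared_mask.append(0)
    return merged, shared_mask
```

With three records of five attributes each sharing the second and third attribute values, the merged record holds 11 values and the mask reads `[0, 1, 1, 0, 0]`, matching the '01100' code in the example above.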

Each new record is appended with an *n*-bit binary code. The high *m* bits indicate the blocks from which the records generating the new record come: *m* is equal to the number of data blocks selected, and each bit in the *m*-bit code represents a block, where '1' means that a record from this block shares data items. In the example above, since the records were gained from three blocks, '111' can be used to show the process of recombination and the source of the records. The low *n − m* bits show which data items are duplicated in the new record: *n − m* is equal to the number of attributes occupied by a record, and the value '1' in the *(n − m)*-bit code means that there is a shared data item in that place. For the example above, '01100' shows which data items are shared by those blocks. Thus, the new record D should be extended with this code.

Block | \(f_{1}\) | \(f_{2}\) | \(f_{3}\) | \(f_{4}\) | \(f_{5}\) | Flag
---|---|---|---|---|---|---
D | \(y_{11}\), \(y_{21}\), \(y_{31}\) | \(y_{12}\) | \(y_{23}\) | \(y_{14}\), \(y_{24}\), \(y_{34}\) | \(y_{15}\), \(y_{25}\), \(y_{35}\) | \(y_{new}\)

In this example, the value of *n* is 8, the value of *m* is 3 and \(y_{new}\) = 115, whose binary code is '01110011'. The high 3 bits, '011', mean that the new record was transformed from two raw records, which originated from the second and third data blocks. The low 5 bits, '10011', mean that the values under the first, fourth and fifth data items are equal in those raw records. So, '01110011' means that the new record was transformed from two raw records in which the values under the first, fourth and fifth data items are equal [16, 17].
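Under the stated bit layout, the flag value can be decoded with simple bit operations. This is an illustrative sketch; `unpack_flag` is our own name.

```python
def unpack_flag(y_new: int, n: int, m: int):
    """Split an n-bit flag into the high m bits (source blocks) and the
    low n - m bits (shared attributes), reading bits left to right."""
    low_width = n - m
    high = y_new >> low_width
    low = y_new & ((1 << low_width) - 1)
    # Bit i of the high part (from the left) marks block i + 1 as a source.
    source_blocks = [i + 1 for i in range(m) if (high >> (m - 1 - i)) & 1]
    # Bit j of the low part (from the left) marks attribute j + 1 as shared.
    shared_attrs = [j + 1 for j in range(low_width)
                    if (low >> (low_width - 1 - j)) & 1]
    return source_blocks, shared_attrs
```

Running it on `0b01110011` with `n = 8, m = 3` yields blocks 2 and 3 as the sources and attributes 1, 4 and 5 as shared, matching the worked example.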

Before recovery:

Block | \(f_{1}\) | \(f_{2}\) | \(f_{3}\) | \(f_{4}\) | \(f_{5}\) | Flag
---|---|---|---|---|---|---
D | \(y_{11}\), \(y_{21}\) | \(y_{12}\) | \(y_{23}\) | \(y_{14}\), \(y_{24}\) | \(y_{15}\), \(y_{25}\) | \(y_{new}\)

After recovery:

Block | \(f_{1}\) | \(f_{2}\) | \(f_{3}\) | \(f_{4}\) | \(f_{5}\)
---|---|---|---|---|---
A | \(y_{11}\) | \(y_{12}\) | \(y_{13}\) | \(y_{14}\) | \(y_{15}\)
B | \(y_{21}\) | \(y_{22}\) | \(y_{23}\) | \(y_{24}\) | \(y_{25}\)
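Recovery reverses the recombination. A sketch of the inverse step, assuming the merged values are stored attribute by attribute and the shared-attribute mask is available (the function name and storage convention are our own):

```python
def recover_records(merged, shared_mask, n_records):
    """Inverse of the recombination: expand each shared value back into
    every record, and distribute unshared values in record order."""
    records = [[] for _ in range(n_records)]
    pos = 0
    for shared in shared_mask:
        if shared:
            for r in records:           # one stored value serves all records
                r.append(merged[pos])
            pos += 1
        else:
            for r in records:           # one stored value per record
                r.append(merged[pos])
                pos += 1
    return [tuple(r) for r in records]
```

For two records sharing the second and third attribute values, an eight-value merged record expands back into the two original five-value records.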

The operation of classification and reorganization can be used to reduce the storage space, which can be verified theoretically.

This analysis suggests that the data processing step is effective; in the best case, the storage space that can be saved is \(\frac{{100(Nn_{2} - n_{2} - 1)}}{{Nn_{2} }}\%\).
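Reading \(N\) as the number of records merged into one new record and \(n_{2}\) as the number of attributes per record, the formula corresponds to the best case in which every attribute value is shared, so the merged record stores the \(n_{2}\) values once plus a single flag value. This is our interpretation of the symbols, since the derivation is not reproduced here:

```python
def best_case_saving(N: int, n2: int) -> float:
    """Best-case storage saving in percent: all n2 attribute values are
    shared by all N records, so the merged record stores the n2 values
    once plus a single flag value instead of N * n2 values."""
    original = N * n2
    merged = n2 + 1
    return 100 * (original - merged) / original
```

For N = 3 and n2 = 5, as in the example above, the saving is 100 × (15 − 6) / 15 = 60%.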

## 3 Results and Discussion

The experimental data sets:

Data sheet | Space (kB) | Records | Data items per record
---|---|---|---
SystemData1 | 23.6 | 780 | 5
SystemData2 | 27.2 | 900 | 5
SystemData3 | 30.8 | 1020 | 5
SystemData4 | 34.2 | 1140 | 5
SystemData5 | 38.1 | 1260 | 5
SystemData6 | 41.7 | 1380 | 5
SystemData7 | 45.4 | 1500 | 5
SystemData8 | 53.9 | 1800 | 5

The larger the value of *N* is, the smaller the storage space of the new records is. This can be tested by experiment: SystemData8 is chosen and the value of *n* is set to 20. The storage space of the new records when the value of *N* is 2 or 3 is shown in Table 2. It is easy to see that the storage space of the new records when *N* is 3 is less than that when *N* is 2. Table 2 suggests that the number of blocks chosen each time should be set relatively large.

Memory space under different numbers of data blocks:

Blocks (\(N\)) | Storage space of new records (kB)
---|---
2 | 45.3
3 | 43.7

The number of records in each block, *n*, is set to 5, 10, 15, 20, 25, 30, 40, 50 and 60. First of all, the result of classification for the data in SystemData8 is shown in Table 3, which is used to test the correctness of classification. Then, as the number of records in each block changes, the storage space occupied by the new records varies greatly, as shown in Fig. 2. In Fig. 2, "The number of records", the label on the X axis, is the number of records in each block; "Space", the label on the Y axis, is the total storage space for each setting.

The result of classification:

Records in each block (\(n\)) | First class | Second class | Third class
---|---|---|---
30 | 15, 16, 20 | 3, 3, 1, 1, 1 | 303, 200, 63, 11
40 | 17, 16, 27, 12 | 1, 1, 1, 1, 2 | 338, 180, 46, 4
50 | 16, 14, 45, 31 | 2, 2, 2, 1, 1 | 346, 154, 45, 4
60 | 23, 10, 33, 20 | 4, 1, 1, 3 | 348, 163, 34, 7

Table 3 suggests that when the number of records in each block varies, the results of classification differ. It is easy to verify that the sum of the quantity of records in each class meets the requirement of Eq. (2); that is, the result of classification can be used to recover the number of original records as introduced above. Take the first row of Table 3 for example: \(15 + 16 + 20 + (3 + 3 + 1 + 1 + 1) \times 2 + (303 + 200 + 63 + 11) \times 3 = 1800,\) which, to some extent, proves the correctness of classification and recombination. Meanwhile, Fig. 2 shows that the storage space occupied by the new records is less than the storage space of the original data. When the number of records in each block is 5, the storage space of the new records is the best. But as the quantity of records in each block increases, the storage space occupied by the new records also increases; when the number of records in each block is 60, the result is the worst. The percentage of memory space reduced for SystemData2 before and after reorganization is 33.8, 32.7, 31.6, 28.7, 30.1, 28.7, 27.6 and 28.3%, respectively. The percentage for SystemData8 is 24.6, 23.5, 20.5, 19.2, 18.1, 18.5, 17.4, 17 and 16.4%, respectively. This means that the quantity of records in each data block shouldn't be too large; a suitable quantity of records in each block can be selected between 5 and 30.
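The bookkeeping in that check can be reproduced directly. This reflects our reading of Table 3: first-class counts stand for single records, second-class groups merge two records, and third-class groups merge three.

```python
# Counts read from the first row (n = 30) of Table 3
first = [15, 16, 20]         # first-class entries: one record each
second = [3, 3, 1, 1, 1]     # second-class groups: two records merged
third = [303, 200, 63, 11]   # third-class groups: three records merged

total = sum(first) + 2 * sum(second) + 3 * sum(third)
assert total == 1800  # the number of records in SystemData8
```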

In Fig. 3, the storage space of all the new records is distinctly less than that of the original data, and the percentages of dropped storage space are 57.6, 30.1, 19.8, 19.6, 24.7, 24.0 and 26.4%, which shows that classification and recombination are effective in reducing storage space. Meanwhile, the smallest percentage of dropped storage space, 19.6%, is gained from SystemData4. The reason why this percentage is so small is that there are more records belonging to the second class in SystemData4 than in the other data sheets in Table 1. It is thus not difficult to see that the method of classification and recombination is effective.

The comparison of compression ratio:

Data sheet | Original before compression (kB) | Original after compression (kB) | Original compression ratio (%) | New records before compression (kB) | New records after compression (kB) | New records compression ratio (%) | Relative compression ratio of raw data (%)
---|---|---|---|---|---|---|---
SystemData1 | 23.6 | 10.05 | 42.60 | 10.0 | 5.00 | 49.95 | 21.17
SystemData2 | 27.2 | 11.30 | 41.55 | 19.0 | 8.60 | 45.27 | 31.62
SystemData3 | 30.8 | 13.02 | 42.27 | 24.7 | 10.83 | 43.83 | 35.15
SystemData4 | 34.2 | 14.31 | 41.86 | 27.5 | 11.74 | 42.68 | 34.32
SystemData5 | 38.1 | 15.72 | 41.27 | 28.7 | 12.50 | 43.56 | 32.81
SystemData6 | 41.7 | 17.20 | 41.25 | 31.7 | 13.56 | 42.78 | 32.52
SystemData7 | 45.4 | 18.57 | 40.90 | 33.4 | 14.45 | 43.27 | 31.83

In Table 4, the storage space of the compressed original data in the third column is 10.05, 11.30, 13.02, 14.31, 15.72, 17.20 and 18.57 kB, while the storage space of the compressed new records in the sixth column is 5.00, 8.60, 10.83, 11.74, 12.50, 13.56 and 14.45 kB, which is less in every case. The compression ratio of the original data is 42.60, 41.55, 42.27, 41.86, 41.27, 41.25 and 40.90%. Meanwhile, the compression ratio of the whole new records relative to the raw data, in the eighth column, is 21.17, 31.62, 35.15, 34.32, 32.81, 32.52 and 31.83%, which is obviously less than that of the original data in the fourth column. The percentage reduced with the new method is 21.43, 9.93, 7.12, 7.54, 8.46, 8.73 and 9.07%. The data in Table 4 show that the method mentioned above is effective. This also indicates that data partition, classification and reorganization reduce the memory space occupied by the data sets and improve the compression efficiency.
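The compression-ratio columns are consistent with defining the ratio as compressed size divided by uncompressed size, in percent. This is our inference from the numbers; the paper does not state the formula explicitly.

```python
def compression_ratio(after_kb: float, before_kb: float) -> float:
    """Compressed size as a percentage of the uncompressed size
    (smaller is better under this convention)."""
    return 100 * after_kb / before_kb
```

For SystemData1's new records, 100 × 5.00 / 10.0 = 50.0%, close to the 49.95% in the table; the small gap presumably comes from rounding the reported sizes.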

## 4 Conclusion

The appropriate number of records in each block and the appropriate number of blocks chosen each time are determined according to the equation and the experiment. After determining these, the changes in the memory space occupied by the system data before and after classification and reorganization are studied. Finally, Huffman coding is used to compress both the unprocessed raw data and the reprocessed data. Compared with the traditional compression method using Huffman coding alone, the method used in this paper did reduce the storage space, and the percentage reduced with the new method was not less than 7.12%. So the compression ratio of the new method is better than that of the traditional compression method with Huffman coding.

Data recovery is the reverse process of data compression. By understanding the principle and process of data compression described in the text, it is easy to restore the original data, so the process of data recovery is not elaborated here. This method is suitable not only for compression of a single file, but also for compression of multiple correlated files and even multiple system data sets. It has good extensibility.

## References

- 1. J. Duda, K. Tahboub, N. J. Gadgil and E. J. Delp, The use of asymmetric numeral systems as an accurate replacement for Huffman coding, *Picture Coding Symposium*, pp. 65–69, 2015.
- 2. W. Wang and W. Zhang, Adaptive spatial modulation using Huffman coding, *IEEE Global Communications Conference (GLOBECOM)*, pp. 1–6, 2016.
- 3. A. M. Rufai, G. Anbarjafari and H. Demirel, Lossy medical image compression using Huffman coding and singular value decomposition, *Signal Processing and Communications Applications Conference (SIU)*, pp. 1–4, 2013.
- 4. H. Abid and S. Qaisar, Distributed video coding for wireless visual sensor networks using low power Huffman coding, *44th Annual Conference on Information Sciences and Systems (CISS 2010)*, pp. 1–6, 2010.
- 5. W. Wei, Y. K. Liu, X. D. Duan and C. Guo, Improved compression vertex chain code based on Huffman coding, *Journal of Computer Applications*, Vol. 12, pp. 3565–3569, 2014.
- 6. E. H. Yang and C. Sun, Dithered soft decision quantization for baseline JPEG encoding and its joint optimization with Huffman coding and quantization table selection, *Asilomar Conference on Signals, Systems and Computers*, pp. 249–253, 2011.
- 7. K. S. Kasmeera, S. P. James and K. Sreekumar, Efficient compression of secured images using subservient data and Huffman coding, *Procedia Technology*, Vol. 25, pp. 60–67, 2016.
- 8. W. Wang and W. Zhang, Huffman coding based adaptive spatial modulation, *IEEE Transactions on Wireless Communications*, 2017.
- 9. S. J. Yun, M. R. Usman, M. A. Usman and S. Y. Shin, Swapped Huffman tree coding application for low-power wide-area network (LPWAN), *IEEE International Conference on Smart Green Technology in Electrical and Information Systems*, pp. 53–58, 2016. DOI: 10.1109/ICSGTEIS.2016.7885766.
- 10. R. Arshad, A. Saleem and D. Khan, Performance comparison of Huffman coding and double Huffman coding, *Sixth International Conference on Innovative Computing Technology*, pp. 361–364, 2016.
- 11. A. Vaish and M. Kumar, A new image compression technique using principal component analysis and Huffman coding, *International Conference on Parallel, Distributed and Grid Computing (PDGC)*, pp. 301–305, 2014.
- 12. M. Nelson and J.-L. Gailly, *The Data Compression Book*, 2nd ed., MIS Press, 1995.
- 13. J. Radhakrishnan, S. Sarayu, K. George Kurian, D. Alluri and R. Gandhiraj, Huffman coding and decoding using Android, *International Conference on Communication and Signal Processing (ICCSP)*, pp. 0361–0365, 2016.
- 14. R. Patel, V. Kumar, A. Tyagi and V. Asthana, A fast and improved image compression technique using Huffman coding, *International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)*, pp. 2283–2286, 2016.
- 15. K. S. Venkata, K. T. K. C. Rhishi, B. Karthikeyan, V. Vaithiyanathan and R. M. M. Anishin, A hybrid technique for quadrant based data hiding using Huffman coding, *International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS)*, pp. 1–6, 2015.
- 16. X. K. Liu, K. Chen and B. Li, Huffman coding and applications in compression for vector maps, *Applied Mechanics & Materials*, Vol. 333–335, pp. 718–722, 2014.
- 17. C. C. Chang, T. S. Nguyen and C. C. Lin, A novel compression scheme based on SMVQ and Huffman coding, *International Journal of Innovative Computing, Information & Control*, Vol. 10, No. 3, pp. 1041–1050, 2013.
- 18. L.-C. Petrini and V.-M. Ionescu, Implementation of the Huffman coding algorithm in Windows 10 IoT Core, *International Conference on Electronics, Computers and Artificial Intelligence (ECAI)*, pp. 1–6, 2016.
- 19. T. Kumaki, Y. Kuroda and T. Koide, CAM-based VLSI architecture for Huffman coding with real-time optimization of the code word table, *IEEE International Symposium on Circuits & Systems*, Vol. 5, pp. 5202–5205, 2005.
- 20. J. Wu, Y. Wang and L. Ding, Improving performance of network covert timing channel through Huffman coding, *Mathematical and Computer Modelling*, Vol. 55, No. 1–2, pp. 69–79, 2012.
- 21. Y. H. Lee, D. S. Kim and K. K. Hong, Class-dependent and differential Huffman coding of compressed feature parameters for distributed speech recognition, *IEEE International Conference on Acoustics*, pp. 4165–4168, 2009.
- 22. D. S. Kim and K. K. Hong, Voicing class dependent Huffman coding of compressed front-end feature vector for distributed speech recognition, *Second International Conference on Future Generation Communication and Networking Symposia (FGCNS)*, Vol. 3, pp. 51–54, 2008.
- 23. J. H. Pujar and L. M. Kadlaskar, A new lossless method of image compression and decompression using Huffman coding techniques, *Journal of Theoretical and Applied Information Technology*, Vol. 46, No. 1, pp. 11–16, 2012.
- 24. A. Vaish and M. Kumar, A new image compression technique using principal component analysis and Huffman coding, *International Conference on Parallel, Distributed and Grid Computing (PDGC)*, Vol. 1, pp. 301–305, 2014.
- 25. J. H. Jiang, S. C. Shie and W. D. Chung, A reversible image steganographic scheme based on SMVQ and Huffman coding, *International Conference on Connected Vehicles and Expo (ICCVE)*, pp. 486–487, 2013.
- 26. A. Kawabata, T. Koide and H. J. Mattausch, Optimization vector quantization by adaptive associative-memory-based codebook learning in combination with Huffman coding, *First International Conference on Networking and Computing (ICNC 2010)*, pp. 15–19, 2010.
- 27. J. Duda, K. Tahboub, N. J. Gadgil and E. J. Delp, The use of asymmetric numeral systems as an accurate replacement for Huffman coding, *Picture Coding Symposium*, pp. 65–69, 2015.
- 28. M. Hameed, Low power text compression for Huffman coding using Altera FPGA with power management controller, *1st International Scientific Conference of Engineering Sciences (ISCES)*, pp. 18–23, 2018.
- 29. G. C. Chang and Y. D. Lin, An efficient lossless ECG compression method using delta coding and optimal selective Huffman coding, *14th International Conference on Biomedical Engineering (ICBME) / 5th Asia Pacific Conference on Biomechanics (APBiomech)*, Vol. 31, pp. 1327–1330, 2010.
- 30. R. P. Jasmi, B. Perumal and M. P. Rajasekaran, Comparison of image compression techniques using Huffman coding, DWT and fractal algorithm, *International Conference on Computer Communication and Informatics (ICCCI)*, pp. 1–5, 2015.