The smart city concept is now being applied across the world [1,2,3,4]. Although it is typically implemented at city scale, it can also be adapted naturally to a more granular context such as a building [5]. At this scale, the concept is called a smart building. Implementing it promises more effective and efficient building management. Unfortunately, it also requires considerable cost for procuring various types of Internet of Things (IoT) devices. Therefore, typical smart building prototypes use only closed-circuit television (CCTV) cameras as IoT devices, since these are usually already available in the building.

Relying only on CCTV poses a significant challenge for a smart building. For energy management, the straightforward implementation uses heat sensors to detect the activity level in a room, which can then be used to adjust the power usage of the electric devices in that room. If CCTV cameras are the only available IoT devices, a robust intelligent system based on computer vision is needed instead.

To build such a robust computer vision system, a deep learning algorithm needs to be embedded in it. Deep learning has proven to perform strongly in computer vision tasks such as image classification [6,7,8,9,10,11], object detection [12,13,14,15], and crowd counting [16,17,18,19,20,21,22,23]. Deep learning is also applicable to the analysis of CCTV data, which arrive as a large stream from which other machine learning models have difficulty extracting valuable information. However, deep learning requires a large dataset to perform reliably. Because such a dataset is not available for every problem, training a deep learning model from scratch is often impractical. To overcome this challenge, transfer learning has been broadly applied in the development of deep learning models (cite). This study introduces a transfer learning scheme that can be used to develop an intelligent system for smart building management. We focus on an intelligent system that counts the humans in a room, which can be employed to adjust appliances and optimize energy usage. In addition, we collected and shared a dataset that can be used in the proposed transfer learning scheme.

Literature study

Computer vision is currently advancing remarkably fast. This growth was initiated by the use of deep learning in the ImageNet Large Scale Visual Recognition Challenge [24, 25]. At first glance, the impressive performance of deep learning seems to be the main cause of this growth. However, the huge size of the ImageNet dataset also contributes significantly to that performance. ImageNet has about 1.2 million labeled images, which makes it one of the largest computer vision datasets. Only after being trained on ImageNet did deep learning finally show its extraordinary performance [6]. Before that, the deep learning model in question, the convolutional neural network (CNN), had not been a first choice for computer vision research since its invention in 1989 [26].

Unfortunately, a massive dataset such as ImageNet takes laborious effort to collect. As a consequence, collecting one is impractical for the many problems that have no large dataset available. To cope with this, recent research that utilizes deep learning employs a concept called transfer learning. Transfer learning means using a model previously trained on data from one task as the base for developing a new model for another task. It makes it possible for a deep learning model that has been pretrained on a large dataset to learn from a relatively small dataset. The use of this concept in deep learning was initiated by Girshick et al. [27], who utilized a CNN model pretrained on ImageNet to develop a model for object detection. In the following year, Yosinski et al. [28] exhaustively studied and demonstrated the benefit of transfer learning for deep learning models. Since then, using an ImageNet-pretrained model has been standard in many computer vision problems. Even after the development of a large dataset for object detection [29], transfer learning is still widely adopted for that problem.

The benefit of transfer learning is most apparent in crowd counting, one of the most extensively studied computer vision problems. The most popular crowd counting dataset, ShanghaiTech [30], consists of only 1198 images. Another popular dataset, WorldExpo’10 [31], contains only 3980 images. The smallest crowd counting dataset, UCF_CC_50 [32], contains only 50 images. Despite that, the performance of crowd counting models has improved consistently and quickly since deep learning was first used in 2013. This fast advancement was made possible by the extensive use of transfer learning. Consequently, the state-of-the-art crowd counting models of the last 6 years have always been variants of deep learning. Following this trend, Wang et al. even developed a large simulated dataset for pretraining in crowd counting [33]. The dataset, named GTA Crowd Counting (GCC), was generated with the Grand Theft Auto (GTA) V game and contains 15,212 synthetic crowd images.

Transfer learning scheme for intelligent human counting system

To give a comprehensive picture, we depict the whole intelligent system framework in Fig. 1. The proposed transfer learning scheme is the part of the framework highlighted in green. The scheme starts by acquiring a deep learning model that has been pretrained on the ImageNet dataset [24, 25]. To convert the pretrained model into an intelligent human counting system, the model needs to be trained on a dataset crafted for the human counting task. Therefore, we collected the required dataset, which we call the RHC (Room Human Counting) dataset. After training, the intelligent human counting system is ready to process video streams from a CCTV camera and output the human count. It is worth noting that the CCTV stream feeds a massive amount of data into the intelligent system. For the system to run in real time, it needs to be implemented with proper Big Data technology. Therefore, the intelligent system should be developed with deep learning libraries that can run on Apache Spark. Based on a recent survey [34], TensorFlow [35] and Caffe [36] are excellent options, as both libraries are supported by most deep learning frameworks for Apache Spark. Finally, the predicted human count is mapped to an appliance adjustment setting in a control system.
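As a hedged illustration of the final step of the framework, the sketch below maps a predicted human count to an appliance setting. The function name `appliance_setting`, the thresholds, and the setting names are all hypothetical; the control system's actual mapping is not specified here.

```python
def appliance_setting(predicted_count: float) -> str:
    """Map a predicted human count to a hypothetical appliance setting."""
    # The counting model outputs a real number, so round it to the
    # nearest integer and clamp negative predictions to zero.
    count = max(0, round(predicted_count))

    # Hypothetical thresholds; a real control system would define its own.
    if count == 0:
        return "off"
    if count <= 4:
        return "low"
    if count <= 9:
        return "medium"
    return "high"


for prediction in (-0.3, 2.84, 7.23, 11.0):
    print(prediction, appliance_setting(prediction))
```

Rounding and clamping are included because a regression output is not guaranteed to be a non-negative integer.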

Fig. 1 The proposed transfer learning scheme in the intelligent human counting system for smart building management

Dataset collection

The images of the RHC dataset were extracted from videos captured by a CCTV camera in the NVIDIA-BINUS AI R&D Center room. The dataset was deliberately collected from one room only, which challenges future AI models to learn from a single room. This is necessary for developing a system that can adapt to the different CCTV specifications of different rooms. If a model can learn robustly from this dataset, then it can easily be retrained on videos with a different resolution from a different room, as long as the resolution within the new dataset is homogeneous.

The videos in this dataset have a resolution of 640 × 360 pixels and a frame rate of 20 frames/s. The dataset uses 44 videos with a total duration of 206 h 24 min 23 s. Figure 2 shows sample images from the dataset.
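To give a sense of scale, the following sketch converts the reported total duration into an approximate number of raw frames. It assumes that every video was recorded at the stated constant rate of 20 frames/s; the helper name `total_frames` is ours.

```python
FRAME_RATE = 20  # frames per second, as reported for the RHC videos


def total_frames(hours: int, minutes: int, seconds: int, fps: int = FRAME_RATE) -> int:
    """Convert a duration into a frame count at a constant frame rate."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds * fps


# Total duration of the 44 videos: 206 h 24 min 23 s.
print(total_frames(206, 24, 23))  # 14861260
```

Under that assumption, the raw footage amounts to roughly 14.9 million frames.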

Fig. 2 Sample images in RHC dataset

Dataset annotation

Annotating a huge amount of data manually is laborious and therefore usually infeasible. One way to annotate a massive dataset is to develop an information system specially crafted for the annotation task [37]. Therefore, we built an information system to ease the annotation process. This system takes the videos from the previous acquisition process and displays them for annotation. The annotation system is described in detail by Pardamean et al. [38]. In this system, the annotator decided which frames from the videos to annotate, resulting in 1217 annotated images.

Each image is annotated with its total human count. Unlike what is typically done in crowd counting research, we do not annotate the location of each human. Training a deep learning model with locations introduces unnecessary complexity, as location information is not needed for controlling appliance usage in a room. Giving the model the capability to localize humans would also reduce the speed of the system, which is vital for processing a CCTV stream in real time.

Dataset statistics

The human counts in the RHC dataset range from 0 to 13, with the distribution shown in Fig. 3. The mean human count is 4.1249, with a standard deviation of 2.6206. The distribution is clearly not uniform. Thus, the dataset can be considered imbalanced, which typically calls for special treatment before a machine learning model can learn well from it.

Fig. 3 Data count distribution

Following the typical machine learning training procedure, we split the dataset into three sets: training, validation, and test. The split was done randomly, stratified by human count, with a training:validation:test ratio of 60:20:20. The resulting distribution is shown in Table 1.
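The stratified random split can be sketched as follows. This is a minimal standard-library illustration, not the code used in the study: the function name `stratified_split`, the fixed seed, the rounding rule for each count label, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists, stratified by label."""
    rng = random.Random(seed)

    # Group image indices by their human count label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Split each label group separately so every set keeps
        # approximately the same count distribution.
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test


# Toy labels (hypothetical, not the RHC annotations): counts 0..3, 10 images each.
toy_labels = [count for count in range(4) for _ in range(10)]
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 24 8 8
```

Splitting within each count label, rather than over the whole dataset at once, is what keeps rare counts represented in all three sets.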

Table 1 Data distribution

To understand whether the current size of the RHC dataset is sufficient for transfer learning, we compared it with public crowd counting datasets. These are the datasets most similar to ours, as they are also used for counting humans. However, crowd counting differs from our case in that the images contain a huge number of humans in outdoor settings. Crowd counting datasets are typically much smaller than those of other popular computer vision tasks such as image classification and object detection. Consequently, research in crowd counting usually utilizes transfer learning. The crowd counting datasets are therefore suitable for comparison with the RHC dataset. Table 2 lists popular crowd counting datasets and the RHC dataset together with their sizes. We omitted the GCC dataset from the list since it is synthetic and typically used only in the pretraining phase of a transfer learning scheme. In this comparison, the RHC dataset is the third largest among the popular crowd counting datasets, from which we infer that its size should be enough for deep learning.

Table 2 Comparison of datasets size

Possible challenges

We identified six possible challenges that must be solved for successful model training on the RHC dataset. The first challenge is whether the trained model can count persons whose hair is covered. We see this as a challenge because most persons in the dataset have their hair uncovered. The second challenge is whether the model can correctly count humans with overlapping heads. This challenge is common in crowd counting, where the number of humans captured in an image is massive. In our dataset, a small portion of the images has overlapping heads, mostly images with a large actual count.

The third challenge is teaching the model to exclude humans outside the room when predicting the count. The room in this dataset has a transparent glass wall on the left side, through which the outside can be clearly seen. Therefore, to produce a correct count, the model needs to exclude the persons outside the room. The glass wall also causes the fourth challenge. When the outside of the room is darker, the wall turns into a mirror that reflects the persons inside the room. The model should be able to differentiate between actual persons and their reflections. The fifth challenge is related to the lighting of the room. Part of the room can sometimes be darker, for example during a presentation session. Therefore, the model should be robust to different lighting conditions in the room.

The last possible challenge we identified concerns the distribution of the dataset. As shown in Fig. 3, the dataset is not balanced across all possible counts. The further a count label is from the mean, the fewer images it has. This condition generally leads to poor performance on the labels with fewer images. It is called the imbalanced data problem and is known to degrade the performance of machine learning models, including deep learning models [39,40,41]. For counting, one possible solution is to create a model that is capable of extrapolating its count prediction to count labels with fewer data.


We conducted an experiment to measure the performance of the developed intelligent human counting system. In the experiment, we considered five popular CNN architectures as the pretrained model: AlexNet [6], VGGNet [7], GoogLeNet [8], ResNet [9], and DenseNet [11]. To enable all models to learn from the RHC dataset, we replaced their prediction layers with a fully connected layer consisting of one neuron. This layer outputs a single number as the predicted human count. Because the input image size of these networks is 224 × 224, we resized the dataset images to that size before feeding them to the networks. All models were trained using the Adam optimization algorithm [42] with a learning rate of 0.001. The performance of each model is measured by the mean squared error (MSE) between the predicted count and the actual count.
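For reference, the evaluation metric can be written as a short function. This is a plain-Python sketch of the standard MSE definition; the function name and the example values are hypothetical and are not results from the experiment.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predictions for three images with actual counts 2, 4, and 9.
example_predictions = [2.0, 4.5, 7.0]
example_actual = [2, 4, 9]
print(mean_squared_error(example_predictions, example_actual))
```

Because the errors are squared, a single image that is off by two persons contributes sixteen times as much as an image that is off by half a person.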

Results and discussions

Quantitative analysis

Table 3 lists the MSE of all models on the test split of the RHC dataset. The best MSE is achieved by AlexNet, which has the smallest number of layers. We can see a trend: the more layers a model has, the worse (higher) its MSE. We suspect that this is caused by overfitting in the more complex models.

Table 3 Model performance on test split

To check our assumption of overfitting, we tabulate the MSE for each actual count in Table 4 and plot it in Fig. 4. The complex models tend to perform worse on the actual counts that have less training data. Thus, we can confirm that the poor performance of the complex models is caused by overfitting.
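A per-count breakdown of this kind can be computed by grouping the squared errors by actual count. The sketch below is an illustration with hypothetical predictions; the function name `mse_per_count` is ours and the numbers are not taken from Table 4.

```python
from collections import defaultdict


def mse_per_count(predicted, actual):
    """Return {actual_count: MSE over the images with that actual count}."""
    squared_errors = defaultdict(list)
    for p, a in zip(predicted, actual):
        squared_errors[a].append((p - a) ** 2)
    return {
        count: sum(errors) / len(errors)
        for count, errors in sorted(squared_errors.items())
    }


# Hypothetical test images: two with 1 person, two with 5, one with 11.
example_actual = [1, 1, 5, 5, 11]
example_predictions = [2.0, 1.0, 5.0, 6.0, 8.0]
print(mse_per_count(example_predictions, example_actual))
# {1: 0.5, 5: 0.5, 11: 9.0}
```

Reporting the error per label, instead of a single average, makes it visible when rare counts are predicted much worse than common ones.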

Table 4 Test MSE of all models for each actual count

Fig. 4 MSE versus count label plots for all models: (a) AlexNet, (b) VGG16, (c) GoogLeNet, (d) ResNet18, (e) ResNet50, (f) ResNet101, (g) ResNet152, (h) DenseNet121

Qualitative analysis

We picked several cases that correspond to the challenges identified earlier. These cases are tabulated in Table 5. Image (a) in the table shows a person wearing a veil. Because the hair of each person is visible in most of the training data, we suspected that the models might be unable to count a person whose hair is covered. However, that does not seem to be the case: the models predict 5.77 persons on average for this image, while the actual count is 5 persons. The problem is instead the failure of the models to exclude a person who is actually outside the room. The average prediction approaches 6, which indicates that the models tend to count one extra person, most likely the person near the leftmost person in the room. This failure is supported by another image with a similar case, shown in (b).

Table 5 Cases with possible challenge

Although the models seem unable to exclude a human outside the room, this is not the case when there is more than one person outside. As seen in (c), the models do not suffer from over-counting caused by outside persons. This is supported by the similar predictions of all models for image (d), in which the outside of the room is relatively clear.

In fact, all models instead under-count images (c) and (d). The average count prediction is 7.23 persons, compared to an actual count of 11 persons. This under-counting might be caused by several persons with overlapping heads, as seen in image (c). This problem is also a possible cause of the poor performance of all models in Table 4 for large counts. However, the under-counting does not appear in images with fewer humans, such as image (e). In this image, there are 2 persons with overlapping heads. The average predicted count for this image is 2.83 persons, close to the actual count of 3 persons. Therefore, the models are able to handle this case without notable problems.

In addition to images with a large human count, we also checked the opposite extreme: images with a small human count. The average prediction for image (f), which contains only 1 person, is 2.84. This indicates that the models are over-counting. However, the over-counting in this case does not seem to be caused by the persons outside the room, as more than 2 persons are clearly visible outside. Thus, we suspect that the over-counting is caused by the imbalanced data instead.

Trained on the RHC dataset, all models seem to be robust to different lighting. For instance, image (g) is slightly darker than most images in the RHC dataset. Nevertheless, the performance of all models is still reliable, with a slight under-counting that might be caused by overlapping heads instead. The models are also robust to the case where the outside of the room is dark, which makes the glass separating the inside and outside reflective. An example of this case is provided in image (h). None of the models suffers from over-counting caused by the reflections of the persons in the room.
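The direction of the errors discussed in this section can be summarized as a signed error, computed here from the average predictions and actual counts quoted in the text. The function name and the dictionary layout are ours; the key "c/d" reflects that the text reports one average for images (c) and (d) without distinguishing them.

```python
# (average prediction over all models, actual count), as quoted in the text.
reported = {
    "a": (5.77, 5),
    "c/d": (7.23, 11),
    "e": (2.83, 3),
    "f": (2.84, 1),
}


def signed_error(average_prediction, actual_count):
    """Positive means over-counting, negative means under-counting."""
    return round(average_prediction - actual_count, 2)


for image, (average_prediction, actual_count) in reported.items():
    print(image, signed_error(average_prediction, actual_count))
```

Unlike the MSE, the signed error keeps the sign, so over-counting at small counts and under-counting at large counts do not look the same.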

Conclusion and future works

In this paper, we showed that transfer learning can be used to develop an intelligent human counting system, which can be utilized for energy optimization in smart building management. To enable this development, we collected the RHC dataset, which is used to train a pretrained deep learning model to count the humans in a room. The results of this study show that AlexNet is the best pretrained model for the proposed transfer learning scheme. However, the size of the dataset seems insufficient for training networks more complex than AlexNet. This indicates that the dataset should be extended with more data in the future. Additionally, it would be interesting to extend the dataset with annotations of the coordinates of each human. We believe that this additional annotation can help complex models improve their performance.