1 Introduction

Global electricity consumption has continued to increase rapidly. Between 1990 and 2021, the world’s annual electricity consumption rose from 10,000 TWh to 25,100 TWh [1]. As technology spreads and the population grows, electricity consumption rises with them. Buildings are among the most important places for optimizing energy use based on actual occupancy. Holmin et al. [2] discuss electricity and cost reduction based on the layout and size of the Origo office space and make three proposals based on the ratio of occupancy to the number of available workspaces. An open office design proposal led to an annual electricity consumption reduction of 686 kWh, an activity-based workplace to 540 kWh, and an individual office design to 452 kWh [2]. In a white paper published by Dassault Systèmes, five instances were demonstrated where digital twin technology could create an economic benefit of \(\$1.3\) trillion and reduce \(\mathrm{CO}_2\) emissions by 7.5 gigatonnes between the years 2022 and 2023 [3]. This paper proposes a human-in-the-loop system that uses electricity intelligently by observing room occupancy and the corresponding power consumption in a digital twin of any office space. The proposed system will allow building management teams to make informed decisions and create plans to minimize consumption by

  • Estimating room occupancy while maintaining the privacy of individuals,

  • Calculating total energy consumption, along with individual equipment monitoring, using a legacy webcam, and

  • Visualizing the real estate with its inhabitants in 3D.

A VR digital twin of an office and laboratory space has been developed. In this context, a VR digital twin is a virtual replica of a physical environment or object, recreated in a virtual reality (VR) setting to simulate, analyze, or interact with real-world entities in a digital space. It allows immersive exploration and manipulation of physical elements within a virtual environment. With a legacy monocular webcam, the person detection algorithm attained an accuracy of \(96.04\%\), while the power consumption estimation algorithm has a true positive rate of \(91.58\%\). The model is already deployed in a British Telecom office space for measuring room occupancy and power consumption in real time.

This paper is structured as follows. Section 2 reviews the literature on digital twins, mapping between real and virtual worlds, and electric energy consumption using computer vision and machine learning techniques. Section 3 describes the proposed methods used for mapping and measuring energy consumption. Results are discussed in detail in Sect. 4, followed by a general discussion in Sect. 5 and the conclusion and future work in Sect. 6.

2 Literature survey

In tandem with efforts to enhance room occupancy and power consumption management, considerable research has concentrated on providing more convenient ways for occupants to directly interact with appliances. Some systems exhibit a subset of all appliances based on the user’s estimated location, yet struggle to differentiate between appliances of the same type. Alternatives incorporate various sensors and actuators to aid users in selecting appliances, some even providing feedback. However, these approaches often entail significant deployment overhead or prove less suitable for commercial building contexts. In the following sections, the related literature regarding digital twins, mapping between real and virtual worlds, and electric energy consumption using computer vision and machine learning techniques is described.

2.1 Digital twins

The inception of digital twin concepts traces back to NASA’s Apollo program [4], where they were initially deployed during live missions to replicate critical scenarios faced by the crew. The formal definition of a Digital Twin by NASA [5] in 2012 established it as an integrated, multifaceted simulation of an as-built vehicle or system. This simulation integrates various physics, scales, probabilities, sensor updates, and historical fleet data to faithfully mimic the life and functions of its real-life counterpart. In our context, digital twins (DTs) represent real-time, data-rich models that accurately mirror and synchronize with physical or logical assets, processes, or systems. Leveraging real-time data, these models forecast and optimize system behaviors in advance, enabling informed decision-making and leading to significant savings in time and resources [6]. Whereas a traditional model can only give a snapshot of behavior at a specific moment, a digital twin accurately describes change over time and has a bi-directional connection to its real-world counterpart via real-time data feedback [7]. Tao et al. [8] highlighted the state of the art in industrial DTs. Khajavi et al. [9] explored the use of a DT in a smart building scenario by replicating a part of its front facade. The facade was visualized by assigning different yellow shades to the respective lux values received from the sensor. Several commercial solutions have also emerged due to the diverse possibilities and benefits. One example is Azure Digital Twins (ADT) [10], a cloud-based service that supports DT deployment by providing it as a software-as-a-service solution. Steelcase, a company known for designing workspaces, developed a space-sensing sensor network using ADT [11]. By implementing a suite of wireless infrared sensors, they generated analytics on how their spaces were being utilized, which in turn was used to enhance reliability and efficiency.
ICONICS [12] also utilized ADT to create a virtual representation of a physical space to improve energy efficiency, optimize space usage, and lower costs. In a recently published paper, British Telecom explored the use of DTs in telecommunications for energy modelling, capacity management, in-building network design and knowledge transfer [6]. Mukhopadhyay et al. [13] created a virtual environment to generate datasets for training machine learning models and presented it as an alternative to the conventional dataset preparation that any supervised learning process requires.

2.2 Mapping between real and virtual worlds

Replicating real-world movement in a virtual space is not a straightforward problem. A few studies in the literature have addressed it using a raycasting technique [14, 15], a planar map [16], “ECEF” coordinates [17], GPS [18], or a wearable computer system [19]. Mukhopadhyay et al. [14, 15] calculated a direction vector to establish the direction of travel from the virtual camera to the centroid point, or to place the humanoid in VR. To obtain the direction vector, they replicated the real-world camera position in the virtual world. Sun et al. [16] proposed a system that matches the virtual and physical worlds using a planar map. They first computed a planar map between real and virtual floor plans to minimize angular and distal distortions. In Singapore, the ArtScience Museum and Google Zoo helped people experience the effects of deforestation using Augmented Reality (AR) [17]. They mapped between real and virtual world space using Google Tango, which can give the exact position and orientation as ECEF (Earth-centered, Earth-fixed) coordinates in WGS84, the reference system used by the US GPS system. They transformed the Unity world to sit on top of the ECEF coordinates so that the virtual world overlaps correctly with the real world. Hanke et al. [18] used real-to-virtual world mapping associated with a parallel reality game. Cheok et al. [19] presented two interactive games that use real-world and virtual-world mapping. They converted the real world to a fantasy virtual playground by ingraining the latter with direct physical correspondence.

2.3 Measurement of energy using intelligent techniques

Measurement of energy consumption by electrical appliances using computer vision techniques is challenging. The existing literature can be put into multiple subgroups based on their working principles.

  • Electric consumption prediction models:

    • Jiang et al. [20] proposed non-intrusive load monitoring using deep learning models for electric consumption prediction.

    • Olu-Ajayi et al. [21] compared machine learning algorithms for predicting annual energy consumption in residential buildings.

    • Rahman et al. [22] developed a deep recurrent neural network for mid-to-long-term electric load prediction.

    • Gao et al. [23] introduced deep learning models and a transfer learning framework to enhance energy consumption prediction for new buildings or equipment.

  • Electric component detection

    • Abeykoon et al. [24] compared classifier models to detect electric components based on their parameters.

    • Chui et al. [25] proposed a powerline noise transformation approach to merge electricity load disaggregation datasets.

  • Building energy performance prediction

    • Seyedzadeh et al. [26] reviewed machine learning approaches for forecasting and enhancing building energy performance.

    • García et al. [27] provided an extensive review of machine learning methods for estimating energy consumption.

  • Occupancy and equipment detection for energy saving

    • Tien et al. [28] utilized computer vision for occupancy and equipment detection to predict energy savings in buildings.

  • Building cooling load prediction

    • Kwok et al. [29] discussed a probabilistic entropy-based neural network model for predicting building cooling loads.

    • Yezioro et al. [30] used simulation tools and artificial neural networks to assess heating, cooling, and energy consumption in buildings.

To summarize, the literature on estimating electrical power and energy consumption with intelligent techniques primarily focuses on predicting electric consumption using machine learning and deep learning models, encompassing non-intrusive load monitoring, energy forecasting in buildings, component detection, and methods to estimate building cooling loads. Researchers have employed various techniques, including deep neural networks, transfer learning, and noise transformation, for accurate electric consumption prediction and enhancing building energy efficiency. Additionally, studies have utilized computer vision for occupancy detection and employed simulation tools alongside neural networks for energy assessment in buildings.

2.4 Summary

In summary, past literature has primarily focused on using DTs in industrial scenarios [8]. While there is literature on using twins for workspaces, only Nikolakis et al. [31] focus on mapping a person’s position and posture using expensive depth cameras. In this work, a cost-effective method is proposed for mapping the position of a person between the real and virtual worlds. For measuring the energy consumed, researchers have so far proposed predictive models, which require lengthy and costly data collection procedures. Similar work was carried out by Tien et al. [28], who proposed a computer vision-based approach to reduce building energy consumption by detecting room occupancy and equipment usage. However, their focus was only on detecting monitor screens to correlate occupancy and energy consumption. This overlooks other appliances such as lights and fans that also contribute to energy consumption. Additionally, their reported accuracy of 80% is lower than that of our proposed method (91.58%). Our approach addresses these issues by incorporating a variety of appliances commonly used in office spaces and logging data on both room occupancy and energy consumption. This allows for pervasive actions to be taken by the floor management team. The information is fused into a DT to maintain the privacy of individuals while obtaining an overall estimate of occupancy and energy consumption in any given place.

3 Research methodologies

The digital twin (DT) of the office space was developed using the Unity 3D [32] game engine and the Probuilder [33] modeling tool. This virtual twin accurately mirrored the dimensions of the physical space and replicated furniture and other elements. Baked global illumination enhanced the photorealism of the virtual environment (VE) by precomputing lighting behavior as texture files, reducing real-time computational demands. Additionally, physically based rendering (PBR) materials [34] were utilized to authentically simulate material properties and light reflection, achieving realistic visual effects. Figure 1 illustrates the envisaged deployment of the Digital Twin implementation, integrating real-time data from cameras and Internet of Things (IoT) sensors such as temperature and humidity detectors. The physical-virtual linkage, established through sockets, enables mapping of various environmental variables (e.g., temperature and humidity via a DHT-11 sensor), occupancy status, and energy consumption (captured by on-site low-cost webcams and analyzed using computer vision techniques). Detailed procedures for the mapping techniques and energy consumption estimation are described in the following subsections.

Fig. 1 Planned setup of the VR-based digital twin
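To make the socket-based physical-virtual linkage more concrete, the following is a minimal sketch of how one state update could be serialised before being sent to the Unity system. The message layout, the field names, and the helper names `encode_update` and `decode_update` are assumptions made for illustration; the paper does not specify the wire format.

```python
import json


def encode_update(occupancy: int, power_watts: float,
                  temperature_c: float, humidity_pct: float) -> bytes:
    """Serialise one update as a newline-terminated JSON message.

    The field names are hypothetical; only aggregate values are sent,
    never image data.
    """
    message = {
        "occupancy": occupancy,
        "power_watts": power_watts,
        "temperature_c": temperature_c,
        "humidity_pct": humidity_pct,
    }
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_update(raw: bytes) -> dict:
    """Parse a message produced by encode_update."""
    return json.loads(raw.decode("utf-8"))
```

On the sending side, the bytes returned by `encode_update` could be passed to `socket.sendall`, with the receiver splitting the stream on newlines before decoding each message.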

3.1 Person mapping

A transfer learning technique was utilized to fine-tune an object detection model for person detection by using the Open Image dataset, which comprises 2022 images annotated with persons. Here, object detection involves identifying and locating persons within an image and enclosing each in a bounding box for recognition and analysis. The model was trained, and performance testing was conducted, on an NVIDIA GeForce RTX 2070 GPU. The model’s performance was cross-validated on both synthetic and real-world images, with an overall accuracy of 96.04% (standard error 0.9) for real images and 96.98% (standard error 0.13) for synthetic images. A plethora of mapping techniques is available for reconstructing a real space using multi-view stereo (MVS) [35, 36]. One widely used technique is COLMAP [36], which is an end-to-end image-based 3D reconstruction pipeline. It employs MVS to compute depth and/or normal information for every pixel in an image, using the output of Structure-from-Motion (SfM) [36, 37]. By fusing the depth and normal maps of multiple images in 3D, a dense point cloud of the scene is generated. However, this technique requires a large number of images from different viewpoints with high visual overlap, making it slow when creating a representation of a real-world scenario at a specific moment in time. A machine learning approach is advantageous because it requires only a single image to recreate a real-world scenario. As a result, it is faster and can take into account real-time changes happening in the real world, accurately reflecting them in the virtual world. A linear regression model was employed to map persons between the real world and the virtual world by mapping 2D image coordinates to corresponding 3D points in the virtual world.
In this context, linear regression is a statistical method that models the relationship between variables by fitting a linear equation to observed data and predicts the value of the dependent variable from the independent variables. Because the Y coordinate is constant in virtual space irrespective of the avatar’s position, the mapping is ultimately performed between screen-space (x, y) and virtual-space (X, Z) coordinates. The regression functions are formulated as:

$$\begin{aligned} X = a_{1} x+ b_{1} y+ c_{1}, \end{aligned}$$
(1)
$$\begin{aligned} Z = a_{2} x+ b_{2} y+ c_{2}, \end{aligned}$$
(2)

where (X, Z) denotes virtual-space coordinates, (x, y) denotes screen-space coordinates, and \(a_{1}, b_{1}, c_{1}, a_{2}, b_{2}, c_{2}\) are constant terms. The 2D coordinates were obtained as the centroid of the bounding box of each detected person in real-world space, and the humanoids were placed in the corresponding locations in the digital twin of the real-world space to obtain the 3D coordinates. The real-world person coordinates and virtual-world avatar coordinates were found to be significantly correlated (coefficient of determination R\(^{2}\) \(= 0.93\)). Finally, Mixamo’s [38] motion capture data was used to automatically rig the humanoid’s armature (base skeleton rig) so that it reflects realistic human poses. Realistic humanoid poses that correspond to real-world person movement help in creating more accurate simulations and predictions for the digital twin.
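Equations (1) and (2) can be fitted jointly by ordinary least squares. The sketch below is one way to do this with NumPy, assuming paired samples of bounding-box centroids and manually placed avatar positions are available; the function names `fit_screen_to_virtual` and `map_to_virtual` are hypothetical and not taken from the original implementation.

```python
import numpy as np


def fit_screen_to_virtual(screen_xy, virtual_xz):
    """Fit X = a1*x + b1*y + c1 and Z = a2*x + b2*y + c2 by least squares.

    screen_xy:  N pairs of screen-space centroids (x, y)
    virtual_xz: N pairs of virtual-space positions (X, Z)
    Returns a (3, 2) array whose columns are (a1, b1, c1) and (a2, b2, c2).
    """
    screen_xy = np.asarray(screen_xy, dtype=float)
    virtual_xz = np.asarray(virtual_xz, dtype=float)
    # Append a column of ones so the constant terms c1 and c2 are fitted too.
    design = np.column_stack([screen_xy, np.ones(len(screen_xy))])
    coeffs, *_ = np.linalg.lstsq(design, virtual_xz, rcond=None)
    return coeffs


def map_to_virtual(coeffs, x, y):
    """Map one screen-space centroid to virtual-space (X, Z)."""
    X, Z = np.array([x, y, 1.0]) @ coeffs
    return float(X), float(Z)
```

At least three non-collinear calibration points are needed for the fit to be well determined; more points average out detection noise.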

3.2 Energy consumption measurement

The objective was to develop an end-to-end system that takes frames in real time and calculates the total wattage consumed in the given space. The process was divided into two parts: (I) training and validating an object detection model on an electric appliance dataset; (II) developing an image processing algorithm that operates on the detected objects to determine their states (ON/OFF). Figure 2 shows the end-to-end working of the proposed system.

Fig. 2 The working of the proposed “consumed electric energy” estimation system

Dataset preparation: A dataset was necessary to validate the object detection model, yet to our knowledge no existing dataset included indoor electronic/electrical appliances. Consequently, a custom dataset was prepared by capturing images within Indian household and office settings. People across various Indian states were asked to contribute images featuring at least two instances of specific target objects: fans, TV/monitors, tube lights, and various other light sources excluding tube lights. A total of 441 images were collected initially, with 70 images discarded because they contained only one object instance or were distorted. This resulted in a final set of 371 images, depicting 359 fans, 382 TV/monitor instances, 209 tube lights, and 783 other light sources. Additionally, 48 new images were captured in laboratory and office environments for model evaluation. The details of instances per class are outlined in Table 1. Manual annotation of electrical appliance instances was performed using the Computer Vision Annotation Tool (CVAT) [39], generating labeled data in XML format containing bounding box coordinates, which were later converted for YOLO model training.

Table 1 Overview of the electric appliance dataset
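The conversion from annotated corner coordinates to YOLO labels is a simple normalisation. The sketch below assumes each annotated box is given by its top-left and bottom-right pixel corners and that the YOLO label line uses the usual `class cx cy w h` layout with values normalised by the image size; the helper names and the class-index assignment are illustrative assumptions.

```python
def box_to_yolo(xtl, ytl, xbr, ybr, img_w, img_h):
    """Convert pixel corner coordinates to normalised (cx, cy, w, h)."""
    cx = (xtl + xbr) / 2.0 / img_w
    cy = (ytl + ybr) / 2.0 / img_h
    w = (xbr - xtl) / img_w
    h = (ybr - ytl) / img_h
    return cx, cy, w, h


def yolo_label_line(class_id, xtl, ytl, xbr, ybr, img_w, img_h):
    """Format one label line; class_id is a hypothetical class index."""
    cx, cy, w, h = box_to_yolo(xtl, ytl, xbr, ybr, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```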

Model description: Object detection methods vary, ranging from classical computer vision techniques to diverse deep learning models. Deep learning approaches typically fall into three primary categories. Two-stage models involve region proposal followed by classification within those proposed regions. Single-stage models perform object detection and classification simultaneously. Semantic segmentation-based models delineate and categorize each pixel within an image according to its class. After studying different types of state-of-the-art object detection models (two-stage, one-stage, segmentation-based), the YOLO model was selected for its accuracy and latency [40]. YOLOv5 was specifically employed for this implementation, trained within the PyTorch 1.7 environment. The YOLOv5 models range in size from 1.9M parameters (YOLOv5n) to 86.7M parameters (YOLOv5x). The YOLOv5s variant with 7.2M parameters was selected because of the modest size of the training dataset. The training involved 25 epochs with a batch size of 16, with the number of epochs determined through a grid search. An NVIDIA GeForce GTX 1080Ti was used for model training, while inference time measurements were conducted on an NVIDIA GeForce RTX 2070 with Max-Q Design.

ON/OFF detection: Once electric appliances were detected, the next step was recognizing their ON/OFF state. Detecting the ON/OFF status of an individual appliance is not a straightforward problem.

Different image processing techniques were applied to recognize these states. For instance, to detect the ON/OFF status of a tube light or light_source, the mean intensity of the detected region was calculated. The accuracy of such algorithms depends on external factors such as illumination and reflectivity [41, 42]. To mitigate this issue, images were captured at different times of the day (morning, afternoon, evening) under diverse lighting conditions, which helped in finding a global threshold for the mean intensity. A similar methodology was employed to determine the ON/OFF status of a TV/monitor.
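As an illustration of the mean-intensity test for lights, the following minimal sketch classifies a cropped detection region against a global threshold. The threshold value of 200 and the function name `light_is_on` are assumptions; the paper does not report the threshold that was actually used.

```python
import numpy as np


def light_is_on(region, threshold=200.0):
    """Return True when the mean intensity of the crop exceeds the threshold.

    region: 2D grayscale or 3D colour array cropped to the detected
            bounding box (0-255 value range assumed).
    threshold: hypothetical global threshold, to be tuned on images taken
               under different lighting conditions.
    """
    pixels = np.asarray(region, dtype=float)
    if pixels.ndim == 3:
        # Average the colour channels to obtain a single intensity per pixel.
        pixels = pixels.mean(axis=2)
    return bool(pixels.mean() > threshold)
```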

During testing, two issues emerged: (I) Difficulty in recognition when TV/monitor operates at lower brightness, and (II) False detection of ON state during daytime due to shadows. To address these, the screen was divided into five regions (center, top left, top right, bottom left, bottom right). Utilizing histogram analysis of R, G, and B channels with a fixed threshold, the algorithm determined if all regions were above or below the threshold, counting the number of regions above. A positive count indicated the TV/Monitor as ON; otherwise, it was deemed OFF (Fig. 3 illustrates this). Detecting the FAN state posed the greatest challenge, tackled by considering the visibility of its blades. When the FAN operates at full speed (ON), its blades are not visible, contrasting with their full visibility when OFF. A segmentation technique [43] was utilized to compute the blade visibility ratio in both states, establishing an optimal threshold for FAN detection. The operational principle of the fan detection algorithm is illustrated in Fig. 4.

Fig. 3 Visual interpretation of the pseudo code to detect whether TV/monitor is ON or OFF
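The five-region check for screens can be sketched as follows. This is a simplified stand-in under stated assumptions: the per-region histogram analysis of the R, G, and B channels is approximated by the mean value over all channels, the centre region is taken as the middle half of the crop, and the threshold of 60 is a hypothetical value.

```python
import numpy as np


def five_regions(screen):
    """Split a screen crop into four quadrants plus a centre patch."""
    h, w = screen.shape[:2]
    hh, hw = h // 2, w // 2
    return {
        "top_left": screen[:hh, :hw],
        "top_right": screen[:hh, hw:],
        "bottom_left": screen[hh:, :hw],
        "bottom_right": screen[hh:, hw:],
        "center": screen[h // 4: h - h // 4, w // 4: w - w // 4],
    }


def screen_is_on(screen, threshold=60.0):
    """Count regions brighter than the threshold; any positive count means ON.

    Returns (is_on, number_of_bright_regions).
    """
    screen = np.asarray(screen, dtype=float)
    bright = sum(int(region.mean() > threshold)
                 for region in five_regions(screen).values())
    return bright > 0, bright
```

Evaluating regions separately means that a dim screen with one bright area, or a screen partly covered by shadow, can still be classified from the regions that remain informative.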

Fig. 4 Visual interpretation of the pseudo code to detect whether the fan is ON or OFF. Pipeline to detect the status of a ON and b OFF. Fg and bg indicate the total number of foreground and background pixels
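The blade-visibility test can be sketched on top of a binary segmentation mask. The example below assumes the mask has already been produced by the segmentation step (not shown), defines the visibility ratio as fg / (fg + bg), and uses a hypothetical threshold of 0.2; the exact ratio definition and threshold used in the paper are not stated.

```python
import numpy as np


def blade_visibility_ratio(mask):
    """Fraction of pixels in the fan crop that are segmented as blade.

    mask: binary array, True/non-zero where blade pixels (foreground) are found.
    """
    mask = np.asarray(mask, dtype=bool)
    fg = int(mask.sum())          # foreground (blade) pixels
    bg = int(mask.size - fg)      # background pixels
    return fg / (fg + bg) if mask.size else 0.0


def fan_is_on(mask, threshold=0.2):
    """A spinning fan blurs its blades, so low visibility is read as ON."""
    return blade_visibility_ratio(mask) < threshold
```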

3.3 Embodiment of electrical component detection camera module

This module facilitates electrical appliance detection with two degrees of freedom for movement along two directions. A high-resolution (1080p) webcam, featuring a 55° field of view, is mounted atop the module, rotating at a speed of 1.6°/s with a 300 ms delay. It comprises two SG90 Servo motors and an Arduino Uno microcontroller. Calculations indicate that a 35° motor rotation covers the largest room area within approximately 22 s. Considering the frame rate of 30 frames per second of the camera, each frame requires 33 ms, with an additional 200 ms for processing. Factoring in GPU latency, network latency, and a 90 ms buffer, the total delay amounts to 300 ms. The proposed camera module is illustrated in Fig. 5. Using Fusion 360 [44], a digital model was created through CAD (Computer Aided Design), allowing iterative concept development and prototype design via 3D printing technology.
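As a quick check of the quoted sweep time and frame interval, assuming the \(35^{\circ}\) sweep is traversed at the constant speed of \(1.6^{\circ}\)/s:

$$\begin{aligned} t_{\text{sweep}} = \frac{35^{\circ}}{1.6^{\circ}/\text{s}} \approx 21.9\ \text{s}, \qquad t_{\text{frame}} = \frac{1}{30\ \text{s}^{-1}} \approx 33\ \text{ms}. \end{aligned}$$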

Fig. 5 Integrated angular camera module. a CAD model of the setup. b The physical setup

4 Validation

4.1 Experimental setup

The experimental setup incorporates a GPU system, legacy webcams, and Unity software, as depicted in Fig. 1. To ensure zero blind spots in the target office space, multiple Logitech HD 1080p webcams were strategically placed in the physical space and connected to a local GPU. Details about the GPU configuration are provided in Sect. 3.2. Additionally, another system is arranged for running the Unity software, as outlined in Sect. 3, to create the VR DT. Both the GPU and the Unity system are linked within a local network. The GPU executes the person detection and electric power consumption algorithms. The resulting information is transmitted to the Unity system, where it is received and presented through dials, as demonstrated in the Additional file 1: Video (https://youtu.be/0Gc833mJQlI).

4.2 Person mapping

Procedure: In Sect. 3.1, the linear regression-based mapping procedure was described. Two validation techniques were employed to assess the accuracy of the mapping algorithm. Initially, predictions were made concerning the position of humanoids corresponding to individuals in the real world. Additionally, a correlation between Euclidean distances in both the real and virtual worlds for individuals was reported. Figure 6 depicts the positioning of individuals in the real world and their corresponding representation as humanoids in the virtual realm. The humanoids are generated in VR to match the 2D position determined by the person detection algorithm. Distances between individuals, their real-world positions (generated by the person detection model), and the anticipated distance and position derived from the mapping algorithm have been labeled.

Analysis: A video lasting approximately 2 min was captured using a camera placed in a room. The video was processed by the person detection algorithm to derive the corresponding coordinates for each detected person. Subsequently, the Euclidean distances between persons were tabulated in both the real-world and VR contexts. The analysis examined correlations for the distances and coordinates produced by the CNN and the linear regression model. The findings revealed correlations of R\(^{2}\) \(= 0.85\) for coordinates and R\(^{2}\) \(= 0.5\) for distances. Figure 7 depicts the correlation graph for both coordinates and distances.

Fig. 6 Distance measured. a Distance in real world using CNN algorithm. b Corresponding avatar distance in VR

Fig. 7 Scatter plot. a CNN coordinates and predicted values in a virtual world through linear regression. b Distance plot coordinates in the virtual world
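The two quantities used in this validation, pairwise Euclidean distance and the coefficient of determination, can be computed as in the sketch below. The standard definition R\(^{2}\) = 1 − SS_res / SS_tot is assumed here; the paper does not state which formula was used, and the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.dist(p, q)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```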

4.3 Energy consumption measurement

In this section, accuracy of electric appliance detection model is reported followed by energy measurement accuracy by the proposed system.

Object detection model: The accuracy of the model was assessed using IoU (Intersection over Union), calculated as the ratio of the overlap area between the ground truth and predicted bounding boxes to the union area of these boxes. Precision, recall, and F1 score were reported for individual classes and for overall model performance. Latency measurements were conducted on the NVIDIA GeForce RTX 2070 with Max-Q Design. The model achieved a mean IoU of 0.32, an accuracy of \(65.61\%\), and a speed of 86.11 FPS. Performance details are summarized in Tables 2 and 3.

Table 2 Accuracy analysis of YOLOv5 (class wise)
Table 3 Accuracy analysis of YOLOv5 (overall)
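The IoU definition above translates directly into code. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```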

Energy estimation accuracy: Upon detecting electric appliances, the subsequent stage involves determining their ON/OFF status. A test was conducted using a total of 36 images captured at various times and locations. The total energy consumption of the identified devices was calculated, followed by the execution of the image processing algorithms to determine the ON or OFF status of these devices. Metrics such as true positive (TP), false positive (FP), and false negative (FN) were recorded, enabling the calculation of the true positive rate (TP/total energy consumed by the detected appliances) and false positive rate (FP/total energy consumed by the detected appliances) for individual appliances.

The considered electric appliances in this study possess standard power ratings: a fan at 53 W, TV/monitor at 15 W, Tubelight at 20 W, and Light_source at 18 W. Precision scores were utilized to gauge accuracy of the model in estimating energy consumption within a given area. The overall true positive rate for detecting the ON or OFF state of electrical appliances stands at \(91.58\%\), accompanied by an F1 score of \(81.96\%\). The performance of the system is illustrated in Fig. 8, and a comprehensive summary is presented in Table 4.

Fig. 8 Results of the proposed system on test images. a Lab space with fan, tube lights. b Lab space having TV/monitors

Table 4 Accuracy analysis of energy consumption estimation
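Once each detected appliance has an ON/OFF state, the total consumption follows from the standard power ratings listed above. The sketch below uses those ratings; the class label strings and the function name `total_power_w` are assumptions for illustration.

```python
# Rated power per appliance class in watts, as listed in the text.
# The label strings themselves are hypothetical.
RATED_POWER_W = {
    "fan": 53,
    "tv_monitor": 15,
    "tubelight": 20,
    "light_source": 18,
}


def total_power_w(detections):
    """Sum the rated power of every detected appliance that is ON.

    detections: iterable of (label, is_on) pairs, one per detected appliance.
    """
    return sum(RATED_POWER_W[label] for label, is_on in detections if is_on)
```

For example, two tube lights detected as ON and one fan detected as OFF give a total of 40 W.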

5 General discussion

5.1 Summary

The proposed solution employs computer vision and a VR setup to remotely monitor buildings and spaces, aiming to enhance operational efficiency and optimize space management and asset utilization while ensuring individuals’ privacy by design. A novel approach is introduced for measuring electrical energy consumption in any office space location using computer vision techniques with legacy webcams. Compared to traditional electrical meters, which require physical contact with the system being measured and are relatively costly (around USD 110), legacy webcams offer a more cost-effective alternative (around USD 28) [45]. Additionally, installing multiple traditional meters in small spaces like cubicles, meeting rooms, and cafeterias is infeasible, as it would require a dedicated current sense transformer on each circuit, and cubicles and meeting rooms are never wired as ‘one room, one circuit.’ More recently, smart plugs have been put forward as a way of saving energy [46]. They are cheap plug-and-play devices. However, the disadvantages of using smart plugs in office spaces include dependency on internet connectivity, potential disruptions, the ‘one electrical appliance, one smart plug’ approach, and privacy/security concerns. In comparison, our computer vision-based solution addresses these limitations, as it requires only cameras for estimating energy. Moreover, traditional meters do not provide insights into the energy consumption patterns of individual appliances or devices, making it challenging to identify energy-wasting equipment [47]. The proposed system adopts a Non-Intrusive Load Monitoring (NILM) method to monitor the electricity consumption of individual appliances, leveraging the cost-effectiveness and convenience of NILM methods [20, 47]. The system’s evaluation utilized the energy-intensive NVIDIA GTX 1080Ti, yet this can be alleviated by employing edge computing boards.
Our tests with the NVIDIA Jetson Nano found it highly efficient in running a YOLO model, consuming only 5–10 W of power, significantly less than even a tube light with a power rating of 20 W. In contrast to the privacy concerns associated with smart plugs, images are not transmitted from the edge computing devices; only the location information of persons is sent to the server system. Furthermore, a novel camera module is proposed to reduce the number of cameras needed for estimating power consumption.

Moreover, a practical method is introduced to measure room occupancy using computer vision techniques and the same cameras used for energy estimation. Recent attempts have explored estimating room occupancy by monitoring Wi-Fi probe signals from mobile phones for slip and fall risk assessment [48]. However, this method entails expensive Google Cloud services for tracking, unlike our cost-effective computer vision-based approach. Data visualization techniques were applied to integrate the data into a digital twin, elucidating the model’s functionality for both real-world and digital twin applications.

5.2 Accuracy of the proposed system

Our implementation achieved an R\(^{2}\) of 0.85 on human mapping. CNN-based object detection is rapidly evolving, with new models frequently appearing in the literature. Thus, the electric appliance detection accuracy (even with different factors) can be increased further using customized CNN models. However, our focus was not on developing a CNN for detection but on proposing a novel way of detecting the ON/OFF status of electrical appliances using computer vision techniques. Moreover, the human mapping algorithm currently works within the visual field of a webcam placed in a room, but a future version will implement 3D distance measurement such as Bertoni’s [49] Monocular 3D Localization algorithm.

5.3 Utility

The proposed VR prototype is deployed as a VR-based digital twin of an office space implementing real-time person occupancy, electric energy consumption and environmental variable monitoring capabilities through dashboards. This can be valuable for monitoring person occupancy and energy consumption measures in office spaces. Correspondingly, the floor management team can take pervasive action to control unnecessary energy consumption, which positively affects the environment. A second benefit could be that an observer could undertake a detailed remote virtual walk through the office space, which would not be possible with a standard multi-screen video from security cameras. Figure 9 depicts the actual and corresponding virtual space, demonstrating that individual identities remain undisclosed in the virtual representation. However, the floor management team can access detailed information, including occupancy, employee postures, total power consumption, and additional sensor data within the space.

Fig. 9

Example scene from the real world and the corresponding virtual world. a Position of a person; power consumption is zero because all electrical appliances are turned off. b Human posture along with occupancy details; power consumption has changed to 40 W because two tubelights are detected as ON

The environmental dimension of ESG (Environmental, Social, and Governance) focuses on minimizing the negative impact of business activities on the environment and promoting sustainability. This paper aims to create cost-effective environments by implementing remote room occupancy monitoring and automatic energy consumption detection. By accurately estimating energy consumption in shared spaces, the proposed system can facilitate the identification of high-energy-consuming areas or appliances. This information can be used to develop strategies to reduce energy consumption and subsequently lower the carbon emissions associated with the operation of offices and houses. By visualizing the energy consumption data, stakeholders can identify opportunities for energy optimization, implement energy-saving measures, and make more sustainable choices.
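
As an illustration of how detected ON/OFF states could translate into the power figure shown in Fig. 9, the following minimal Python sketch sums a fixed rating for every appliance detected as ON. It is not the deployed implementation. The names `ApplianceState`, `RATED_POWER_W`, `estimate_power_w`, and `power_by_kind` are hypothetical, and the ratings are assumptions: the 20 W tubelight value is only inferred from the Fig. 9 caption (two tubelights ON giving 40 W, under a uniform-rating assumption), while the fan and monitor values are placeholders.

```python
from dataclasses import dataclass

# Assumed uniform ratings in watts per appliance class (hypothetical values).
# Only the tubelight figure is inferred from Fig. 9; the others are placeholders.
RATED_POWER_W = {"tubelight": 20.0, "fan": 60.0, "monitor": 30.0}


@dataclass
class ApplianceState:
    kind: str    # appliance class reported by the detector
    is_on: bool  # ON/OFF status reported by the vision pipeline


def estimate_power_w(states, ratings=RATED_POWER_W):
    """Sum the assumed rating of every appliance detected as ON."""
    return sum(ratings[s.kind] for s in states if s.is_on)


def power_by_kind(states, ratings=RATED_POWER_W):
    """Break the estimate down per appliance class to expose heavy consumers."""
    totals = {}
    for s in states:
        if s.is_on:
            totals[s.kind] = totals.get(s.kind, 0.0) + ratings[s.kind]
    return totals


# Scene b of Fig. 9: two tubelights ON, everything else OFF.
scene_b = [
    ApplianceState("tubelight", True),
    ApplianceState("tubelight", True),
    ApplianceState("fan", False),
]
print(estimate_power_w(scene_b))  # 40.0
```

A per-class breakdown such as `power_by_kind` is one simple way to surface the high-energy-consuming appliances mentioned above; replacing the fixed table with measured ratings would address the uniform-rating assumption.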

5.4 Value addition

The global pandemic has reshaped the use of working space in offices dramatically due to the implementation of working from home (WFH) and indoor social distancing requirements. Now that people are returning to the offices, and there is a focus on hybrid working, current office designs must be re-evaluated to provide safer and healthier environments, optimize spaces, and reduce costs. How companies manage spaces in the emerging new normal is more important than ever.

Many machine learning-based energy estimation and prediction models have been proposed in recent years. However, most of these systems are data dependent: they need a vast amount of data to build a predictive model. Our paper proposes a new way of measuring the energy consumed in a space. Our implementation can be used as a plug-and-play system to estimate consumed energy for any space. The benefits are:

  • The conventional use of digital twins revolves around optimizing or simulating asset process life cycles or maintenance. An innovative application of digital twin technology allows the visualization of room occupancy and energy consumption within a workplace environment. This visualization tool significantly aids floor management in making informed decisions to enhance the overall environmental sustainability of the office space.

  • The digital twin implementation preserves the privacy of individuals. The human mapping algorithm provides real-time positioning in office spaces, using randomly generated avatars to maintain privacy-by-design and avoid disclosing identity or gender. While Tien et al. [28] discussed a simulation model for the working space, their work overlooked privacy, which is particularly crucial within a working environment.

  • An image processing pipeline enabled the detection of appliance ON/OFF states. Detecting the fan's ON/OFF state proved challenging with conventional techniques, which compelled us to devise a new solution. Similarly, an inventive approach was presented for detecting the ON/OFF state of the TV/monitor.
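
The privacy-by-design idea in the second benefit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `Detection`, `AvatarPlacement`, `AVATAR_POOL`, and `to_avatars` are hypothetical names, and the avatar identifiers are placeholders. The point it shows is that only position and posture cross into the digital twin, while the avatar is drawn at random and never derived from the person's appearance.

```python
import random
from dataclasses import dataclass

# Hypothetical pool of generic avatar assets available in the VR scene.
AVATAR_POOL = ["avatar_a", "avatar_b", "avatar_c", "avatar_d"]


@dataclass(frozen=True)
class Detection:
    x: float      # mapped floor position (room coordinates)
    y: float
    posture: str  # e.g. "sitting" or "standing"


@dataclass(frozen=True)
class AvatarPlacement:
    avatar: str
    x: float
    y: float
    posture: str


def to_avatars(detections, rng=None):
    """Turn detections into avatar placements without carrying identity.

    No image crop, face, or gender attribute is part of the input, and the
    avatar is chosen at random, so it reveals nothing about the person.
    """
    rng = rng or random.Random()
    return [
        AvatarPlacement(rng.choice(AVATAR_POOL), d.x, d.y, d.posture)
        for d in detections
    ]


placements = to_avatars([Detection(1.5, 2.0, "sitting"), Detection(4.0, 0.5, "standing")])
```

Because the detection record itself contains no identifying fields, the privacy guarantee in this sketch comes from the data structure rather than from a later filtering step.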

Occupancy detection and power consumption monitoring using machine vision can be extremely useful for HVAC (Heating, Ventilation, and Air Conditioning) regulation and optimization in buildings. Here the floor manager can walk through the office space in the digital twin and adjust the HVAC systems based on the occupancy level of an area together with temperature and humidity data, ensuring that conditions remain comfortable and conducive to productivity.
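
To make the HVAC use case concrete, the sketch below shows one possible rule-based helper that turns occupancy, temperature, and humidity readings into suggestions for the floor manager. It is purely illustrative: the function name `suggest_hvac_action` and the comfort bands (20 to 24 °C, 40 to 60% relative humidity) are assumptions introduced here, not values from this study, and the output is advisory so that the human remains in the loop.

```python
def suggest_hvac_action(occupancy, temp_c, humidity_pct,
                        comfort_temp=(20.0, 24.0),
                        comfort_humidity=(40.0, 60.0)):
    """Return advisory HVAC suggestions for one zone.

    The comfort bands are assumed defaults for illustration only.
    """
    # An empty zone does not need conditioning to comfort levels.
    if occupancy == 0:
        return ["setback: zone unoccupied"]

    suggestions = []
    low_t, high_t = comfort_temp
    if temp_c < low_t:
        suggestions.append("increase heating")
    elif temp_c > high_t:
        suggestions.append("increase cooling")

    low_h, high_h = comfort_humidity
    if humidity_pct < low_h:
        suggestions.append("humidify")
    elif humidity_pct > high_h:
        suggestions.append("dehumidify")

    return suggestions or ["hold: within comfort band"]


# Example: an occupied zone that is too warm but within the humidity band.
print(suggest_hvac_action(occupancy=5, temp_c=26.0, humidity_pct=50.0))
```

In practice the thresholds would be set per building and season, and the suggestions could be displayed on the digital twin dashboard next to the occupancy figures.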

5.5 Implication of the study

The findings have substantial implications for energy management and building design. Firstly, the ability to estimate room occupancy and energy consumption at varying levels of granularity, ranging from individual rooms to entire buildings, via a single camera's visual field surpasses the capabilities of standard energy meters. This granularity offers a detailed understanding of energy utilization patterns that was previously unattainable, providing valuable insights for optimizing energy usage across diverse spatial scales within buildings. Secondly, other research efforts leverage computer vision techniques alongside external data, such as latitude, longitude, and sunlight duration, to estimate solar energy generation and optimize energy utilization [50, 51]. Our study aligns with the Environmental dimension of ESG principles by aiming to mitigate adverse environmental impacts through remote room occupancy monitoring and automatic energy consumption detection. Moreover, combining person occupancy, energy consumption, solar power utilization, and building location data could pave the way for optimizing energy usage within buildings. This integration of interior and exterior observations might foster an intelligent energy management system capable of dynamically adapting to environmental conditions and human presence.

6 Conclusion and future work

This paper introduces an innovative end-to-end system designed to estimate energy consumption in various settings. The system predicts energy usage with a reported average accuracy of \(91.58\%\). Additionally, a person occupancy method based on a human mapping algorithm is proposed. The information is transmitted to a 3D digital twin of the office space, allowing the floor manager to visualize occupancy and energy consumption in real time, aiding strategic planning.

However, limitations include the assumption of uniform energy ratings for specific electrical equipment such as TVs/monitors and various light sources. Efforts are focused on addressing these issues by collecting more diverse data to enhance the accuracy of the electrical appliance detection model. Future plans involve expanding the dataset with images from different countries and employing k-fold cross-validation to validate the techniques across diverse regions, aiming to improve precision.

As previously mentioned, the deployed system covers the British Telecom office area, accommodating up to 150 employees. Initial challenges involved determining optimal camera positions to prevent overlapping fields of view and addressing cable connectivity issues between the cameras and the base system. Subsequently, using USB-3 cables notably improved the performance of the person and electric appliance detection algorithms. Additionally, legacy webcams and a privacy-by-design approach proved beneficial for cost-effective real-time remote asset monitoring.

To address scalability concerns, deployment in larger spaces should reassess camera placements and their coverage areas to accommodate more employees efficiently. Introducing a networked setup or alternative connectivity solutions beyond cables might resolve integration challenges, allowing for seamless camera-to-system connections. Upgrading to USB-3 cables enhances the algorithms' performance, while cost-effective devices such as legacy webcams allow scaling without substantial expense. Incorporating a privacy-by-design approach in device selection and deployment remains crucial for real-time remote asset monitoring, ensuring compliance with privacy regulations while maintaining system efficacy.

Furthermore, the model underwent testing in various lab settings, with all data transmitted to the digital twin of a real-world space (refer to: https://youtu.be/0Gc833mJQlI).