IoT-oriented high-efficient anti-malware hardware focusing on time series metadata extractable from inside a processor core

We aim to improve the efficiency of our previously proposed anti-malware hardware; it is a hardware-implemented malware detection mechanism that uses information inside the processor. We previously evaluated a prototype, but, due to its prototypical nature, there remain limitations, such as only detecting certain behaviors, high power consumption, and a tendency to bloat the training model. In this paper, we propose a circuit and a learning method to achieve high efficiency, low power consumption, and light weight for the model. In considering these three issues, we focus on time-series metadata obtained by transforming the processor information. To improve efficiency, we implement predictive detection to predict the behavior of metadata in the malware detection component. This lets the model detect malware within less than 19% of the number of execution cycles of the conventional method. To reduce power consumption, we implement a sampling circuit that interrupts the input to the detection circuit at regular intervals, reducing the system’s uptime by 99% while maintaining judgment accuracy. Finally, for a light weight, we focus on the training process of the metadata generator based on a machine-learning model. By applying sampling learning and feature dimensionality reduction in the training process, a metadata generator approximately 16% smaller than the previous version is created.


Introduction
The recent spread of Internet of Things (IoT) technology has enabled various devices to be operated via networks. The majority of IoT device users are likely using multiple devices simultaneously, and the number of IoT devices connected to the Internet was estimated to reach 25 billion by the end of 2020 [1,2]. However, many IoT devices remain vulnerable to security risks [3] and there has been an increasing number of attacks targeting them. Mirai, a malware that takes over IoT devices and executes distributed denial-of-service (DDoS) attacks on other servers through command-and-control servers, emerged in 2016 [4]. One prominent target of Mirai was the Domain Name System server "Dyn" [4]. With the release of Mirai's source code, many malware variants have been created, including some that destroy device data [5].
Anti-malware software covers a range of software-based security measures. For example, the signature method detects malware-specific code and hash values by pre-listing them, calculating the hash values from target files, and comparing them to the list. Behavioral methods focus on program behavior and dynamically determine malicious behavior by obtaining statistics on the type and order of application program interface (API) calls made by a program and comparing them to pre-listed malware statistics [6][7][8]. Heuristic methods apply machine learning and data mining to detect malware using a discriminator that dynamically collects behaviors and uses them as features [9]. However, these methods are difficult to implement in resource-constrained IoT devices because they are software-based and run in the background, constantly using memory and power.
Therefore, we proposed a malware detection mechanism for IoT devices that is implemented as dedicated hardware on the large-scale integration (LSI) (hereafter called anti-malware hardware) [10]. Unlike software-implemented mechanisms, the hardware operates independently of the core, reducing the core resource consumption associated with malware detection to zero. The anti-malware hardware detects malware by using, as features, information obtained directly from the core on the LSI as electrical signals (hereafter called processor information).
However, our previously proposed detection mechanism has four problems: (1) it cannot correctly detect a program that behaves differently than expected, (2) it can only detect malware after executing at least a certain number of cycles, (3) it has high power consumption because it runs every cycle, and (4) the hardware implementation size is likely to be huge.
In this paper, we implement and evaluate the proposed methods to address these issues. We consider meta-information (metadata) that can be extracted from processor information. These metadata are obtained by labeling processor information as benign or malicious with a machine-learning (ML) model, which transforms the information into a form that can be interpreted over time as the behavior of the application. Significant differences between normal programs and malware emerge in metadata. We take advantage of this by applying sampling circuits and predictive detection to capture behaviors that are not identified by conventional methods. Predictive detection speculates on metadata features and thereby limits the number of execution cycles required for malware detection. The sampling circuit reduces power consumption by efficiently thinning, or sampling, the processor information required for judgment. This sampling can also be leveraged in the learning process to reduce the size of the ML model implementation.
The rest of this paper is composed as follows. Section 2 reviews related works. Section 3 shows differences from previously published papers. Section 4 presents the concepts and problems of the proposed mechanism. Section 5 describes the proposed methods. Section 6 provides the evaluation method and its results, which are discussed in Sect. 8. Section 7 evaluates the hardware performance. Section 9 concludes the paper.

Related works
Recent CPUs are equipped with a mechanism called a hardware performance counter (HPC): Intel Performance Counter Monitor (PCM), Intel Processor Trace (Intel PT), ARM Streamline, etc. The HPC monitors performance events inside processors. For example, Intel PCM counts instructions retired, cache hits and misses, and others. The information can be used to help performance analysis or tuning of application software. Many researchers have focused on processor-level information and proposed cyber security measures utilizing HPC [11]. Among them, the following studies use machine learning and HPC to detect attack behavior.
Torres et al. proposed a software method to detect data-oriented attacks using HPC [12]. A data-oriented attack is an attack that causes unintended behavior by rewriting the data in the memory space. Their proposed method generated a machine learning model using the data obtained from the HPC under normal conditions and under abnormal conditions in which a data-oriented attack is being executed. They evaluated two-class classification using a support vector machine (SVM) model and showed 92% detection accuracy.
Bahador et al. proposed a malware detection method that combines HPC, singular value decomposition, and machine learning [13]. This detection method, called HPCMalHunter, shares similarities with our method in that it uses processor information such as cache hit rate and uses machine learning to detect malware, but differs from our hardware-based mechanism in that it is implemented solely in software. They evaluated HPCMalHunter with some malware and benign programs. The evaluation results showed that the correct answer rate was 90.69% and the false positive rate was 0.79%, resulting in a high detection rate and a low false positive rate. However, it means that there still exist at least some undetected malware and falsely detected benign programs.
Junaid et al. proposed a software method to avoid side-channel attacks by using information obtained from HPC [14]. There are some situations where a side-channel attack can be established. They focused on the situation in which a side-channel attack occurs when two different applications share the same functional unit in a processor. In the proposed method, a neural network detects and predicts the execution phase of each application based on the hardware events that the OS scheduler obtains from the HPC. Based on the detected and predicted phases, the scheduler allocates applications so that they do not share the same functional unit at the same time. The software method proposed by Manaar et al. also uses hardware information to detect side-channel attacks [15]. This work is similar to that of Junaid et al. in that it focuses on side-channel attacks and uses performance counters, but differs in that it focuses on the correlation between the process of the attack program and the process of the Clefia encryption program.
Vinayaka et al. proposed a software-based framework called BRAIN, which adds HPC information as features to conventional DDoS detection models that use network and application statistics [16]. The conventional model that uses network and application statistics as input to the SVM showed a correct answer rate of 99.32%. In contrast, the BRAIN framework succeeded in improving the correct answer rate to 99.8% while keeping the false positive rate at 0%.
Liu et al. proposed a security measure that uses Intel Processor Trace (IPT) instead of HPC to enhance the integrity of the application control flow [17]. IPT is a software-based function that collects processor execution traces and provides millions of lines of instruction history to the OS with low overhead.
The above studies used processor information and machine learning algorithms, as does our work. However, their cyber security measures are realized in software, in contrast to our anti-malware hardware (AMH), which is a hardware mechanism on the LSI. Therefore, these software-based methods are difficult to introduce in resource-constrained IoT devices. In addition, they used dedicated tools or instructions to extract processor information and directly used the output of the classifier generated by machine learning for detection, whereas our AMH employs wired logic to capture processor information directly and treats it as time-series data, as described in Sect. 4.4. As a result, our method accelerates malware detection while maintaining high accuracy.
By offloading certain processes to dedicated hardware, and thereby freeing central processing unit (CPU) resources for other applications, studies have reported reduced load on a CPU and increased processing speed. Ding et al. implemented a hardware-based transmission control protocol (TCP) offload engine that addresses the problem of traditional protocol stacks implemented in software [18].
ARM, a company heavily involved with processor core intellectual property (IP) for IoT devices, proposed edge device artificial intelligence (AI). One of its products, Ethos-U55, is a neural network processing unit (NPU) that specializes in machine-learning-based inference algorithms [19]. Combined with ARM's processor core, Cortex-M55, the processing performance of the inference algorithm was up to 480 times that of the Cortex-M33. As it requires less power than a general-purpose processor to perform similar processing, it has immense benefits for microcomputers and IoT devices.
Additionally, manufacturers have implemented security measures that are achieved through a combination of hardware and software. For example, Intel implements a feature called Intel Threat Detection Technology (Intel TDT) into some of its CPUs [20,21]. Intel TDT sends a hardware-based signal that indicates the run-time behavior at the CPU level. The signal can be used to help software threat detection agents that identify polymorphic malware, fileless scripts, cryptomining, ransomware, and other targeted attacks.
In November 2020, Microsoft announced Pluton, which they developed in collaboration with AMD, Intel, and Qualcomm Technologies, Inc. [22]. Pluton is implemented as part of the LSI and can protect authentication information, encryption keys, and personal data. The advantages of implementing security hardware on the LSI include communication between the core and the security hardware not needing to go through a bus and prevention of physical attacks on the security hardware, even if an attacker were to steal the computer.
However, these examples and the previous studies of security hardware are focused on passive security features, in contrast to our proposed method with active security features for malware detection.
Although software-based security measures are widely used in general-purpose computers, they are relatively difficult to implement in IoT devices with limited hardware resources. Nevertheless, the process of embedding certain functions, including security functions, in hardware has already begun to be implemented. Takase et al. previously proposed the hardware-based malware-detection mechanism for IoT devices using processor information as a feature [10]. In the study, they investigated the effectiveness of the mechanism and found that processor information obtained from the core can be effectively applied for malware detection using machine learning on ARM architecture. However, the mechanism had some problems related to operation efficiency and speed. The above mechanism and problems are explained in Sect. 4.

Relationship with previous studies
In November 2019, we presented a paper at the CANDAR Workshop that is an interim report of this paper [23]. The contributions of the interim report include the improvement of the operation efficiency and speed of Takase's mechanism [10]. The former is achieved by sampling the processor information, which serves as the feature values. The latter is achieved by implementing a predictive detection algorithm that predicts the behavior of malware running on the CPU. These proposed methods are explained in Sect. 5. However, because the CANDAR paper is an interim report, its description, analysis, implementation, and evaluation of the proposal are insufficient. This paper is the complete version of that interim report. All sentences and all figures in this paper are different from those in the previous work. The differences are divided into updates and additions. For example, we updated the description of the hardware-implemented malware detection mechanism, which was also presented in the interim report, to explain the mechanism in more detail. As an addition, we introduced the time-series metadata theory, which was not noted in the previous studies, and the analysis of the malware detection behavior based on this theory. The major differences from the CANDAR paper are shown below.
-Theory
  Addition: The theory of time-series metadata is newly introduced to analyze the malware detection behavior of the conventional mechanism in each cycle (Sect. 4.4).
-Explanation
  Update: A description of the malware detection circuitry and its operation, with very detailed figures and text (Sect. 5).
-Evaluation
  Update: By adding preliminary evaluations to each evaluation process, we conclude performance improvements based on 100% Accuracy and 0% False Positive (Sect. 6).
  Addition: We have redesigned the proposed mechanism in Verilog language, and performed the logic synthesis and place-and-route targeting FPGA (Sect. 7).
  Addition: Based on the placement and routing results, we performed a quantitative evaluation of the hardware resource consumption (Sect. 7.2).
  Addition: Based on the placement and routing results, we performed a quantitative evaluation of the power consumption (Sect. 7.3).
-Verification
  Addition: Significant additions to the discussion based on the newly introduced theory of time-series metadata and new evaluation results (Sect. 8).
The other sections, which are not listed above, differ only in minor ways but remain helpful for understanding our study.

Hardware mechanism and problems
We hereafter refer to the hardware-implemented malware detection mechanism as anti-malware hardware (AMH). In this section, we outline the functions of AMH, how to implement it, and key issues.

Introduction of anti-malware hardware
The functions comprising the AMH are implemented in hardware on the LSI. Figure 1 compares the traditional software-based implementation and our proposed hardware-based implementation. In Fig. 1a, the anti-malware functionality is implemented in software, wherein the user application and the anti-malware software are executed in parallel on the core on the LSI. In many cases, anti-malware software runs in the background, competing with other applications for core resources.
By contrast, in Fig. 1b, the anti-malware function is implemented on the LSI as AMH and operates independently of the core. The hardware information inside the core executing the user application is sent unidirectionally as features to the AMH. As the AMH only copies the internal state of the core running the application, it does not consume any core resources, thus preserving these resources for user applications.

Hardware mechanism and behavior
This section details the structure of the proposed mechanism. Figure 2 is a magnified and more detailed version of Fig. 1b.
The yellow elements represent logical areas, such as user applications, software, and the operating system (OS). The blue and orange elements, respectively, represent areas of the core and AMH, which are implemented on the LSI as independent circuits that can operate in parallel.
We describe only those parts of the core of a general-purpose computer that relate to this study, such as the program counter, register file, and branch prediction mechanism. First, the core fetches an instruction to be executed by referring to the memory address in the program counter. At this time, if the instruction exists in the cache, it is fetched from the cache. When the fetched instruction is decoded, register fetching is performed, and the data necessary for instruction execution is loaded from memory into the register file. As with instruction fetching, if the data is in the cache, it is fetched from the cache. The instruction is executed only once the values in the register file are ready. The execution of a branch instruction may cause the next instruction to branch, but the branch prediction mechanism intervenes to rewrite the speculative program counter.
The AMH connected to the core consists of three elements:
-Bypass and feature-adding circuits,
-Metadata generator, and
-Malware detector.
A bypass circuit is wiring implemented on the LSI that outputs the status of various registers, caches, memory cells, etc., inside the core every cycle. The bypass circuit directly connects each element on the core, including the program counter, to the machine learning circuit. The program counter is a register that holds instruction addresses and is updated with the address of the next instruction each time an instruction is executed. The feature-adding circuit consists of a hit counter composed of transistors, wiring between the cache and the hit counter, and wiring between the hit counter and the metadata generator. The hit counter converts cache hit/miss information into a cache hit rate. Because the processor information reflects the state of the core as it changes from one cycle to the next, it can also be described as streaming data; in this respect, it differs from the information obtained by static analysis of binaries in software. The processor information obtained through the bypass circuit or feature-adding circuit is sent to the metadata generator as a feature that identifies the software being run.
The metadata generator generates the metadata necessary for judging software by using a built-in ML model. The metadata used to determine malware is a time series of attack and normal labels assigned to the processor information. The metadata generator is implemented with transistors on the LSI and, on each cycle, labels the processor information as normal or attack. The algorithm is based on Random Forest, which has the advantage of not requiring normalization and can be trained efficiently on processor information that contains both numerical and categorical values. Thus, the essence of the metadata generator is an ML model, but the per-cycle classification result output from the model does not immediately affect the software decision result.
Like the other elements, this ML model is ultimately implemented on the LSI. The overall implementation process is as follows: the model is generated by machine learning software, converted to circuit information by circuit design tools, and then implemented as hardware circuits on the LSI.
The metadata output from the metadata generator can be treated as time-series data representing the characteristics of the software. The malware detection circuit makes a decision based on the percentage of attack labels contained in the metadata. First, the decision circuit receives the metadata from the metadata generator and counts the number of attack labels contained in the metadata. If the percentage of attack labels exceeds the threshold value at the end of the software execution, then the decision circuit judges the software to be malware. This approach of conversion and interpretation into metadata allows more accurate determination than directly considering the processor information.
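The end-of-run decision described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration rather than the circuit itself: the label encoding (1 for attack, 0 for normal), the function names, and the threshold used in the example are assumptions made for this sketch.

```python
from typing import Sequence

ATTACK = 1  # assumed encoding: 1 = attack label, 0 = normal label


def attack_ratio(metadata: Sequence[int]) -> float:
    """Fraction of per-cycle labels that are attack labels."""
    if not metadata:
        return 0.0
    return sum(1 for label in metadata if label == ATTACK) / len(metadata)


def judge_at_end_of_run(metadata: Sequence[int], threshold: float) -> bool:
    """Return True (malware) if the attack ratio over the whole run exceeds the threshold."""
    return attack_ratio(metadata) > threshold


# Hypothetical metadata: mostly attack labels with a short run of normal labels.
example = [1, 1, 1, 0, 0, 1, 1, 1]
is_malware = judge_at_end_of_run(example, threshold=0.5)
```

Note that the decision is only available once the whole label sequence has been collected, which is the property the later sections set out to improve.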

Summary of the features
In this study, processor information is defined as the streaming data output every cycle from the core running the software. Table 1 lists the processor information used in this study. The processor information is roughly divided into raw data in the first half and count data in the second half. Raw data is transmitted via direct wiring between the core and the metadata generator on the LSI. Among this raw data, the operation code and register number have a unique value and order for each software. In contrast, the instruction address and load/store address change each time they are executed, owing to address space layout randomization (ASLR).
Count data is sent via a feature-adding circuit, such as the hit counter. The hit ratios of the L1 and L2 caches are features computed by the hit counter. They are incorporated into the processor information to capture the spatial and temporal locality of the program. The items from hit or miss by BTB to predicted branch direction by Gshare are features related to various branching processes. The items from distance of NOP to distance of other operation are called instruction distances, and refer to the number of instructions that have elapsed since the last instruction of the same type was executed. Although omitted in Fig. 2, various data related to branching and instruction distance are outputted by dedicated circuits in the same way as by the hit counter. Thus, processor information bypassed directly from the core or added by feature-adding circuits is inputted into the metadata generator as software features.
Figure 3 shows how processor information is outputted from the core each cycle. This processor information is sorted based on the rate of contribution of the features described below. The values in the Cycle column are indices that start at 0 and increase by 1 for each cycle. The values in the Hit_L1-I, Hit_L1-D, and Hit_L2 columns increase or decrease by a number in the range of 0 ≤ n ≤ 1 for each cache hit/miss output. The value of the instruction address column points to the address in memory where the instruction is stored, usually increasing by 4. However, depending on the instructions to be executed, the memory space may change during the execution, and the value of the instruction address may change significantly. The Distance_nop column records the number of cycles that have elapsed since the last NOP instruction was executed; until the first NOP instruction is executed, it is recorded as -1.
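As a rough software analogue of the count data, the sketch below derives a running cache hit ratio and an instruction distance from per-cycle inputs. The function names, the cumulative definition of the hit ratio, and the choice to report at a NOP the distance to the previous NOP are assumptions made for illustration; the text only fixes the value -1 before the first occurrence.

```python
from typing import Iterable, List


def running_hit_ratio(hits: Iterable[bool]) -> List[float]:
    """Cumulative hit ratio after each access (assumed definition of the hit counter output)."""
    ratios: List[float] = []
    hit_count = 0
    for accesses, hit in enumerate(hits, start=1):
        if hit:
            hit_count += 1
        ratios.append(hit_count / accesses)
    return ratios


def instruction_distance(opcodes: Iterable[str], target: str = "nop") -> List[int]:
    """Steps elapsed since `target` was last executed; -1 until it has been seen once."""
    distances: List[int] = []
    last_seen = None
    for index, opcode in enumerate(opcodes):
        distances.append(-1 if last_seen is None else index - last_seen)
        if opcode == target:
            last_seen = index  # takes effect from the next step onward
    return distances


# Hypothetical traces used only to exercise the two helpers.
hit_trace = [True, False, True, True]
opcode_trace = ["add", "nop", "add", "ldr", "nop"]
ratios = running_hit_ratio(hit_trace)
distances = instruction_distance(opcode_trace)
```

In hardware, each of these values would be produced by a dedicated counter every cycle rather than computed over a stored trace.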

Summary of metadata
The metadata is time-series data that is labeled for each cycle by inputting processor information into the metadata generator. Figure 4 shows the metadata of the software used in this study.
The horizontal axis is the normalized number of cycles that have been executed on the core (hereafter referred to as execution cycles). This normalization makes it easier to compare metadata between software with different execution cycles. The vertical axis is the number of instructions labeled as an attack by the metadata generator. The metadata is a set of binary (attack or normal) data, and we plot the number of attack labels per 250 cycles for illustrative purposes.
As a measure of metadata, we refer to the percentage of attack labels in a given interval as the score. Half of the specimens in Fig. 4a have continuous attack labels throughout, and their metadata is linear at the top. In this case, the metadata score is 1 or close to 1. Some specimens, such as Trojan_1 and Kaiten_2, were found to have partially decreasing attack labels and nonlinear graphs. They had lower scores than the specimens with continuous attack labels. We investigated the phenomenon of normal labels in malware metadata and found that it is caused by the execution of a shared library. Since library calls cannot be regarded as offensive behavior, the malware detection circuit must be able to handle such irregularities in metadata.
For the software shown in Fig. 4b (hereafter referred to as normal programs), the metadata is linear at the bottom. In this case, the metadata score is 0 or close to 0. However, we also identified nonlinear metadata containing attack labels, such as that of the x264 command, possibly for the same reason as in the cases of Trojan_1 and Kaiten_2.
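To make the plotted quantities concrete, the following sketch computes the number of attack labels per 250-cycle bin, the normalized position of each bin, and the score of each bin. The label encoding (1 for attack, 0 for normal), the function names, and the synthetic metadata are assumptions for illustration; they are not taken from the specimens in Fig. 4.

```python
from typing import List, Sequence


def attack_counts_per_bin(metadata: Sequence[int], bin_width: int = 250) -> List[int]:
    """Number of attack labels in each consecutive bin of `bin_width` cycles."""
    return [sum(metadata[start:start + bin_width])
            for start in range(0, len(metadata), bin_width)]


def normalized_bin_positions(num_cycles: int, bin_width: int = 250) -> List[float]:
    """Start of each bin expressed as a fraction of the total execution cycles."""
    return [start / num_cycles for start in range(0, num_cycles, bin_width)]


def bin_scores(metadata: Sequence[int], bin_width: int = 250) -> List[float]:
    """Score (fraction of attack labels) of each bin; the last bin may be shorter."""
    scores = []
    for start in range(0, len(metadata), bin_width):
        chunk = metadata[start:start + bin_width]
        scores.append(sum(chunk) / len(chunk))
    return scores


# Synthetic malware-like metadata: attack labels with a dip, as a library call might cause.
synthetic = [1] * 500 + [0] * 250 + [1] * 250
counts = attack_counts_per_bin(synthetic)
positions = normalized_bin_positions(len(synthetic))
scores = bin_scores(synthetic)
```

In this synthetic case the third bin scores 0 while the others score 1, so the overall score is 0.75 even though most of the run is labeled as an attack.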

Problem of the proposed mechanism
In this section, we describe some problems with AMH. The first is the metadata determination method. In conventional methods, applications are judged by comparing the overall metadata score with a threshold value. However, as shown in Fig. 4, the conventional method risks missing or misdetecting software whose metadata contains many labels with the opposite attribute. The score of the fourth interval decreases to approximately 100, and the x264 command shown in (b) exhibits an increase in the number of attack labels in the middle of the graph. As this trend becomes more pronounced, simply looking at the overall metadata score can lead to missed malware or false positives.
The second is detection speed. Traditional methods compare the threshold with the overall metadata score at the point when the software in question has finished. If malware is executed, it cannot be identified until the core has executed a certain number of cycles, and by the time it is detected, the damage has already spread. Thus, we need a decision process that can detect malware with diverse metadata within fewer cycles.
The third is the utilization of the AMH. As mentioned in Sect. 4.2, the AMH operates in synchronization with the core. When judging software with high core utilization, the input and output per unit of time for the bypass and processing circuits increase and, with them, the utilization of the metadata generator becomes very high. In addition, utilization and power consumption are proportional in hardware components, and the same is naturally true for the AMH. We cannot ignore the issue of power consumption if the proposed mechanism is installed in an IoT device. Therefore, we study a method to reduce the utilization rate of the metadata generator by suppressing unnecessary operations while maintaining judgment accuracy.
The fourth is the size of the ML model. In conventional methods, when generating the ML model for the metadata generator, the processor information obtained during the execution of the training application is learned for all cycles. The problem is that the model becomes huge when an application with high core utilization is used for training. When the model is implemented as hardware, a smaller model is more desirable as long as labeling accuracy is maintained. Thus, we investigate a method to reduce the total amount of training data to make the model lighter.

Proposal method
To address the problems described in Sect. 4.5, we propose the following approaches:
-Improving the efficiency of the detection process with predictive detection,
-Reducing hardware uptime through sampling, and
-Ensuring lightweight ML models through sampling and dimension reduction.

Efficiency with predictive detection
Predictive detection is a method of speculative malware detection that predicts the propensity of attack labels. In addition, the metadata generated from malware has some common features, as described below, which allows efficient detection of malware with a relatively high number of normal labels. Figure 5 shows the algorithm for predictive detection. We define W as the cycle width (called the window) to be processed at one time, $S_w$ as the percentage of attack labels in W cycles, and $S_t$ as the percentage of attack labels needed for detection (i.e., the threshold).
1. Label the processor information output from the bypass and processing circuits for the first W cycles and generate the metadata.
2. The malware detector calculates $S_w$ from the metadata generated in step (1).
3. If $S_w \geq S_t$, then detect the running application as malware and exit the process.
4. Otherwise, label the processor information output in the next cycle.
5. Recalculate $S_w$ with the metadata of the latest W cycles.
6. Repeat steps (3) to (5).
Malware has a characteristic concentration of attack labels on some or all of the metadata. The predictive detection algorithm calculates the score each time while moving the window one cycle at a time, allowing for malware detection to focus on these types of concentrations.
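The sliding-window procedure above can be sketched in Python as follows. This is an illustrative software model, not the detector circuit: the function name, the 0/1 label encoding, and the example values of W and the threshold are assumptions, and the window score is kept as a running count so that each new cycle costs constant work.

```python
from collections import deque
from typing import Iterable, Optional


def predictive_detect(labels: Iterable[int], window: int, threshold: float) -> Optional[int]:
    """Return the 0-based cycle at which the window score first reaches the threshold.

    `labels` holds one label per cycle (assumed encoding: 1 = attack, 0 = normal).
    Returns None if the stream ends without a detection.
    """
    recent = deque(maxlen=window)  # labels of the latest `window` cycles
    attack_count = 0               # number of attack labels currently in the window
    for cycle, label in enumerate(labels):
        if len(recent) == window:
            attack_count -= recent[0]  # oldest label is about to leave the window
        recent.append(label)
        attack_count += label
        # A score is only defined once W labels have been collected (steps 1 and 2).
        if len(recent) == window and attack_count / window >= threshold:
            return cycle               # step 3: detect and stop
    return None


# Hypothetical stream: attack labels become concentrated partway through the run.
stream = [0, 0, 1, 1, 1, 1, 0]
detected_at = predictive_detect(stream, window=4, threshold=0.75)
```

In this example the detector stops at cycle 4, as soon as the latest four labels contain three attack labels, without waiting for the rest of the stream.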

Reducing utilization rates with sampling circuit
We propose a method to reduce the operating rate of the metadata generator by implementing a sampling circuit and thinning the processor information that is outputted every cycle. The sampling circuit plays the role of a gate that repeatedly opens and closes at arbitrary intervals, and is placed just before the hit counter and the metadata generator, through which the output from the core must pass. Figure 6 shows the operation of the sampling circuit.
The operation of the sampling circuit is determined by the number of cycles x for which the gate is kept open and the number of cycles (the interval) y for which it is kept closed. For example, if x = 2 and y = 3, then the sampling circuit passes processor information for two cycles and is closed for three cycles. If the utilization rate of the metadata generator before the implementation of the sampling circuit is 1, then the utilization rate $R_u$ after the implementation is expressed by Eq. 1:
$$R_u = \frac{x}{x + y} \tag{1}$$
If x = 2 and y = 3, then the utilization rate of the metadata generator is 0.4 (= 40%). The reduction ratio $R_r$ is expressed by:
$$R_r = 1 - R_u = \frac{y}{x + y} \tag{2}$$
We now describe the power consumption of the metadata generator. For a circuit implemented on the LSI, when the switching probability is α, the capacitance of the circuit is C, the supply voltage of the circuit is $V_{DD}$, and the operating frequency is f, the energy consumption of the circuit is expressed by Eq. 3 [24]. Assuming that C, $V_{DD}$, and f are time-independent constants, Eq. 3 can be rewritten as Eq. 4, and defining the power $P_{switching}$ using the frequency f yields Eq. 5. Equation 5 shows that the power consumption of the circuit on the LSI is proportional to the switching probability. The utilization rate $R_u$ can be regarded as the switching probability α in Eq. 5. For $R_u$ = 0.4, the power consumption of the metadata generator can theoretically be reduced to 40%.
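The gate behaviour and the utilization figures in this section can be checked with a short sketch. It assumes that the gate opens at cycle 0 and that the utilization rate is the open fraction x / (x + y), which matches the 0.4 example in the text; the function names and the relative-power helper are illustrative only, the latter simply encoding the stated proportionality between power and utilization.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def gate_is_open(cycle: int, x: int, y: int) -> bool:
    """True while the gate passes data: open for x cycles, then closed for y cycles."""
    return cycle % (x + y) < x


def sample(stream: Sequence[T], x: int, y: int) -> List[T]:
    """Keep only the items that arrive while the gate is open."""
    return [item for cycle, item in enumerate(stream) if gate_is_open(cycle, x, y)]


def utilization_rate(x: int, y: int) -> float:
    """Fraction of cycles in which the metadata generator receives input."""
    return x / (x + y)


def reduction_ratio(x: int, y: int) -> float:
    """Fraction of cycles removed by sampling (assumed to be 1 minus the utilization rate)."""
    return 1 - utilization_rate(x, y)


def relative_power(x: int, y: int) -> float:
    """Power relative to the unsampled case, assuming power is proportional to utilization."""
    return utilization_rate(x, y)


passed = sample(list(range(10)), x=2, y=3)  # cycles that reach the metadata generator
```

With x = 2 and y = 3, cycles 0, 1, 5, and 6 of the first ten pass through the gate, giving a utilization rate of 0.4.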

Lightweight metadata generator
We propose two methods to achieve lightweight metadata generators:
-Sampling learning, and
-Dimension reduction of features.
Sampling learning applies the sampling concept to the process of generating the ML model, which accounts for most of the metadata generator. In conventional methods, all the processor information outputted since the start of software execution is inputted directly to the ML model for training. Sampling learning reduces the amount of processor information inputted to the model through sampling. Thus, the size of the model on the LSI is reduced compared to conventional methods. To distinguish this method from the sampling used to reduce utilization, we call it sampling learning.
Dimension reduction of features aims to reduce the weight of the model by removing unnecessary features based on the contribution rate of the features obtained during model creation. In the conventional method, we use the seventeen types of 21-dimensional features shown in Table 1. Some dimensions or features have little or no influence, and the implementation size tends to grow out of proportion to the performance of the generated model. Therefore, we reduce the model weight by calculating the contribution rate of the features during model generation and disabling unnecessary features. The calculation of the contribution rate and the generation of the model are assumed to be done on the emulator; in addition, a mechanism for disabling features is required on the wiring shown in Fig. 2. This hardware mechanism can be easily implemented by closing only the gates of the relevant features in the sampling circuit described in Sect. 5.2.
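The two reductions can be illustrated on a toy training set. The sketch below is a pure-Python outline under stated assumptions: the contribution rates, the cutoff value, and the data rows are hypothetical, and the helper names are invented for this example. In practice the contribution rates could come from the trained model, for instance the `feature_importances_` attribute of a scikit-learn `RandomForestClassifier`.

```python
from typing import Dict, List, Sequence


def sample_training_rows(rows: Sequence[Sequence[float]], x: int, y: int) -> List[Sequence[float]]:
    """Sampling learning: keep x consecutive cycles, then skip y, before training."""
    period = x + y
    return [row for cycle, row in enumerate(rows) if cycle % period < x]


def select_features(contribution: Dict[str, float], min_rate: float) -> List[str]:
    """Dimension reduction: keep features whose contribution rate reaches `min_rate`."""
    return [name for name, rate in contribution.items() if rate >= min_rate]


def keep_columns(rows: Sequence[Sequence[float]], feature_names: Sequence[str],
                 kept: Sequence[str]) -> List[List[float]]:
    """Drop the columns of disabled features from every row."""
    indices = [feature_names.index(name) for name in kept]
    return [[row[i] for i in indices] for row in rows]


# Hypothetical values for illustration only.
feature_names = ["Hit_L1-I", "Hit_L2", "Distance_nop"]
contribution = {"Hit_L1-I": 0.55, "Hit_L2": 0.44, "Distance_nop": 0.01}
rows = [[0.9, 0.8, -1], [0.9, 0.7, 1], [0.8, 0.7, 2], [0.8, 0.6, 3]]

kept = select_features(contribution, min_rate=0.05)
reduced = keep_columns(sample_training_rows(rows, x=1, y=1), feature_names, kept)
```

Here half of the cycles and one of the three features are removed before training, so the model sees a quarter-smaller feature set on half as many rows.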

Evaluation
This section describes the common evaluation environment and the evaluation details for the proposed method.

Anti-malware hardware emulator
Because this study proposes and evaluates hardware that implements anti-malware functions, we used a hardware emulator for evaluation. In LSI and processor research, when evaluating a new architecture or system, it is important to use a highly accurate and flexible simulator or emulator instead of actually fabricating the LSI [25,26]. SimpleScalar [27], a processor simulator, is used to verify various processor systems and architectures and is widely applied in processor research, for example to compare cache models and to evaluate branch prediction mechanisms under development [28,29]. Taking advantage of its open source nature, some researchers have made their own improvements to SimpleScalar and used it for evaluation [30].

Evaluation environment
To evaluate the proposed method, we created an emulator, which can be divided into the core part and the AMH part, as shown in Fig. 2. For the core part, we reproduced the branch prediction mechanism, program counter, register file, and various caches in C. The core emulator is based on the open source QEMU (ver. 2.4.1) [31]. For the AMH part, we reproduced the hit counter, metadata generation circuit, and malware detector in Python. The emulator for the metadata generation circuit uses Random Forest from the scikit-learn library. In addition to the AMH components shown in Fig. 2, we also created an emulator in Python to reproduce the proposed sampling circuit and predictive detection. We booted the Raspbian OS on the core emulator and executed the malware and normal programs shown in Tables 2 and 3, respectively.
The VirusTotal columns in Table 2 show the information obtained by analyzing the samples with VirusTotal [32]. The # of detections column indicates the number of engines that detected the uploaded file as malware. The Symantec analysis result column is the name registered in the Symantec detection engine, and the specimens were named on the basis of this result. The # of cycles columns are the training and testing execution cycles of the malware. All specimens in the table were captured by our Cowrie [33] honeypot. We selected specimens captured between July 31, 2018 and August 23, 2018 and ran them in an ARM/Linux environment. All the specimens were unpacked ELF files. Table 3 shows the list of normal programs. We selected six commands expected to be used in IoT devices. The option column shows the options specified at the time of command execution, except for those that are mandatory. We emulated up to 100,000 cycles of processor information and disabled the first 5000 cycles, including the loading process to memory, in the evaluation experiment.

Evaluation of predictive detection
To evaluate the efficiency of predictive detection, we first performed a preliminary evaluation to find the range of parameters that allows successful predictive detection (i.e., an accuracy of 100%). We define accuracy in AMH as the percentage of evaluation software for which the correct decision is made. We examined a set of initial candidate values for W and S_t, which are defined in Sect. 5.1. The larger W is, the lower the sensitivity to attack labels; the smaller W is, the higher the sensitivity. Combining these parameters, we emulated 40 different predictive detections for the 12 pieces of evaluation software. Figure 7 shows how the accuracy changes with W and S_t. The horizontal axis is S_t, the vertical axis is the accuracy, and the graph is summarized for each value of W. For example, if W = 100 and S_t = 0.6, the software is judged to be malware when attack labels appear in at least 60 of 100 consecutive cycles. The arrows indicate the point where the accuracy reaches a ceiling, such that the accuracy remains flat even if S_t increases. Of the 40 combinations, 8 achieve 100% accuracy. In the main evaluation, we focused on (W = 100, S_t = 1.0), (W = 1000, S_t = 0.4), and (W = 1000, S_t = 1.0).
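The windowed decision rule can be sketched in Python as follows. This is a minimal illustration, not the emulator used in the evaluation: the function name `predictive_detect` is hypothetical, the metadata is assumed to be a sequence of 0 (normal) and 1 (attack) labels with one label per cycle, and detection is assumed to fire once a full window of W cycles contains at least S_t × W attack labels.

```python
import math
from collections import deque


def predictive_detect(labels, window, s_t):
    """Return the 1-based cycle at which detection fires, or None.

    Assumption: detection requires a full window of `window` cycles in which
    the number of attack labels (1) is at least s_t * window.
    """
    # Round to an integer count; the small epsilon absorbs float error
    # in products such as 0.7 * 100.
    needed = math.ceil(s_t * window - 1e-9)
    recent = deque()
    attack_count = 0
    for cycle_no, label in enumerate(labels, start=1):
        recent.append(label)
        attack_count += label
        if len(recent) > window:
            attack_count -= recent.popleft()
        if len(recent) == window and attack_count >= needed:
            return cycle_no
    return None


# Hypothetical metadata: 50 normal cycles, a burst of 100 attack labels, 50 normal.
metadata = [0] * 50 + [1] * 100 + [0] * 50

print(predictive_detect(metadata, window=100, s_t=0.6))  # sensitive setting
print(predictive_detect(metadata, window=100, s_t=1.0))  # strict setting
print(predictive_detect([0] * 200, window=100, s_t=0.6)) # normal program
```

With the same burst, the stricter setting fires later than the sensitive one, which is the sensitivity versus speed relationship examined in the evaluation.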
In the main evaluation, we compared the number of cycles of the executed malware. It is known that normal programs do not respond to predictive detection within the range of parameters described above; therefore, we omitted the main evaluation of normal programs. Figure 8 shows the number of execution cycles of malware specimens.
For the most sensitive parameters (W = 100, S_t = 1.0), all malware specimens were successfully detected within a sufficiently small number of cycles. Even for the least sensitive parameters (W = 1000, S_t = 1.0), the number of cycles ranged from 4% to 19% of the reference cycles (see the # of cycles (Test) column in Table 2). These results indicate that the use of predictive detection to speed up the decision-making process can reduce the number of cycles required for detection.

Evaluation of sampling circuit
We evaluated the sampling circuit, our method for reducing the utilization rate, in terms of accuracy and reduction rate. First, we describe the method and results of the accuracy evaluation. We set candidate values for the parameters x and y, emulated the sampling circuit for each combination, and generated metadata for the 12 pieces of evaluation software. Because one set of metadata is generated for each combination of software and parameters, the total number of metadata sets was 144. Each of the 144 metadata sets has its own score, and the percentage of software that is classified correctly, i.e., the accuracy, depends on the value of Threshold. Figure 9 shows how the accuracy changes with x, y, and Threshold.
The bars, grouped by Threshold, represent accuracy. For Threshold values from 0.1 to 0.7, shown in (a), the accuracy is 100% regardless of the values of x and y. This means that all 12 pieces of evaluation software are classified correctly. However, in (b), one program fails to be classified when x = 1, y = 5000. Furthermore, in (c), more than one program fails to be classified regardless of the values of x and y. Since the judgments start to fail from Threshold = 0.9, x = 1, y = 5000, we limited the range of parameters for evaluating the reduction rate to x ≥ 10 and y ≤ 1000.
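The relationship between score, Threshold, and accuracy can be sketched in Python as follows. The decision rule (a program is judged to be malware when its score is at or above Threshold) is an assumption made for this illustration, as are the function names. The sample scores reuse the reference summary values reported for malware (0.958, 0.986, 1.000) and normal programs (0.000, 0.014, 0.080) as stand-ins for individual programs.

```python
def metadata_score(labels):
    """Fraction of attack labels (1) in a metadata sequence of 0/1 labels."""
    return sum(labels) / len(labels)


def is_malware(score, threshold):
    """Assumed decision rule: malware if the score reaches the threshold."""
    return score >= threshold


def accuracy(samples, threshold):
    """Share of (score, is_actually_malware) pairs that are judged correctly."""
    correct = sum(is_malware(score, threshold) == truth for score, truth in samples)
    return correct / len(samples)


# Stand-in samples built from the reported reference summary scores.
samples = [
    (0.958, True), (0.986, True), (1.000, True),
    (0.000, False), (0.014, False), (0.080, False),
]

for threshold in (0.1, 0.5, 0.7, 1.0):
    print(threshold, accuracy(samples, threshold))
```

Under this assumed rule, mid-range thresholds separate the two groups cleanly, while Threshold = 1.0 rejects any malware whose metadata contains even a single normal label.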
Next, Table 4 shows the metadata scores of malware specimens.
The values in the table are the average, minimum, and maximum scores of the six malware samples. For example, for x = 10, y = 500, the average, minimum, and maximum scores are 0.983, 0.955, and 1.000, respectively. The reference scores for Table 4 are as follows: Average = 0.986, Min = 0.958, Max = 1.000. Bold indicates a score that has decreased by more than 0.01 from the reference score. The higher the score, the larger the percentage of attack labels in the metadata. The minimum score at x = 10, y = 1000 is lower than the reference but is still sufficient to correctly identify the malware. Finally, Table 5 shows the metadata scores for the normal programs. The reference scores for Table 5 are as follows: Average = 0.014, Min = 0.000, Max = 0.080. Bold indicates a score that has increased by more than 0.01 from the reference score.
For a normal program, a lower score indicates a smaller percentage of attack labels in the metadata. The average or maximum score increases from the reference for four of the six parameters, but the increase is small enough to maintain the overall score.
In the range of parameters to be evaluated, x = 10, y = 1000 is the most sparse sampling and the combination that maximizes the reduction rate. According to Eq. 2, we find that the implementation of the sampling circuit can reduce the operating rate of the metadata generator by approximately 99% in the evaluation environment.

Evaluation of lightweight method
In this section, we present the evaluation of a method to reduce the model size using sampling learning and feature dimensionality reduction. First, we observed how the scores change with the parameters of sampling learning and determined the parameters that do not interfere with the judgment. Next, we calculated the contribution rates during model creation, observed the number of features and the metadata scores output by the model, and selected suitable features for the weight reduction evaluation. Finally, we compared and evaluated models created using the conventional method and the lightweight method.

Preliminary evaluation of sampling learning
We evaluated the accuracy and size reduction rate of machine learning models using sampling learning. We sampled the training data with the same candidate values as Sect. 6.4 for x and y. Metadata was generated using 12 models trained on the sampled data, and the accuracy of AMH was evaluated using the same procedure as in Sect. 6.4. Figure 10 shows x, y, Threshold, and the change in accuracy.
The bars, grouped by Threshold, indicate accuracy. Compared to the sampling circuit, the overall accuracy tends to decrease, and only Threshold = 0.1 in (a) achieves 100% accuracy regardless of the values of x and y. In the graphs from (b) onward, the number of parameters for which the decision fails gradually increases. In (f), which shows Threshold = 1.0, more than one program fails to be classified, regardless of x and y. Since the judgments start to fail from Threshold = 0.3, x = 1, y = 5000, we limited the range of parameters for evaluating the reduction rate to x ≥ 10 and y ≤ 1000. The scores for each parameter are also described in this section. Table 6 shows the metadata scores of malware specimens. The reference scores for Table 6 are as follows: Average = 0.986, Min = 0.958, Max = 1.000.
The values in the table are the average, minimum, and maximum scores of the six malware samples. For example, for x = 100, y = 500, the mean, minimum, and maximum scores are 0.969, 0.873, and 1.000, respectively. Bold type indicates scores that decreased by more than 0.01 from the reference. Compared to Table 4, which summarizes the scores of the sampling circuit, the scores tend to decrease in several combinations. Table 7 shows the metadata scores for the normal programs. The reference scores for Table 7 are as follows: Average = 0.014, Min = 0.000, Max = 0.080. Bold indicates a score that has increased by more than 0.01 from the reference score. While the scores of the malware tended to decrease, the scores of the normal programs did not increase in a way that affected the accuracy. As shown by the mean and maximum, sampling learning resulted in lower scores overall for both malware and normal programs. In the range of parameters to be evaluated, x = 10, y = 1000 is the sparsest sampling and is the combination that maximizes the reduction rate. In this case, the training data is reduced by about 99%. However, to determine the final model size, it is necessary to check the file output by the actual training process. We compare and evaluate the actual size of the output models, together with dimensionality reduction, in Sect. 6.5.3.

Preliminary evaluation of dimensional reduction
The emulator of the metadata generator is based on Random Forest, which is equipped with a function to output the contribution rates of features. In conventional methods, contribution rates are output at the stage of creating the metadata generator, and unnecessary features are examined. To evaluate the dimensionality reduction, we trained the model with features selected according to the following criteria:
- Features with a contribution rate of more than 1%
- Features with a contribution rate of more than 10%
The metadata of malware and normal programs were output by these two models, and the scores were compared with the reference model. Based on the comparison results, we selected a dimensionality reduction model suitable for evaluation, and evaluated the weight reduction method with three models: the reference model, the sampling learning model, and the model with dimensionality reduction. Table 8 shows the processor information output at the time of model creation in the order of contribution rate.
The ruled lines in the table indicate the boundaries between the contribution rates of 10% and 1%. We created clf_Ref, in which all the features were trained, clf_1, in which features with a contribution rate of 1% or higher were trained, and clf_10, in which only features with a contribution rate of 10% or higher were trained.
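The selection step can be sketched in Python as follows. The feature names and contribution rates below are hypothetical placeholders, not the values in Table 8, and `select_features` is a helper introduced only for this illustration. In scikit-learn, such rates would typically be read from the `feature_importances_` attribute of a fitted `RandomForestClassifier`, whose values are normalized to sum to 1.

```python
def select_features(contribution, min_rate):
    """Names of features whose contribution rate is at least min_rate,
    ordered from the highest to the lowest contribution."""
    ranked = sorted(contribution.items(), key=lambda item: item[1], reverse=True)
    return [name for name, rate in ranked if rate >= min_rate]


# Hypothetical contribution rates (they sum to 1, like normalized importances).
contribution = {
    "f0": 0.45, "f1": 0.30, "f2": 0.12,
    "f3": 0.08, "f4": 0.045, "f5": 0.005,
}

features_ref = select_features(contribution, 0.0)   # all features (clf_Ref)
features_1 = select_features(contribution, 0.01)    # 1% or higher (clf_1)
features_10 = select_features(contribution, 0.10)   # 10% or higher (clf_10)

print(features_ref)
print(features_1)
print(features_10)
```

Each selected list would then define the input columns used to retrain the corresponding model.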

Model size evaluation
We compared and evaluated the sizes of the models created by the conventional method and the models with sampling learning and dimensionality reduction. The evaluation condition for sampling learning was x = 10, y = 1000. For dimensionality reduction, we selected the model that learned only features with a contribution rate of 1% or higher. Table 9 shows the results of the model size comparison. With the size of the reference model normalized to 100, applying sampling learning reduced the model size to 16, and applying dimensionality reduction reduced it to 19.

Hardware performance
The contribution of the proposed method is the reduction of power and hardware resource consumption. In Sect. 6.4, we evaluated a sampling circuit that reduces the utilization rate and showed its effectiveness. In addition, in Sect. 6.5.3, we implemented two approaches that reduce the weight of ML models. In this section, in order to evaluate these hardware contributions quantitatively, we designed the sampling circuit and ML model in Verilog and performed logic synthesis and place-and-route targeting a Field Programmable Gate Array (FPGA). Table 10 shows the environment for measuring power and hardware resource consumption. We used Vivado, an integrated development tool from Xilinx, for logic synthesis and implementation. We also specified Xilinx's XC7Z020-2CLG484I as the target FPGA in Vivado. Table 11 shows the measurement results of the hardware resource consumption. It should be noted that the results assume the target FPGA device; the actual power consumption depends on the device on which the design is implemented (manufacturing process, architecture, and other factors).

Evaluation of hardware resource consumption
A configurable logic block (CLB) is the basic logic resource of an FPGA, and the number of look-up tables (LUTs) contained in the CLBs is an indicator of hardware resource consumption. For the ML model, the Reference uses 74498 LUTs, while the model with dimensionality reduction uses 6505 LUTs, a significant reduction. In other words, the number of LUTs is reduced to 8.7% of the Reference, so the weight reduction effect is greater than in the emulation results shown in Table 9. By contrast, the number of LUTs for the sampling circuit is much lower than that of the ML model, indicating that it can be implemented as a small-scale circuit.

Evaluation of power consumption
First, we evaluated the power consumption of the ML model and the sampling circuit when each circuit performs one operation on the target FPGA device. The results show that the reference ML model, the model with sampling training, the model with dimensional reduction, and the sampling circuit consume 18.95 W, 14.16 W, 14.55 W, and 0.19 W, respectively.
When the sampling training and the dimensional reduction are introduced, the power consumption is reduced by 25.3% and 23.2%, respectively, relative to the 18.95 W of the reference ML model. However, Table 12 does not reflect the utilization rate reduction achieved by our sampling circuit, because the Reference ML model with the sampling circuit operates only when it is activated by the sampling circuit, which itself operates at all times.
Second, we therefore evaluate the total hardware power consumption, including that of the ML model and that of the sampling circuit. The variables α and C V_DD^2 f in Eq. 5 correspond to the utilization rate R_u (see Eq. 1) and the power consumption of the ML model P_ML (see Table 12), respectively. Let P_SC be the power consumed by the sampling circuit. For P_ML = 18.95, x = 10, y = 1000, and P_SC = 0.19, the total power consumption P_total is calculated as follows:

$$P_{total} = R_u \cdot P_{ML} + P_{SC} \quad (6)$$

Table 13 summarizes the power consumption reduction effect of the sampling method.
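The calculation can be checked with a short Python sketch. It assumes that the total power is the ML model power scaled by the utilization rate R_u = x / (x + y), plus the power of the sampling circuit, which is always on; the function name `total_power` is hypothetical.

```python
def total_power(p_ml, p_sc, x, y):
    """Total power in watts under the stated assumption:
    the ML model is active only while the gate is open (R_u of the time),
    and the sampling circuit draws p_sc continuously."""
    r_u = x / (x + y)
    return r_u * p_ml + p_sc


# Values quoted in the text: reference ML model and sampling circuit.
p_total = total_power(p_ml=18.95, p_sc=0.19, x=10, y=1000)
print(round(p_total, 3))
```

Because the sampling circuit term is constant, the savings come entirely from the R_u factor, so sparser sampling (smaller x or larger y) lowers the total as long as accuracy is preserved.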
The values in the table are, from left to right, whether the sampling circuit is enabled or disabled, the parameters when the sampling circuit is enabled, and the total power consumption of the sampling circuit and the ML model. The results show that the sampling circuit reduces the total power consumption significantly. Equation 6 shows that the power consumption is proportional to R_u when the sampling circuit is enabled. Therefore, we can optimize the power reduction rate through the parameters x and y while maintaining 100% accuracy.

Discussion
In this section, we discuss the evaluation results shown in Sect. 6.

Predictive detection timing
In the evaluation of predictive detection using malware specimens, a difference of more than 10% was observed in the detection timing, depending on the parameters. Thus, using Trojan_1 as an example, we analyzed the features that appear in the metadata and the timing of detection. Figure 11 shows the metadata and two patterns of detection timing for Trojan_1.
The figure shows Trojan_1 extracted from Fig. 4a with the first half enlarged. The red vertical dashed line on the left is the detection timing under the condition of high sensitivity to attack labels, with parameters W = 100 and S_t = 0.4. For the highly sensitive parameters, detection occurs immediately after execution starts, when the percentage of attack labels increases instantaneously. The second dashed line shows the detection timing for parameters with low sensitivity to attack labels (W = 1000, S_t = 1.0). In this case, the system detects the malware only when the attack label appears for 1000 consecutive cycles, which corresponds to the concentration of attack labels in the interval 0.10-0.15 on the horizontal axis. Under the hypothesis that malware metadata always contains intervals with a concentration of attack labels, there is a tradeoff between the sensitivity of predictive detection and the detection speed. However, we believe that the number of specimens handled in this study is insufficient to cover all the features of malware, and so it is necessary to continue collecting metadata.

Figure 7 from the preliminary evaluation of predictive detection shows that many parameters have an accuracy of 91.7% (11/12). The exception is the x264 command of the normal programs. Figure 12 shows an example of x264 being incorrectly identified as malware by predictive detection.
The dashed line indicates the timing of the detection. In the case of W = 100, S_t = 0.9, the x264 command is determined to be malware by predictive detection at the 54811th cycle. Even if the metadata as a whole contains less than half attack labels, a sudden increase in the number of attack labels causes insufficiently studied parameters to induce false positives in normal programs. Figure 12 is a typical example.

Sampling circuit
In the evaluation results shown in Sect. 6.4, the operation rate of the metadata generator decreased while maintaining accuracy by appropriately setting the parameters of the sampling circuit. Therefore, we observed the change in metadata when the sampling circuit was implemented. Figure 13 shows the metadata output via the sampling circuit with x = 100, y = 1000.
Compared to the metadata without the sampling circuit shown in Fig. 4, the implementation of the sampling circuit has smoothed the metadata rather than reducing the number of labels. In particular, Trojan_1 and the x264 command originally had irregular label appearances, but these irregularities are absorbed. We initially implemented the sampling circuit to reduce the utilization rate. However, given the confirmed effect of smoothing the metadata, the sampling circuit can also be expected to prevent false positives for normal programs in predictive detection, such as that shown in Fig. 12.

Lightweight
In Sect. 6.5.1, we performed sampling training and evaluated the scores output from the model. From Table 6, when x = 10, y = 500, the minimum score is reduced by more than 0.2 from the reference. As an example of a malware training failure, Fig. 14 shows the metadata output from a model trained to sample at x = 10, y = 500.
Among the generated metadata, the metadata of the malware is even more irregular, particularly for Kaiten_1. In the conventional method, Kaiten_1 is relatively smooth, but in the example of Fig. 14, it becomes irregular from around 0.2 to 0.8 on the horizontal axis. By contrast, the x264 command, which was the only normal program with irregular metadata, was smoothed by sampling learning.
As mentioned above, the attack labels decrease for both malware and normal programs. With the parameters x = 10, y = 500, the interval is large relative to the number of cycles per extraction. Therefore, it is highly likely that the training data (processor information) necessary to output attack labels falls within the interval y and is discarded during the sampling process. In addition, considering the bias in the amount of training data between malware and normal programs (see Tables 2 and 3, respectively), this may bias the labeling toward normal. As the same tendency is observed under other conditions, we conclude that, in our evaluation environment, using a model with low labeling accuracy makes it difficult for attack labels to appear overall.

Conclusion
In this study, we proposed circuits and learning methods to achieve high efficiency, low power consumption, and light weight for malware detection hardware that uses processor information as a feature value. To address these three goals, we focused on the time-series metadata that can be obtained from the processor information. For efficiency, we implemented predictive detection in the malware detection part of the proposed mechanism, which performs behavioral detection by predicting metadata characteristics. Using predictive detection, we were able to detect malware within 19% of the execution cycles that the conventional method requires. We implemented a sampling circuit to achieve low power consumption by reducing the processor information output from the core. We were able to reduce the operating rate of the metadata generator by approximately 99%. To achieve weight reduction, we focused on the learning process of the metadata generator based on machine learning models. By separately verifying sampling learning and feature dimensionality reduction in the learning process, we succeeded in creating a metadata generator that was 16% of the size of the conventional one, while maintaining labeling performance.
Compared with previous work using HPC, HPC-MalHunter [13] achieves a detection rate of 90.69% and a false positive rate of 0.79%, making its decision after running 400,000 instructions from the beginning. By contrast, our AMH achieves a 100% detection rate and a 0% false positive rate with a high detection speed, requiring fewer instructions, because it uses the time-series metadata of the ML model.
The predictive detection proposed in this study calculates the percentage of attack labels every time the processor information is output for one cycle. Therefore, there is a concern that the overhead will be large even if it is implemented as hardware. As a countermeasure, we envisage a method that combines predictive detection and the sampling circuit. However, since predictive detection assumes that the processor information is not sampled, future work should reexamine the predictive detection algorithm on the basis of the sampling circuit.
In addition, there is the issue of ML model and signature updates. In this paper, the proposed ML model, which is implemented as hardware, does not have an update mechanism. Therefore, it is difficult to handle new malware strains that behave similarly to normal programs. Future work should introduce a configurable mechanism that allows the ML model to be updated repeatedly.