1 Introduction

Transportation Management Systems (TMS) represent a class of software systems employed in supply chain logistics management [1]. They are designed to facilitate logistics enterprise management by optimizing transportation activities, thus increasing efficiency, enhancing customer service, minimizing costs, improving visibility, managing scheduling fees, enabling traceability, and managing deliveries [2].

Logistics plays a critical role in any national economy. Many logistics enterprises utilize various TMS platforms developed by different providers. Furthermore, logistics enterprises are numerous, and the logistics sector is rapidly evolving. Consequently, TMS platforms have a substantial market in all countries. Figure 1 illustrates the growth of the TMS market in China [3].

Fig. 1
figure 1

Market value of TMS in China from 2017 to 2022. (unit: 100M RMB; 2022 value is estimated; source: www.askci.com)

Transport management involves numerous intricate processes. A TMS is used by many different people in different roles and must have a comprehensive range of functionalities. The design and development of a user-friendly TMS that can accommodate a diverse user base is a challenging task. Traditionally, TMS development projects may start by collecting user requirements before proceeding with other development activities. However, software cost estimation is a significant challenge in this process, especially due to the diverse user base and rapidly changing demands [4]. An iterative development approach, where the user requirements can evolve over time, is more practical than a single development cycle.

In an iterative development approach, it is crucial to evaluate the current TMS version and prioritize requirements [5], guiding developers on what to add in the next version. TMS manufacturers can leverage several evaluation methods to understand their product’s 'real-time' usage, assess its user-friendliness, and guide its iterative improvement.

Methods available for TMS evaluation include user feedback collection, functional probe methods, data flow methods, and log methods.

User feedback is often manually collected by manufacturers, but this approach may be biased, inefficient, and time-consuming due to the high cost of customer time. Therefore, automated methods have been proposed to overcome these challenges, such as functional probe, data flow, and log methods.

The functional probe method involves deploying probes within the system that record each function's entry and usage, uploading usage traces to a server for real-time evaluation [6]. Despite its advantages, this method requires user consent for data security and runs the risk of interrupting the user’s operations.

The data flow method focuses on the product cycle, and exchange of data, irrespective of function usage [7]. However, this method often faces resistance from users unwilling to share sensitive data with third parties for software evaluation.

The log method uses system logs, crucial for operation backtracking and security management, to evaluate the TMS. It records user traces, not user data, making it safer, cheaper, and less disruptive than other methods [8]. However, extracting accurate and valid records from log files is challenging, especially because of the presence of 'sloppy' users, who operate the system carelessly for various reasons and thereby produce abnormal log records. This careless use tarnishes the accuracy and validity of the information contained in log files, interferes with proper evaluation, and subsequently impedes efforts at system improvement. To address this issue, it is important to effectively identify (and then eliminate) the log records created by sloppy users. To this end, we propose a novel method named Log Evaluation through Operation Sequence Distribution (LEOSD). The LEOSD method analyses the operation sequences in logs to identify sloppy users and removes their usage records to ensure accurate system evaluation.

Our method rests on the following assumption and hypotheses:

  • Assumption: A TMS used in industry may have sloppy users, but their number is usually much smaller than the number of genuine users.

  • Hypothesis A: Genuine users typically operate the TMS through meaningful patterns, generating log files of operations with a specific distribution.

  • Hypothesis B: As sloppy users operate the system carelessly, their operation sequence distribution will significantly differ from that of genuine users.

If this assumption and these hypotheses hold, then we can identify sloppy users based on the difference in their operation sequence distributions.

We conducted an experiment using a real industry log file containing 632,575 operation records, which affirmed our assumption and hypotheses.

The remainder of this paper is organized as follows: Section 2 provides contextual information to understand the research motivation; Section 3 reviews relevant literature; Section 4 introduces the LEOSD algorithm; Section 5 discusses the experiment procedure and results; Section 6 presents issues and discussions; and finally, Section 7 provides a conclusion and future research directions.

2 Context

In Chinese transportation organizations, the use of TMS is typically mandatory. However, compulsory system use does not guarantee effective and proper utilization. Some users may misuse the system due to unintentional errors, such as entering incorrect values or mistakenly selecting the wrong fields. These errors may reveal issues with the system's user-friendliness. Conversely, others may use the system ineffectively due to resistance behaviors, such as inputting arbitrary incorrect values or indiscriminately clicking fields.

User resistance to information systems is a well-documented, persistent phenomenon identified as a primary cause of information system failure [9]. User resistance refers to the behaviors of those users expressing dissatisfaction with the system [10]. The causes of user resistance are multifaceted, with various theories proposed to explain it, such as status quo bias (SQB), where resistance is viewed through the lenses of rational decision-making, cognitive misperceptions, and psychological commitment [11]. Resistance can manifest as covert behaviors like apathy and sabotage or as more overt destructive behaviors [9]. Although an in-depth discussion of reasons for user resistance is beyond the scope of this paper, acknowledging its existence aids in understanding the problem that motivated our log evaluation technique.

Research [12,13,14] has shown a spectrum of resistance behaviors, ranging from overt hostility marked by aggressive or destructive behavior to covert actions characterized by apathy and passive resistance [15]. The latter is of particular interest in this paper. Passive resistance, often stemming from fear or stress about the perceived technological 'intrusion' into one's stable world, may manifest through the input of incorrect data [14]. These users give the appearance of system engagement while covertly resisting it [12], using the system carelessly and inputting inaccurate or random data. These passive resistant users, henceforth referred to as 'sloppy users,' can significantly impact information quality and, consequently, system success [16].

Inaccurate data on a large scale can hinder organizational decision-making. The effect of sloppy users is especially pronounced in industries with large workforces, like China's transport sector. As information systems increasingly integrate with core industrial operations, system use becomes mandatory. While most studies of user resistance have occurred in voluntary contexts [17], passive user resistance in mandatory settings is an emerging and crucial problem. The implications of passive resistance concerning information quality and trust are significant and must be acknowledged.

Given the central role of TMS in effective goods transportation, the impact of sloppy users is self-evident. With the widespread use of TMS in the transport industry, increased economic pressures, the need for rapid goods movement, and large workforces, identifying system data from sloppy users is imperative, particularly in contexts like China, where system use is mandatory. Additionally, the issue of trust is tied to the reliability and credibility of data. Data reliability and credibility are significant factors influencing system effectiveness [18] and therefore must be considered in log analysis methods. To date, the development of log analysis techniques has largely overlooked the identification of sloppy user interactions. The focus in research has been on identifying system use extent, unintentional system errors, and system use patterns from a security perspective. However, as mandatory system use rises and larger workforces require information systems, recognizing the importance of log analysis aimed at identifying erroneous or inaccurate data arising from passively resistant users' activities is critical.

3 Literature background

Transport serves as the backbone of any economy, facilitating the exchange of goods and services. Transportation Management Systems (TMS) harness IT to design, plan, and implement transport systems to efficiently and effectively meet the objectives of various processes involved in transportation. However, the success of a TMS or any information system depends on users interacting with the system as intended.

The DeLone and McLean model of information system success provides a widely recognized framework for understanding system quality. The model identifies system use as a significant indicator of the success of any information system [16]. System use is related to other crucial factors of system success, including system quality, service quality, information quality, and user satisfaction, all interacting to manifest as system benefits (e.g., cost savings, efficiencies, time savings, etc.). System use refers to the extent of system use, the way the system is used, and includes metrics such as usage patterns, number of transactions executed, and the number of interactions with the system [16].

System logs serve as a valuable resource to gain insights into system use. These logs capture the events occurring within the system when in use, storing information such as when, who, and how the system is used [19]. System logs record user interactions over time without interrupting the user, providing valuable information about system use. Evaluating these logs by seeking usage patterns can provide a foundation for assessing the system's user-friendliness [20]. However, the analysis of user logs has limitations, such as the lack of information about user goals and objectives [20]. Still, identifying patterns and nature of use can inform iterative system improvements.

While patterns in logs are insightful, outliers in the data can also prove invaluable. Outliers deviate from typical values or patterns [21], allowing for the identification of user behavior that falls outside 'usual' or 'expected' patterns of use. In the quest to understand user behavior for iterative system improvement, these outliers can be just as valuable as, or even more valuable than, the 'normal' patterns found in system logs.

Once outliers are identified, subsequent investigations with users may uncover insights about unintentional careless clicks, erroneous system use, or even intentional careless usage. Identifying outliers in system logs can also help detect intrusions, fraudulent or malicious system use, and inform cybersecurity measures [19, 22]. In settings such as healthcare, outlier detection is crucial to identify anomalies like abnormal heart rhythms or medication errors, for instance [23]. Thus, the identification of anomalies or outliers in system logs is a vital and worthwhile task.

Traditional outlier detection in logs could be done manually, but this approach is time-consuming and prone to human error. Moreover, with the surge in technology use in organizations, the volume of logs generated is immense, making manual methods impractical [24]. Hence, significant efforts are being made to develop techniques for automatically identifying outliers in log data sets. Most of these efforts emphasize intrusion detection, with comparatively less focus on management system logs [19].

Various methods are used for outlier detection in data mining, including distribution-based methods (using statistical testing to identify data deviating from probability distributions), depth-based methods (considering points in k-dimensional space, with shallower points identified as outliers), distance-based methods (often using Euclidean measures), density-based methods (providing a likelihood that the object is an outlier), RNN-based methods, regression analysis, and subspace methods [19].

The most popular framework for anomaly detection based on log file analysis is summarized in Fig. 2, reproduced from Zeufack's paper [25].

Fig. 2
figure 2

Log-based anomaly detection framework (reproduced from Zeufack’s paper [25])

The proposed framework for anomaly detection in TMS log data involves four stages: log collection, log parsing, feature extraction, and anomaly detection. This approach is in line with the general framework for log-based anomaly detection. However, it introduces a crucial difference in the log parsing phase compared to traditional techniques.

Conventionally, log parsing relies on both syntax notation and semantic understanding of the log file. This approach necessitates analysts to comprehend the meaning of the log file and interpret the user's intentions for each operation. Such in-depth understanding is vital to identify malicious activities such as hacking attacks.

However, the method proposed in this research simplifies the log parsing stage. We suggest a lightweight approach that can easily analyze log files from various TMS without needing a semantic understanding of those log files. Therefore, our log parsing technique is straightforward and swift, as our parser operates only at the syntax level. It aims to extract essential information such as operation id, user id, and time sequence by splitting operations.

This feature of our approach allows for high efficiency in processing generic log files without the need for detailed knowledge about the specific TMS. Consequently, it streamlines the process of identifying anomalous behavior, particularly in the context of passive user resistance and 'sloppy users.' It offers a novel way to ensure the quality and reliability of data, crucial to system effectiveness and organizational decision-making.

4 Log evaluation through operation sequence distribution (LEOSD)

The LEOSD (Log Evaluation through Operation Sequence Distribution) algorithm proposed in this section aims to identify "sloppy users", i.e., the log files generated through careless interactions with a TMS.

4.1 Informal description of the algorithm

In this subsection, we describe the algorithm informally so that readers, even without a strong mathematics background, can quickly grasp its essence. A precise mathematical description follows in the subsequent subsections.

The algorithm contains two phases. The first phase is a learning phase, and the second phase is an evaluation phase.

  • Learning Phase: A set of log files is collected to form the training samples. The algorithm assumes that most log file samples were generated through normal operations, while a small number are abnormal, meaning they were created through careless mouse clicks or passive resistance. The learning phase calculates an average distribution of the operation sequences over the whole set of training samples, and then computes the distance between this average distribution and the distribution of each individual training sample. We then remove the samples whose distance is significantly larger than the average distance, forming a more reliable training set. This process repeats until no single log file in the training set has a distribution too far from the average. We now have a benchmark set of log files, from which we calculate the average distribution and the average distance between this distribution and the individual samples. This gives us the benchmark distribution and standard distance.

  • Evaluation Phase: In the evaluation phase, the aim is to evaluate whether a given log file was generated through normal operation or not. We first calculate the distance between the operation sequence distribution of this log file and the benchmark distribution. If the distance is much larger than the standard distance, we consider that the log file was likely generated through careless clicking, hence, it can be identified as an outlier or an abnormal interaction.

The core premise of this algorithm is the assumption that operation sequences resulting from normal interactions will have a different distribution pattern than those generated through careless or resistant user behaviour. In the next subsections, we will present a more detailed and mathematical description of the LEOSD algorithm.

4.2 Definitions and basic propositions

Definition 1

(Operation and log) Let X be a TMS; this system has \(k\) types of operations \({o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\) that will be captured by a log. They are simply called operations, and the set of all operations is denoted as \(O\). When a user uses X, the system generates a log, which is a sequence of operations. A log is denoted by \(D=\left({o}_{1},{o}_{2},\dots ,{o}_{n}\right)\) with \({o}_{i}\in O, 1\le i\le n\), while \(n=\left|D\right|\) is the number of operations in the log and is called the length of the log.

Definition 2

(Operation chains) Let X be a TMS system, with operation set \(O\). \(\forall t>0, l={(o}_{1},{o}_{2},\dots ,{o}_{t})\) is a sequence of operations with \(\forall 1\le i\le t,{o}_{i}\in O\), then \(l\) is called a tth order operation chain of X. The set of all \(t\mathrm{th}\) order operation chains of X is denoted as \({O}^{t}\). The set of all operation chains is denoted as \(\mathcal{O}=\bigcup_{t=1}^{\infty }{O}^{t}\).

Please be aware that an operation chain may contain repeated operations. For example, \(\left({o}^{1},{o}^{1}\right)\) is also a second order operation chain of X.

Proposition 1

Let X be a TMS software, \(O\) is its operation set with \(k\) different types of operations. Then for any positive integer \(t\), there are \({k}^{t}\) different types of operation chains, i.e., \(\left|{O}^{t}\right|={k}^{t}\).

Proof Let \(l={(o}_{1},{o}_{2},\dots ,{o}_{t})\) be a tth order operation chain of X. For any \(1\le i\le t\), because \({o}_{i}\in O\), \({o}_{i}\) has \(k\) different choices. Also, because an operation chain can contain repeated operations, each position is independent of the others; therefore, the total number of combinations is \({k}^{t}\).

According to Proposition 1, we know that X has \(k\) different first order operation chains, denoted as \(\left({o}^{1}\right),\left({o}^{2}\right),\dots ,\left({o}^{k}\right)\) respectively. Similarly, it has \({k}^{2}\) second order operation chains. Please note that what we discuss here are possible operation chains based on their forms; this is different from the operation chains that can actually appear in log files generated through valid operations. Because of the properties of a TMS, some operation chains are impossible in the real world. For example, it is impossible to delete an order record from a database immediately after the database is initialized, because the initialization operation removes all order records and leaves none to be deleted.

Proposition 2

Let X be a TMS, \(D=\left({o}_{1},{o}_{2},{o}_{3},\dots ,{o}_{n}\right)\) be a log of X. Then for any positive integer \(t\le n\), \(D\) contains \(\left(n-t+1\right)\) operation chains of tth order.

Proof This proposition can be proved through construction. For any positive integer \(t\le n\), \(\left({o}_{1},{o}_{2},\dots ,{o}_{t}\right),\left({o}_{2},{o}_{3},\dots ,{o}_{t+1}\right),\dots ,\left({o}_{n-t+1},{o}_{n-t+2},\dots ,{o}_{n}\right)\) are the \(\left(n-t+1\right)\) operation chains with order of \(t\) in \(D\).

According to Proposition 2, for a log with the length of \(n\), there are \(n\) first order operation chains and \(\left(n-1\right)\) second order operation chains in it.
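Proposition 2's sliding-window construction is mechanical and easy to verify in code. The following minimal Python sketch (the function name and string operation labels are illustrative, not part of the paper's notation) extracts all tth order operation chains from a log:

```python
def operation_chains(log, t):
    """Return all t-th order operation chains of a log via a sliding
    window; per Proposition 2, a log of length n contains n - t + 1."""
    return [tuple(log[i:i + t]) for i in range(len(log) - t + 1)]

log = ["o1", "o2", "o3", "o2", "o3"]
# A length-5 log yields 4 second order chains (5 - 2 + 1 = 4).
print(operation_chains(log, 2))
```

Note that repeated chains are kept, since Definition 3 below counts occurrences rather than distinct forms.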

Definition 3

(Operation-chain distribution) Let X be a TMS with its operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\) and \(\mathcal{O}\) as its operation chain set. \(D=\left({o}_{1},{o}_{2},{o}_{3},\dots ,{o}_{n}\right)\) is a log of X. Define operation-chain distribution (simply called distribution if without confusion) of \(D\) as a function from operation chain set to real numbers, denoted: \({\xi }_{D}:\mathcal{O}\to {\mathbb{R}}\). The value of the function is given in Eq. (1):

$${\xi }_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)=\left\{\begin{array}{c}\frac{{\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)}{n-t+1}, \quad t\le n\\ 0, \qquad \qquad\qquad t>n\end{array}\right.,$$
(1)

while \({\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) is the number of occurrences of operation chain \(\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t}\) in \(D\). \({\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) is also called the count of \(\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) in \(D\).

Example: Let X be a TMS with its operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3}\right\}\). \(D=\left({o}^{1},{o}^{2},{o}^{3},{o}^{2},{o}^{3}\right)\) is a log of X. According to Definition 3, the counts of all first order operation chains in \(D\) are: \({\mathrm{\rm N}}_{D}\left({o}^{1}\right)=1\), \({\mathrm{\rm N}}_{D}\left({o}^{2}\right)={\mathrm{\rm N}}_{D}\left({o}^{3}\right)=2\). For second order operation chains, \({\mathrm{\rm N}}_{D}\left({o}^{1},{o}^{2}\right)=1\), \({\mathrm{\rm N}}_{D}\left({o}^{2},{o}^{3}\right)=2\), \({\mathrm{\rm N}}_{D}\left({o}^{3},{o}^{2}\right)=1\), and the counts of all other second order operation chains are 0. Then, based on Eq. (1), we calculate the distribution as: \({\xi }_{D}\left({o}^{1}\right)=1/5\), \({\xi }_{D}\left({o}^{2}\right)={\xi }_{D}\left({o}^{3}\right)=2/5\), \({\xi }_{D}\left({o}^{1},{o}^{2}\right)=1/4\), \({\xi }_{D}\left({o}^{2},{o}^{3}\right)=1/2\), \({\xi }_{D}\left({o}^{3},{o}^{2}\right)=1/4\), and the distribution values of all other second order operation chains are 0. Similarly, it is not difficult to calculate the distributions of higher order operation chains; we omit them to save space.
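Such distributions can be computed mechanically. The sketch below (with illustrative names) implements Eq. (1), dividing each chain's count by the number \(n-t+1\) of chains of that order; for instance, \(o^{2}\) occurs twice among the five first order chains, giving relative frequency 2/5:

```python
from collections import Counter

def chain_distribution(log, max_order):
    """Operation-chain distribution xi_D of Definition 3: map each
    observed chain of order t <= max_order to its count over (n - t + 1).
    Unobserved chains implicitly have frequency 0."""
    n = len(log)
    xi = {}
    for t in range(1, max_order + 1):
        if t > n:
            break  # Eq. (1): chains longer than the log have frequency 0
        counts = Counter(tuple(log[i:i + t]) for i in range(n - t + 1))
        for chain, c in counts.items():
            xi[chain] = c / (n - t + 1)
    return xi

log = ["o1", "o2", "o3", "o2", "o3"]
xi = chain_distribution(log, 2)
print(xi[("o2",)], xi[("o2", "o3")])  # 0.4 0.5
```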

Definition 4

(Distribution distance between two logs) Let X be a TMS with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({D}_{1}\) and \({D}_{2}\) are two logs; the two logs may have different lengths. Define \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) as an order weight vector with \(\forall t>0, {w}_{t}\ge 0\), then the distribution distance with \(\mathcal{W}\) between \({D}_{1}\) and \({D}_{2}\) is defined in Eq. (2):

$${\mathfrak{D}}_{\mathcal{W}}\left({D}_{1},{D}_{2}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t},t=1}^{\infty }{w}_{t}{\left({\xi }_{{D}_{1}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)-{\xi }_{{D}_{2}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\right)}^{2}$$
(2)

Order weight vectors play a crucial role in defining the impact of different orders of operation chains in the system. In essence, they reflect the operational requirements and usage patterns inherent to the specific software system in question.

For instance, consider a TMS that has numerous functions necessitating a fixed sequence of operations to execute. In this scenario, a standard operation would naturally create log entries containing many such operation chains, which would be reflected in the high order distribution. To capture this importance of sequence, high order components in the weight vector would carry non-zero values. This setup ensures that the model effectively recognizes and accounts for operation chains that play a crucial role in the functionality of the system.

On the other hand, let's imagine a different system where most functions can be executed with only one or two operations. In this case, the high order operation chains would not be significant. Therefore, in the weight vector, all weight components associated with order larger than 2 could be set to 0. This configuration ensures that the model prioritizes those operation chains that are most relevant to the system's use and functionality.

In essence, the selection of a suitable weight vector is dictated by the specifics of the software system in question and may require adjustment based on experience and iterative analysis. This adjustable nature allows the model to accommodate the diverse and complex usage patterns that can occur across different systems, enhancing the accuracy and applicability of the LEOSD algorithm.
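Eq. (2) can be sketched directly in code. In the minimal version below (names are illustrative), distributions are dictionaries mapping chains (tuples) to relative frequencies, and a chain absent from a distribution is treated as having frequency 0; orders beyond the weight vector's length carry weight 0, as in a vector whose higher components are zeroed:

```python
def distribution_distance(xi1, xi2, weights):
    """Distance of Eq. (2): a weighted sum of squared differences in
    chain frequencies. weights[t - 1] is the order weight w_t; chains
    of order t > len(weights) are ignored (their weight is 0)."""
    dist = 0.0
    for chain in set(xi1) | set(xi2):
        t = len(chain)
        if t <= len(weights):
            dist += weights[t - 1] * (xi1.get(chain, 0.0) - xi2.get(chain, 0.0)) ** 2
    return dist

# Two first-order-only distributions and weight vector W = (1):
xi_a = {("o1",): 0.5, ("o2",): 0.5}
xi_b = {("o1",): 1.0}
print(distribution_distance(xi_a, xi_b, [1.0]))  # (0.5-1)^2 + (0.5-0)^2 = 0.5
```

Propositions 3 and 4 follow directly from this form: a log's distance to itself is 0, and the squared differences make the distance symmetric and nonnegative.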

Proposition 3

Let X be a TMS and \(D\) is a log of X; \(\mathcal{W}\) is an order weight vector, then \({\mathfrak{D}}_{\mathcal{W}}\left(D,D\right)=0.\)

Proof This proposition follows directly from the definition; details are omitted.

Proposition 4

Let X be a TMS software and \({D}_{1},{D}_{2}\) are two logs of X; \(\mathcal{W}\) is an order weight vector, then \({\mathfrak{D}}_{\mathcal{W}}\left({D}_{1},{D}_{2}\right)={\mathfrak{D}}_{\mathcal{W}}\left({D}_{2},{D}_{1}\right)\ge 0\).

Proof This proposition follows directly from the definition; details are omitted.

Definition 5

(Average distribution of operation chains) Let X be a TMS with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. Define average distribution of \({\mathbb{D}}\) in Eq. (3)

$${\xi }_{\mathbb{D}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)=\frac{1}{m}\sum_{j=1}^{m}{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)$$
(3)

Similarly define distribution distance between a log \(D\) to the log set \({\mathbb{D}}\) as in Eq. (4):

$${\mathfrak{D}}_{\mathcal{W}}\left(D,{\mathbb{D}}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t},t=1}^{\infty }{w}_{t}{\left({\xi }_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)-{\xi }_{\mathbb{D}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\right)}^{2}$$
(4)

Definition 6

(Average distance) Let X be a TMS software with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. Define the average distance of \({\mathbb{D}}\) in Eq. (5) as:

$${\mathfrak{D}}_{\mathcal{W}}\left({\mathbb{D}}\right)=\frac{1}{m}\sum_{j=1}^{m}{\mathfrak{D}}_{\mathcal{W}}\left({D}_{j},{\mathbb{D}}\right)$$
(5)

Definition 7

(Harmonic log set) Let X be a TMS software with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. \({\mathbb{D}}\) is called a harmonic log set if \(\forall i, 1\le i\le m\), \({\mathfrak{D}}_{\mathcal{W}}\left({D}_{i},{\mathbb{D}}\right)<2{\mathfrak{D}}_{\mathcal{W}}\left({\mathbb{D}}\right)\).

4.3 Formal description of the algorithm

Let X be a TMS with operation set of \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\); \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are logs of X, \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is order weight vector.

The initial step involves determining the comparison order and the weight vector. Our estimation is that calculating the first three orders should suffice for most systems. This implies that for the weight vector \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\), \({w}_{t}=0\) for all \(t>3\). For the first three orders, we aim to select weights such that each order contributes similarly to the final distance. To accomplish this, for each pair of logs in \({\mathbb{D}}\), we first calculate their distribution distances at the three distinct orders. For instance, for logs \({D}_{i},{D}_{j}\in {\mathbb{D}}\), their distances for the first three orders are given by Eqs. (6)–(8):

$${\mathfrak{D}}_{1}\left({D}_{i},{D}_{j}\right)=\sum_{\widehat{o}\in O}{\left({\xi }_{{D}_{i}}\left(\widehat{o}\right)-{\xi }_{{D}_{j}}\left(\widehat{o}\right)\right)}^{2}$$
(6)
$${\mathfrak{D}}_{2}\left({D}_{i},{D}_{j}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)\in {O}^{2}}{\left({\xi }_{{D}_{i}}\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)-{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)\right)}^{2}$$
(7)
$${\mathfrak{D}}_{3}\left({D}_{i},{D}_{j}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)\in {O}^{3}}{\left({\xi }_{{D}_{i}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)-{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)\right)}^{2}$$
(8)

Then define the proportion of each order distance as follows:

$${P}_{t}\left({D}_{i},{D}_{j}\right)={\mathfrak{D}}_{t}\left({D}_{i},{D}_{j}\right)/\left({\mathfrak{D}}_{1}\left({D}_{i},{D}_{j}\right)+{\mathfrak{D}}_{2}\left({D}_{i},{D}_{j}\right)+{\mathfrak{D}}_{3}\left({D}_{i},{D}_{j}\right)\right), \quad t=1,2,3$$

We then calculate the average proportion of each order distance, denoted as \({P}_{1}\left({\mathbb{D}}\right)\), \({P}_{2}\left({\mathbb{D}}\right)\), and \({P}_{3}\left({\mathbb{D}}\right)\). We select \({w}_{1}=1\), \({w}_{2}={P}_{1}\left({\mathbb{D}}\right)/{P}_{2}\left({\mathbb{D}}\right)\), and \({w}_{3}={P}_{1}\left({\mathbb{D}}\right)/{P}_{3}\left({\mathbb{D}}\right)\). To simplify the calculation, \({w}_{2}\) and \({w}_{3}\) can be rounded to the nearest integers.

The second step involves creating a harmonic log set for X. This is accomplished by calculating the average distribution of \({\mathbb{D}}\), followed by its average distance. After this, we can evaluate whether \({\mathbb{D}}\) is a harmonic log set. If it is, the second step is complete. If it isn't, we compare the distances of individual logs in \({\mathbb{D}}\) to the average distribution of \({\mathbb{D}}\) and eliminate those logs whose distances to the average distribution exceed twice the average distance of \({\mathbb{D}}\). After removing these logs, we are left with a smaller log set. We then repeat the second step until a harmonic log set for X is obtained.
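A minimal sketch of this pruning loop, assuming the distribution of each log has already been computed (as a dictionary from chains to frequencies) and using illustrative names, might look as follows:

```python
def avg_distribution(dists):
    """Average distribution of a log set (Eq. 3): mean frequency per chain."""
    m = len(dists)
    avg = {}
    for xi in dists:
        for chain, f in xi.items():
            avg[chain] = avg.get(chain, 0.0) + f / m
    return avg

def distance(xi1, xi2, weights):
    """Weighted squared-difference distance between two distributions
    (Eqs. 2 and 4); weights[t - 1] is the order weight w_t."""
    d = 0.0
    for chain in set(xi1) | set(xi2):
        t = len(chain)
        if t <= len(weights):
            d += weights[t - 1] * (xi1.get(chain, 0.0) - xi2.get(chain, 0.0)) ** 2
    return d

def harmonic_subset(dists, weights):
    """Step 2 of LEOSD: repeatedly drop logs whose distance to the set's
    average distribution exceeds twice the average distance, until the
    remaining set is harmonic (Definition 7)."""
    dists = list(dists)
    while True:
        avg = avg_distribution(dists)
        d = [distance(xi, avg, weights) for xi in dists]
        avg_d = sum(d) / len(d)
        keep = [xi for xi, di in zip(dists, d) if avg_d == 0 or di < 2 * avg_d]
        if len(keep) == len(dists):
            return dists, avg, avg_d  # harmonic set, its distribution and distance
        dists = keep

# Hypothetical example: three similar distributions and one outlier.
normal = [{("a",): 0.5, ("b",): 0.5},
          {("a",): 0.55, ("b",): 0.45},
          {("a",): 0.45, ("b",): 0.55}]
bench, bench_xi, bench_d = harmonic_subset(normal + [{("a",): 1.0}], [1.0])
print(len(bench))  # the outlier is pruned, leaving 3
```

The `avg_d == 0` guard keeps the loop from discarding a set of identical logs, whose distances are all exactly zero.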

The third step involves creating benchmarks. After a harmonic log set for X is created, we define it as a benchmark set if it contains a reasonable number of reasonably long logs. The definition of what constitutes a reasonable number or length depends on the system's complexity. For small and simple systems, the benchmark set can be small, and the logs can be short. However, for larger and more complex systems, the benchmark set should be larger, and the logs should be longer. As such, determining what is a reasonable size and length requires experience.

Once the benchmark set is determined, we define the average operation chain distribution of the benchmark set as the benchmark distribution, and the average distance of the benchmark set as the benchmark distance.

Finally, the fourth step involves using the benchmark distribution and benchmark distance to evaluate whether a given log was created through normal operations. We define a threshold, with this paper suggesting four times the benchmark distance as a suitable value. However, this value can be adjusted for different systems. We first calculate the distribution distance between the log and the benchmark distribution; if it is less than or equal to the threshold, it is deemed to be a log file generated through normal operations. If it exceeds the threshold, it is likely a log file generated through careless mouse clicks.
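The decision rule of this fourth step reduces to a single comparison, sketched below with fabricated distance values (0.12 and 0.31 are illustrative, not measured):

```python
# Sketch of step 4: a log is deemed normal if its distribution distance to the
# benchmark distribution is within `factor` times the benchmark distance.
# The distance values in the demo calls are fabricated.

def is_normal_log(dist_to_benchmark, benchmark_distance, factor=4.0):
    """True if the log appears to be generated through normal operations."""
    return dist_to_benchmark <= factor * benchmark_distance

print(is_normal_log(0.12, 0.05))  # 0.12 <= 0.20 -> True
print(is_normal_log(0.31, 0.05))  # 0.31 >  0.20 -> False
```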

5 Experiment

We conducted an experiment by analysing and processing a genuine log file obtained from a recently developed TMS. Given the substantial size of the log file, we consider the results of this single experiment to be representative.

5.1 Log introduction

As the evaluation algorithm is based purely on the analysis of operation logs, we requested our industry partner to provide a real log file, without disclosing other information about the TMS system.

The log file is in Microsoft Excel format and contains 632,575 operation records from May 31, 2021, to December 6, 2021. The file structure is quite simple, with three columns: the first column denotes the user (operator) ID, the second represents the operation, and the last one is the time stamp of the operation. Figure 3 provides a screenshot of a portion of the log file.

Fig. 3
figure 3

A screenshot of part of the log file

As shown in Fig. 3, both the user ID and operation are given in Chinese. To process this log file, we need to convert both user ID and operation into an abstract form.

The first step involves processing users. We group all operations in the log file based on user ID, then assign each unique user a new abstract user ID in the form of “u0”, “u1”, “u2”, etc. The order of user IDs is sorted based on the number of operations recorded in the log. For clarity, we use UID in this paper to represent the newly assigned abstract user ID. Users with a higher number of operations are assigned smaller numbers in UID. The log file contains a total of 1261 unique users. The highest number of operations executed by a single user is 22,882, and there are 94 users who have executed more than 1000 operations.
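This user-abstraction step can be sketched as follows; the sample records are fabricated and the original IDs are placeholder names:

```python
# Sketch of the UID abstraction: group records by user, count operations,
# and assign "u0", "u1", ... in descending order of operation count.
# The sample records are fabricated.
from collections import Counter

records = [("张三", "login"), ("李四", "login"), ("张三", "search"),
           ("张三", "export"), ("李四", "search")]

counts = Counter(user for user, _ in records)
# Users with more operations get smaller UID numbers.
uid = {user: f"u{i}" for i, (user, _) in enumerate(counts.most_common())}
print(uid)  # {'张三': 'u0', '李四': 'u1'}
```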

Figure 4 shows a screenshot of the user table. The first column is the abstract UID, the second column is the original UID, the third column represents the number of operations executed by the user, and the last column shows the logarithm of the operation number. We took the logarithm of the operation number because its range is vast and challenging to present clearly in a linear form. Figure 5 shows the distribution of the operation number in logarithmic form. From this figure, we find that approximately 800 users have executed more than 100 operations (a logarithm value equals 2).

Fig. 4
figure 4

A screenshot of the users and the number of operations

Fig. 5
figure 5

The distribution of operation numbers on users

We adopt a similar approach to process the operations in the log file. We group all operations based on their names and assign each unique operation an Operation ID, referred to as OID, in the form of "o0", "o1", "o2", etc. The OID order is determined based on the frequency of the operation in the log file. Operations with a higher occurrence are assigned a smaller number in OID.

In total, the log file contains 275 unique operations. The operation with the highest frequency occurs 53,066 times, and there are 37 operations that occur more than 1000 times.

Figure 6 displays the first ten rows of the operation table. The first column signifies the Operation ID (OID), the second column indicates the frequency of the operation, the third column represents this frequency in logarithmic form, and the final column displays the name of the operation in Chinese.

Fig. 6
figure 6

The first 10 rows of the table of operations

Figure 7 showcases the distribution of operation occurrences expressed in logarithmic form. From this figure, we can discern that approximately 24 operations occur more than 10,000 times, while around 50 operations occur fewer than 10 times (with a logarithmic value less than 1).

Fig. 7
figure 7

The distribution of operation occurrences

The subsequent step involves generating logs, which are sequences of operations, for each user. A screenshot of these generated logs is depicted in Fig. 8. Each row represents a log that records the sequence of operations performed by a user. The first number denotes the sequence index, followed by the User ID (UID) within brackets. The length of the log comes next, and finally the sequence of operations. Given that the initial 10 operation sequences are considerably lengthy, the figure can only display the starting segment of each sequence.

Fig. 8
figure 8

A screenshot of user operation sequences
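The log-generation step described above (one time-ordered operation sequence per user) can be sketched with fabricated (UID, OID, timestamp) records:

```python
# Sketch of per-user log generation: sort records by timestamp and collect
# each user's operations into a sequence. The records are fabricated.
from collections import defaultdict

records = [("u0", "o1", 3), ("u1", "o0", 1), ("u0", "o0", 1), ("u0", "o2", 2)]

sequences = defaultdict(list)
for uid, oid, ts in sorted(records, key=lambda r: r[2]):  # time order
    sequences[uid].append(oid)

print(dict(sequences))  # {'u1': ['o0'], 'u0': ['o0', 'o2', 'o1']}
```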

5.2 Determine the weight vector

Following the pre-processing, we have generated 1261 unique operation logs based on different users. The next step involves determining the weight vector.

We commence by selecting the longest 30 logs as a potential benchmark set, under the assumption that users who carry out most operations are more likely to be genuine users, and their log results would be more fitting as a benchmark. We have calculated the proportion of distribution distances for the first three orders. The first order computes the occurrence of each operation, the second order calculates operation pairs, and the third order calculates sequences of three consecutive operations.

To ascertain the weight of each order, we first compute the pair-wise operation sequence distribution distances of the three different orders for the first 30 logs individually. We then calculate the proportion of each order distance among all three orders \({P}_{1}\left({D}_{i},{D}_{j}\right)\), \({P}_{2}\left({D}_{i},{D}_{j}\right)\), and \({P}_{3}\left({D}_{i},{D}_{j}\right)\). Some of these results are depicted in Table 1. In this table, the first column shows the compared pairs, with the proportions of the three distribution distances displayed in the following columns.

Table 1 The first three orders of operation sequence distribution distance between u0 and other 6 logs

For the 30 log files, we have 435 unique pairs. After calculating all the pair-wise distances among them, we find that the average proportion of the first-order distribution distance \({P}_{1}\left({\mathbb{D}}\right)\) is 0.7486. The proportion of the second-order distribution distance \({P}_{2}\left({\mathbb{D}}\right)\) is 0.1843, and the proportion of the third-order distribution distance \({P}_{3}\left({\mathbb{D}}\right)\) is 0.0673. To simplify the calculation and to ensure that all three orders contribute almost equally to the final weighted total distance, we have chosen the weights for the three orders as follows: \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)=(1, 4, 10)\).
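As a quick sanity check of this choice (a sketch, not part of the method itself), the reported proportions yield \({w}_{2}\approx 4.06\) and \({w}_{3}\approx 11.12\); rounding to 4 and 10 makes each order's contribution to the weighted total distance roughly equal:

```python
# Checking the reported average proportions against the chosen W = (1, 4, 10).
p1, p2, p3 = 0.7486, 0.1843, 0.0673

w2 = p1 / p2   # ~4.06, rounded to 4
w3 = p1 / p3   # ~11.12; 10 is chosen in the text for simplicity

# With W = (1, 4, 10), each order contributes a comparable share:
contributions = (1 * p1, 4 * p2, 10 * p3)
print(round(w2, 2), round(w3, 2))            # 4.06 11.12
print([round(c, 3) for c in contributions])  # [0.749, 0.737, 0.673]
```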

5.3 Get benchmarks

The next step is to establish the benchmark set. We initially consider the 30 longest operation logs as a potential benchmark set. We calculate the average distribution and the distance (using the weight vector \(\mathcal{W}=(1, 4, 10)\), which was derived from the previous step) between each of these 30 logs and the average distribution, as well as the overall average distance.

The average distance turns out to be 0.0976, while the maximum distance is 0.3336. The distances of all the 30 logs are depicted in Fig. 9. If we use twice the average distance as a threshold, there are three logs, namely u5, u9, and u16, whose distances to the average distribution exceed this threshold. As a result, we remove these three logs from the benchmark set and recalculate the average distribution using the remaining 27 logs.

Fig. 9
figure 9

The distribution distance between the first 30 user logs and the average distribution

After excluding the three logs from the potential benchmark set, we compute the distances within the updated set. The results of this calculation are presented in Fig. 10. The newly calculated average distance is 0.07677, and thus, the updated threshold is 0.1535. Using this new threshold, we further identify and remove two logs, u1 and u7, from the benchmark set.

Fig. 10
figure 10

The distribution distances between the 27 logs and their average distribution

We continue repeating the process outlined above until we eventually have a benchmark set consisting of 21 logs. These logs all have distribution distances to the average distribution that fall under the established threshold. Figure 11 illustrates these distances. The average distance is 0.05094 and the corresponding threshold is 0.1019. It's evident that all the distribution distances are under 0.1, thus affirming that the 21 logs constitute the final benchmark set. Consequently, the benchmark distance is set at 0.05094.

Fig. 11
figure 11

The distribution distances between the 21 logs and their average distribution

5.4 Evaluation of logs

Having established the benchmark set, benchmark distribution, and benchmark distance, we can now proceed to evaluate the remaining logs. We set a threshold of four times the average distance to decide whether a log has been generated through proper system usage.

Considering there are more than 1000 individual logs based on different users, we will present the results in distinct user ranges for clarity.

Figure 12 displays the normalized distance (the real distance divided by the benchmark distance) between the top 100 log files and the benchmark distribution. From this figure, we can observe that most of the logs align with proper system usage. However, there are 10 logs that could be generated by abnormal usage as their normalized distance exceeds the threshold of 4. Within this range of log files, the log with the fewest number of operations contains 979 entries.

Fig. 12
figure 12

The normalized distribution distance between u0 and u99
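The screening applied throughout this subsection can be sketched as a one-line filter. The benchmark distance is the value from Sect. 5.3; the per-log distances and user labels below are fabricated for illustration:

```python
# Sketch of the normalized-distance screening: divide each log's distance to
# the benchmark distribution by the benchmark distance, and flag logs whose
# normalized distance exceeds the threshold of 4. Distances are fabricated.
benchmark_distance = 0.05094          # from the benchmark set (Sect. 5.3)
distances = {"uA": 0.08, "uB": 0.31, "uC": 0.12}  # hypothetical logs

threshold = 4.0
flagged = [u for u, d in distances.items()
           if d / benchmark_distance > threshold]
print(flagged)  # ['uB'] -- suspected abnormal usage
```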

Similarly, Fig. 13 presents the normalized distance for the next 200 logs (ranging from u100 to u299). From this illustration, we observe that the majority of the logs indicate proper system usage. Nonetheless, there are 7 logs that could potentially stem from abnormal usage, as their normalized distance exceeds the threshold of 4. Within this range of log files, the log with the fewest number of operations contains 636 entries.

Fig. 13
figure 13

The normalized distance of u100–u299

Similarly, Fig. 14 displays the normalized distance for the next 300 logs, ranging from u300 to u599. This diagram again shows that the majority of these logs appear to be generated through proper system usage. Within this range of log files, the log with the fewest number of operations contains 350 entries.

Fig. 14
figure 14

The normalized distance of u300–u599

Figure 15 illustrates the normalized distance for the next 400 logs, spanning from u600 to u999. Notably, the pattern changes around log u800. Beyond this point, the normalized distances increase as the size of the logs decreases. Log u800 comprises 183 operations; based on this observation, we estimate that this evaluation method is applicable when a log contains more than 200 operations.

Fig. 15
figure 15

The normalized distribution distance between u600 and u999 to the benchmark distribution

6 Discussion

In this section, we discuss the advantages and limitations of the LEOSD approach. In particular, we examine how this method, which does not account for different roles within a TMS, could affect the effectiveness of identifying sloppy users. We then compare the approach with traditional methods.

6.1 Advantages of LEOSD

LEOSD represents a lightweight approach to data analysis that requires minimal semantic information from the log file. Consequently, TMS consumers have the option to code their log files such that sensitive commercial data or personally identifiable information isn't revealed when they submit their log files to a third party for analysis.

The simplicity of LEOSD not only facilitates analysis of log files from different TMSs but also allows easy comparison of results. As a system-independent approach, it's possible to establish a benchmark to evaluate and compare usage scenarios across different TMSs. This notion, although based on logical reasoning, warrants further validation through experimental data.

6.2 Limitations of LEOSD

The simplicity of LEOSD and its lack of reliance on semantic interpretation can be advantageous, but these features also serve as limitations. In terms of identifying sloppy users, this approach lends itself well to analyzing operation chain distributions, as legitimate usage is likely to follow certain patterns. However, the lack of semantic understanding of operations in the log files may hinder the detection of malicious usage, such as information theft or system damage. In such cases, introducing a more specific data analysis system might be necessary for effective log file auditing.

6.3 Role differentiation

Users of a TMS may assume different roles, with each role having its unique set of operations. Consequently, users from diverse roles could result in significantly different operation chain distributions.

LEOSD, in its current form, does not account for role differences, making it potentially inefficient for TMSs with several distinct roles. One possible solution could involve dividing users into role groups and subsequently calculating the average operation chain distribution for each group. This classification process could be executed manually by assigning each user a role based on information provided by the TMS consumer.

Alternatively, the classification could be automated if the TMS log file records each user's role or privileges. Moreover, unsupervised machine learning techniques [26] could be used to classify users based on their operation chain distribution, potentially identifying "sloppy users" as a distinct role.

6.4 Comparison with other work

The challenge of identifying sloppy users in a TMS is a novel issue faced by industry and represents a research gap in the field. To the best of our knowledge, no published work directly addresses this problem. Thus, in this paper, we can only compare our task and approach with some related research topics. These include identifying security attacks [27] from log files and recognizing abnormal outliers, such as malicious operations like shill bidding [28], amid the normal operations of genuine users [25].

To underscore the uniqueness of LEOSD, we employ a comparative table (Table 2) to highlight its differences from traditional approaches used in outlier detection.

Table 2 Comparison between LEOSD and traditional outlier detection approaches

7 Conclusion and future work

Current research on log analysis for outlier identification primarily focuses on identifying usage patterns for maintaining system security. While this is undeniably vital, we argue in this paper that log analysis should also be employed to identify outliers resulting from 'sloppy users'—users who manifest system resistance passively by being intentionally careless in data entry, contributing inaccurate and haphazard data. Considering that the quality of data greatly impacts system success, decision-making, and the operational efficiency of organizations, we propose that this form of system resistance is an emerging concern. This is particularly relevant as technology becomes increasingly integrated with industry and mandatory system use expands, a situation already evident in China's transport industry.

For this research, we made two assumptions: the majority of users are genuine, and the operation sequence distribution of sloppy users will differ from that of genuine users. We also hypothesized that these sloppy users could be identified through log file analysis.

To address these hypotheses, we proposed the LEOSD method, capable of identifying sloppy users and excluding their operations from system logs. Consequently, the refined log files offer a more reliable source for analysts to evaluate a TMS, ultimately aiding TMS providers in improving their products and developing superior solutions.

Our experiment, conducted on log files from a real-world TMS system, corroborates our hypotheses and fulfills our expectations. It demonstrates that LEOSD can effectively identify sloppy users within log files.

Looking towards future research, we believe that the LEOSD method warrants further exploration for additional applications, such as aiding TMS providers in enhancing product usability [8] and prioritizing new functional requirements [5].

Another prospective direction is standardizing TMS log files, possibly through the implementation of the Common Information Model (CIM) [29]. This could streamline the log analysis process.

Additionally, the integration of contemporary data science techniques, such as Artificial Neural Networks (ANN) [30], could be a promising approach to further enhance the identification of sloppy users.