1 Introduction

Transportation Management Systems (TMS) represent a class of software systems employed in supply chain logistics management [1]. They are designed to facilitate logistics enterprise management by optimizing transportation activities, thus increasing efficiency, enhancing customer service, minimizing costs, improving visibility, managing scheduling fees, enabling traceability, and managing deliveries [2].

Logistics plays a critical role in any national economy. Many logistics enterprises utilize various TMS platforms developed by different providers. Furthermore, logistics enterprises are numerous, and the logistics sector is rapidly evolving. Consequently, TMS platforms have a substantial market in all countries. Figure 1 illustrates the growth of the TMS market in China [3].

Fig. 1
figure 1

Market value of TMS in China from 2017 to 2022. (unit: 100M RMB; 2022 value is estimated; source: www.askci.com)

Transport management involves numerous intricate processes. A TMS is used by many different people in different roles and must have a comprehensive range of functionalities. The design and development of a user-friendly TMS that can accommodate a diverse user base is a challenging task. Traditionally, TMS development projects may start by collecting user requirements before proceeding with other development activities. However, software cost estimation is a significant challenge in this process, especially due to the diverse user base and rapidly changing demands [4]. An iterative development approach, where the user requirements can evolve over time, is more practical than a single development cycle.

In an iterative development approach, it is crucial to evaluate the current TMS version and prioritize requirements [5], guiding developers on what to add in the next version. TMS manufacturers can leverage several evaluation methods to understand their product’s 'real-time' usage, assess its user-friendliness, and guide its iterative improvement.

Methods available for TMS evaluation include user feedback collection, functional probe methods, data flow methods, and log methods.

User feedback is often manually collected by manufacturers, but this approach may be biased, inefficient, and time-consuming due to the high cost of customer time. Therefore, automated methods have been proposed to overcome these challenges, such as functional probe, data flow, and log methods.

The functional probe method involves deploying probes within the system that record each function's entry and usage, uploading usage traces to a server for real-time evaluation [6]. Despite its advantages, this method requires user consent for data security and runs the risk of interrupting the user’s operations.

The data flow method focuses on the product cycle, and exchange of data, irrespective of function usage [7]. However, this method often faces resistance from users unwilling to share sensitive data with third parties for software evaluation.

The log method uses system logs, crucial for operation backtracking and security management, to evaluate the TMS. It records user traces, not user data, making it safer, cheaper, and less disruptive than other methods [8]. However, extracting accurate and valid records from log files is challenging, especially because of the presence of 'sloppy' users, who operate the system carelessly for various reasons and thereby produce abnormal log records. This careless use tarnishes the accuracy and validity of the information contained in log files, interferes with proper evaluation, and subsequently impedes efforts at system improvement. To address this issue, it is important to effectively identify (and then eliminate) the log records created by sloppy users. To this end, we propose a novel method named Log Evaluation through Operation Sequence Distribution (LEOSD). The LEOSD method analyses the operation sequences in logs to identify sloppy users and removes their usage records to ensure accurate system evaluation.

Our method rests on the following assumption and hypotheses:

  • Assumption: A TMS used in industry may have sloppy users, but their number is usually much smaller than the number of genuine users.

  • Hypothesis A: Genuine users typically operate the TMS through meaningful patterns, generating log files of operations with a specific distribution.

  • Hypothesis B: As sloppy users operate the system carelessly, their operation sequence distribution will significantly differ from that of genuine users.

If this assumption and these hypotheses hold, then we can identify sloppy users based on the difference in their operation sequence distributions.

We conducted an experiment using a real industry log file containing 632,575 operation records, which affirmed our assumption and hypotheses.

The remainder of this paper is organized as follows: Section 2 provides contextual information to understand the research motivation; Section 3 reviews relevant literature; Section 4 introduces the LEOSD algorithm; Section 5 discusses the experiment procedure and results; Section 6 presents issues and discussions; and finally, Section 7 provides a conclusion and future research directions.

2 Context

In Chinese transportation organizations, the use of TMS is typically mandatory. However, compulsory system use does not guarantee effective and proper utilization. Some users may misuse the system due to unintentional errors, such as entering incorrect values or mistakenly selecting the wrong fields. These errors may reveal issues with the system's user-friendliness. Conversely, others may use the system ineffectively due to resistance behaviors, such as inputting arbitrary incorrect values or indiscriminately clicking fields.

User resistance to information systems is a well-documented, persistent phenomenon identified as a primary cause of information system failure [9]. User resistance refers to the behaviors of those users expressing dissatisfaction with the system [10]. The causes of user resistance are multifaceted, with various theories proposed to explain it, such as status quo bias (SQB), where resistance is viewed through the lenses of rational decision-making, cognitive misperceptions, and psychological commitment [11]. Resistance can manifest as covert behaviors like apathy and sabotage or as more overt destructive behaviors [9]. Although an in-depth discussion of reasons for user resistance is beyond the scope of this paper, acknowledging its existence aids in understanding the problem that motivated our log evaluation technique.

Research [12,13,14] has shown a spectrum of resistance behaviors, ranging from overt hostility marked by aggressive or destructive behavior to covert actions characterized by apathy and passive resistance [15]. The latter is of particular interest in this paper. Passive resistance, often stemming from fear or stress about the perceived technological 'intrusion' into one's stable world, may manifest through the input of incorrect data [14]. These users give the appearance of system engagement while covertly resisting it [12], using the system carelessly and inputting inaccurate or random data. These passive resistant users, henceforth referred to as 'sloppy users,' can significantly impact information quality and, consequently, system success [16].

Inaccurate data on a large scale can hinder organizational decision-making. The effect of sloppy users is especially pronounced in industries with large workforces, like China's transport sector. As information systems increasingly integrate with core industrial operations, system use becomes mandatory. While most studies of user resistance have occurred in voluntary contexts [17], passive user resistance in mandatory settings is an emerging and crucial problem. The implications of passive resistance concerning information quality and trust are significant and must be acknowledged.

Given the central role of TMS in effective goods transportation, the impact of sloppy users is self-evident. With the widespread use of TMS in the transport industry, increased economic pressures, the need for rapid goods movement, and large workforces, identifying system data from sloppy users is imperative, particularly in contexts like China, where system use is mandatory. Additionally, the issue of trust is tied to the reliability and credibility of data. Data reliability and credibility are significant factors influencing system effectiveness [18] and therefore must be considered in log analysis methods. To date, the development of log analysis techniques has largely overlooked the identification of sloppy user interactions. The focus in research has been on identifying system use extent, unintentional system errors, and system use patterns from a security perspective. However, as mandatory system use rises and larger workforces require information systems, recognizing the importance of log analysis aimed at identifying erroneous or inaccurate data arising from passively resistant users' activities is critical.

3 Literature background

Transport serves as the backbone of any economy, facilitating the exchange of goods and services. Transportation Management Systems (TMS) harness IT to design, plan, and implement transport systems to efficiently and effectively meet the objectives of various processes involved in transportation. However, the success of a TMS or any information system depends on users interacting with the system as intended.

The DeLone and McLean model of information system success provides a widely recognized framework for understanding system quality. The model identifies system use as a significant indicator of the success of any information system [16]. System use is related to other crucial factors of system success, including system quality, service quality, information quality, and user satisfaction, all interacting to manifest as system benefits (e.g., cost savings, efficiencies, time savings, etc.). System use refers to the extent of system use, the way the system is used, and includes metrics such as usage patterns, number of transactions executed, and the number of interactions with the system [16].

System logs serve as a valuable resource to gain insights into system use. These logs capture the events occurring within the system when in use, storing information such as when, who, and how the system is used [19]. System logs record user interactions over time without interrupting the user, providing valuable information about system use. Evaluating these logs by seeking usage patterns can provide a foundation for assessing the system's user-friendliness [20]. However, the analysis of user logs has limitations, such as the lack of information about user goals and objectives [20]. Still, identifying patterns and nature of use can inform iterative system improvements.

While patterns in logs are insightful, outliers in the data can also prove invaluable. Outliers deviate from typical values or patterns [21], allowing for the identification of user behavior that falls outside 'usual' or 'expected' patterns of use. In the quest to understand user behavior for iterative system improvement, these outliers can be just as valuable as, or even more valuable than, the 'normal' patterns found in system logs.

Once outliers are identified, subsequent investigations with users may uncover insights about unintentional careless clicks, erroneous system use, or even intentional careless usage. Identifying outliers in system logs can also help detect intrusions, fraudulent or malicious system use, and inform cybersecurity measures [19, 22]. In settings such as healthcare, outlier detection is crucial to identify anomalies like abnormal heart rhythms or medication errors, for instance [23]. Thus, the identification of anomalies or outliers in system logs is a vital and worthwhile task.

Traditional outlier detection in logs could be done manually, but this approach is time-consuming and prone to human error. Moreover, with the surge in technology use in organizations, the volume of logs generated is immense, making manual methods impractical [24]. Hence, significant efforts are being made to develop techniques for automatically identifying outliers in log data sets. Most of these efforts emphasize intrusion detection, with comparatively less focus on management system logs [19].

Various methods are used for outlier detection in data mining, including distribution-based methods (using statistical testing to identify data deviating from probability distributions), depth-based methods (considering points in k-dimensional space, with shallower points identified as outliers), distance-based methods (often using Euclidean measures), density-based methods (providing a likelihood that the object is an outlier), RNN-based methods, regression analysis, and subspace methods [19].

The most popular framework for anomaly detection based on log file analysis is summarized in Fig. 2, reproduced from Zeufack's paper [25].

Fig. 2
figure 2

Log-based anomaly detection framework (reproduced from Zeufack’s paper [25])

The proposed framework for anomaly detection in TMS log data involves four stages: log collection, log parsing, feature extraction, and anomaly detection. This approach is in line with the general framework for log-based anomaly detection. However, it introduces a crucial difference in the log parsing phase compared to traditional techniques.

Conventionally, log parsing relies on both syntax notation and semantic understanding of the log file. This approach necessitates analysts to comprehend the meaning of the log file and interpret the user's intentions for each operation. Such in-depth understanding is vital to identify malicious activities such as hacking attacks.

However, the method proposed in this research simplifies the log parsing stage. We suggest a lightweight approach that can easily analyze log files from various TMS without needing a semantic understanding of those log files. Therefore, our log parsing technique is straightforward and swift, as our parser operates only at the syntax level. It aims to extract essential information such as operation id, user id, and time sequence by splitting operations.

This feature of our approach allows for high efficiency in processing generic log files without the need for detailed knowledge about the specific TMS. Consequently, it streamlines the process of identifying anomalous behavior, particularly in the context of passive user resistance and 'sloppy users.' It offers a novel way to ensure the quality and reliability of data, crucial to system effectiveness and organizational decision-making.

4 Log evaluation through operation sequence distribution (LEOSD)

The LEOSD (Log Evaluation through Operation Sequence Distribution) algorithm proposed in this section aims to identify "sloppy users", i.e., the log files generated through careless interactions with a TMS.

4.1 Informal description of the algorithm

In this subsection, we describe the algorithm informally so that readers, even without a strong mathematics background, can quickly grasp its essence. A precise mathematical description follows in the subsequent subsections.

The algorithm contains two phases. The first phase is a learning phase, and the second phase is an evaluation phase.

  • Learning Phase: A set of log files is collected to form the training samples. The algorithm assumes that most log file samples were generated through normal operations, while a small number are abnormal, meaning they were created through careless mouse clicks or passive resistance. The learning phase calculates an average distribution of the operation sequences over the whole set of training samples, and then computes the distance between this average distribution and the distribution of each individual training sample. We then remove the samples whose distance is significantly larger than the average distance, forming a more reliable training set. This process repeats until no single log file in the training set has a distribution too far from the average. We now have a benchmark set of log files, from which we calculate the average distribution and the average distance between this distribution and the individual samples. This gives us the benchmark distribution and standard distance.

  • Evaluation Phase: In the evaluation phase, the aim is to evaluate whether a given log file was generated through normal operation or not. We first calculate the distance between the operation sequence distribution of this log file and the benchmark distribution. If the distance is much larger than the standard distance, we consider that the log file was likely generated through careless clicking, hence, it can be identified as an outlier or an abnormal interaction.

The core premise of this algorithm is the assumption that operation sequences resulting from normal interactions will have a different distribution pattern than those generated through careless or resistant user behaviour. In the next subsections, we will present a more detailed and mathematical description of the LEOSD algorithm.

4.2 Definitions and basic propositions

Definition 1

(Operation and log) Let X be a TMS; this system has \(k\) types of operations \({o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\) that will be captured by a log. They are simply called operations, and the set of all operations is denoted as \(O\). When a user uses X, the system generates a log, which is a sequence of operations. A log is denoted by \(D=\left({o}_{1},{o}_{2},\dots ,{o}_{n}\right)\) with \({o}_{i}\in O, 1\le i\le n\), while \(n=\left|D\right|\) is the number of operations in the log and is called the length of the log.

Definition 2

(Operation chains) Let X be a TMS system, with operation set \(O\). \(\forall t>0, l={(o}_{1},{o}_{2},\dots ,{o}_{t})\) is a sequence of operations with \(\forall 1\le i\le t,{o}_{i}\in O\), then \(l\) is called a tth order operation chain of X. The set of all \(t\mathrm{th}\) order operation chains of X is denoted as \({O}^{t}\). The set of all operation chains is denoted as \(\mathcal{O}=\bigcup_{t=1}^{\infty }{O}^{t}\).

Please be aware that an operation chain may contain repeated operations. For example, \(\left({o}^{1},{o}^{1}\right)\) is also a second order operation chain of X.

Proposition 1

Let X be a TMS software, \(O\) is its operation set with \(k\) different types of operations. Then for any positive integer \(t\), there are \({k}^{t}\) different types of operation chains, i.e., \(\left|{O}^{t}\right|={k}^{t}\).

Proof Let \(l={(o}_{1},{o}_{2},\dots ,{o}_{t})\) be a tth order operation chain of X. For any \(1\le i\le t\), because \({o}_{i}\in O\), \({o}_{i}\) has \(k\) different choices. Also, because an operation chain can contain repeated operations, each position is independent of the others; therefore, the total number of combinations is \({k}^{t}\).

According to Proposition 1, we know that X has \(k\) different first order operation chains, denoted as \(\left({o}^{1}\right),\left({o}^{2}\right),\dots ,\left({o}^{k}\right)\) respectively. Similarly, it has \({k}^{2}\) second order operation chains. Please note that what we discuss here are possible operation chains based on their forms; this is different from the operation chains that can actually appear in log files generated through valid operations. Because of the properties of a TMS, some operation chains are impossible in the real world. For example, it is impossible to delete an order record from a database immediately after the database is initialized, because the initialization operation removes all order records and leaves none to be deleted.

Proposition 2

Let X be a TMS, \(D=\left({o}_{1},{o}_{2},{o}_{3},\dots ,{o}_{n}\right)\) be a log of X. Then for any positive integer \(t\le n\), \(D\) contains \(\left(n-t+1\right)\) operation chains of tth order.

Proof This proposition can be proved through construction. For any positive integer \(t\le n\), \(\left({o}_{1},{o}_{2},\dots ,{o}_{t}\right),\left({o}_{2},{o}_{3},\dots ,{o}_{t+1}\right),\dots ,\left({o}_{n-t+1},{o}_{n-t+2},\dots ,{o}_{n}\right)\) are the \(\left(n-t+1\right)\) operation chains with order of \(t\) in \(D\).

According to Proposition 2, for a log with the length of \(n\), there are \(n\) first order operation chains and \(\left(n-1\right)\) second order operation chains in it.
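Proposition 2's sliding-window construction is mechanical and easy to verify in code. The following minimal Python sketch (the function name and string operation labels are illustrative, not part of the paper's notation) extracts all tth order operation chains from a log:

```python
def operation_chains(log, t):
    """Return all t-th order operation chains of a log via a sliding
    window; per Proposition 2, a log of length n contains n - t + 1."""
    return [tuple(log[i:i + t]) for i in range(len(log) - t + 1)]

log = ["o1", "o2", "o3", "o2", "o3"]
# A length-5 log yields 4 second order chains (5 - 2 + 1 = 4).
print(operation_chains(log, 2))
```

Note that repeated chains are kept, since Definition 3 below counts occurrences rather than distinct forms.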

Definition 3

(Operation-chain distribution) Let X be a TMS with its operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\) and \(\mathcal{O}\) as its operation chain set. \(D=\left({o}_{1},{o}_{2},{o}_{3},\dots ,{o}_{n}\right)\) is a log of X. Define operation-chain distribution (simply called distribution if without confusion) of \(D\) as a function from operation chain set to real numbers, denoted: \({\xi }_{D}:\mathcal{O}\to {\mathbb{R}}\). The value of the function is given in Eq. (1):

$${\xi }_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)=\left\{\begin{array}{c}\frac{{\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)}{n-t+1}, \quad t\le n\\ 0, \qquad \qquad\qquad t>n\end{array}\right.,$$
(1)

while \({\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) is the number of occurrences of operation chain \(\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t}\) in \(D\). \({\mathrm{\rm N}}_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) is also called the count of \(\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\) in \(D\).

Example: Let X be a TMS with its operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3}\right\}\). \(D=\left({o}^{1},{o}^{2},{o}^{3},{o}^{2},{o}^{3}\right)\) is a log of X. According to Definition 3, the counts of all first order operation chains in \(D\) are: \({\mathrm{\rm N}}_{D}\left({o}^{1}\right)=1\), \({\mathrm{\rm N}}_{D}\left({o}^{2}\right)={\mathrm{\rm N}}_{D}\left({o}^{3}\right)=2\). For second order operation chains, \({\mathrm{\rm N}}_{D}\left({o}^{1},{o}^{2}\right)=1\), \({\mathrm{\rm N}}_{D}\left({o}^{2},{o}^{3}\right)=2\), \({\mathrm{\rm N}}_{D}\left({o}^{3},{o}^{2}\right)=1\), and the counts of all other second order operation chains are 0. Then, based on Eq. (1), we calculate the distribution as: \({\xi }_{D}\left({o}^{1}\right)=1/5\), \({\xi }_{D}\left({o}^{2}\right)={\xi }_{D}\left({o}^{3}\right)=2/5\), \({\xi }_{D}\left({o}^{1},{o}^{2}\right)=1/4\), \({\xi }_{D}\left({o}^{2},{o}^{3}\right)=1/2\), \({\xi }_{D}\left({o}^{3},{o}^{2}\right)=1/4\), and the distribution values of all other second order operation chains are 0. Similarly, it is not difficult to calculate the distributions of higher order operation chains; we omit them to save space.
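Such distributions can be computed mechanically. The sketch below (with illustrative names) implements Eq. (1), dividing each chain's count by the number \(n-t+1\) of chains of that order; for instance, \(o^{2}\) occurs twice among the five first order chains, giving relative frequency 2/5:

```python
from collections import Counter

def chain_distribution(log, max_order):
    """Operation-chain distribution xi_D of Definition 3: map each
    observed chain of order t <= max_order to its count over (n - t + 1).
    Unobserved chains implicitly have frequency 0."""
    n = len(log)
    xi = {}
    for t in range(1, max_order + 1):
        if t > n:
            break  # Eq. (1): chains longer than the log have frequency 0
        counts = Counter(tuple(log[i:i + t]) for i in range(n - t + 1))
        for chain, c in counts.items():
            xi[chain] = c / (n - t + 1)
    return xi

log = ["o1", "o2", "o3", "o2", "o3"]
xi = chain_distribution(log, 2)
print(xi[("o2",)], xi[("o2", "o3")])  # 0.4 0.5
```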

Definition 4

(Distribution distance between two logs) Let X be a TMS with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({D}_{1}\) and \({D}_{2}\) are two logs; the two logs may have different lengths. Define \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) as an order weight vector with \(\forall t>0, {w}_{t}\ge 0\), then the distribution distance with \(\mathcal{W}\) between \({D}_{1}\) and \({D}_{2}\) is defined in Eq. (2):

$${\mathfrak{D}}_{\mathcal{W}}\left({D}_{1},{D}_{2}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t},t=1}^{\infty }{w}_{t}{\left({\xi }_{{D}_{1}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)-{\xi }_{{D}_{2}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\right)}^{2}$$
(2)

Order weight vectors play a crucial role in defining the impact of different orders of operation chains in the system. In essence, they reflect the operational requirements and usage patterns inherent to the specific software system in question.

For instance, consider a TMS that has numerous functions necessitating a fixed sequence of operations to execute. In this scenario, a standard operation would naturally create log entries containing many such operation chains, which would be reflected in the high order distribution. To capture this importance of sequence, high order components in the weight vector would carry non-zero values. This setup ensures that the model effectively recognizes and accounts for operation chains that play a crucial role in the functionality of the system.

On the other hand, let's imagine a different system where most functions can be executed with only one or two operations. In this case, the high order operation chains would not be significant. Therefore, in the weight vector, all weight components associated with order larger than 2 could be set to 0. This configuration ensures that the model prioritizes those operation chains that are most relevant to the system's use and functionality.

In essence, the selection of a suitable weight vector is dictated by the specifics of the software system in question and may require adjustment based on experience and iterative analysis. This adjustable nature allows the model to accommodate the diverse and complex usage patterns that can occur across different systems, enhancing the accuracy and applicability of the LEOSD algorithm.
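Eq. (2) can be sketched directly in code. In the minimal version below (names are illustrative), distributions are dictionaries mapping chains (tuples) to relative frequencies, and a chain absent from a distribution is treated as having frequency 0; orders beyond the weight vector's length carry weight 0, as in a vector whose higher components are zeroed:

```python
def distribution_distance(xi1, xi2, weights):
    """Distance of Eq. (2): a weighted sum of squared differences in
    chain frequencies. weights[t - 1] is the order weight w_t; chains
    of order t > len(weights) are ignored (their weight is 0)."""
    dist = 0.0
    for chain in set(xi1) | set(xi2):
        t = len(chain)
        if t <= len(weights):
            dist += weights[t - 1] * (xi1.get(chain, 0.0) - xi2.get(chain, 0.0)) ** 2
    return dist

# Two first-order-only distributions and weight vector W = (1):
xi_a = {("o1",): 0.5, ("o2",): 0.5}
xi_b = {("o1",): 1.0}
print(distribution_distance(xi_a, xi_b, [1.0]))  # (0.5-1)^2 + (0.5-0)^2 = 0.5
```

Propositions 3 and 4 follow directly from this form: a log's distance to itself is 0, and the squared differences make the distance symmetric and nonnegative.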

Proposition 3

Let X be a TMS and \(D\) is a log of X; \(\mathcal{W}\) is an order weight vector, then \({\mathfrak{D}}_{\mathcal{W}}\left(D,D\right)=0.\)

Proof This proposition follows directly from the definition; details are omitted.

Proposition 4

Let X be a TMS software and \({D}_{1},{D}_{2}\) are two logs of X; \(\mathcal{W}\) is an order weight vector, then \({\mathfrak{D}}_{\mathcal{W}}\left({D}_{1},{D}_{2}\right)={\mathfrak{D}}_{\mathcal{W}}\left({D}_{2},{D}_{1}\right)\ge 0\).

Proof This proposition follows directly from the definition; details are omitted.

Definition 5

(Average distribution of operation chains) Let X be a TMS with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. Define average distribution of \({\mathbb{D}}\) in Eq. (3)

$${\xi }_{\mathbb{D}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)=\frac{1}{m}\sum_{j=1}^{m}{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)$$
(3)

Similarly define distribution distance between a log \(D\) to the log set \({\mathbb{D}}\) as in Eq. (4):

$${\mathfrak{D}}_{\mathcal{W}}\left(D,{\mathbb{D}}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\in {O}^{t},t=1}^{\infty }{w}_{t}{\left({\xi }_{D}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)-{\xi }_{\mathbb{D}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},\dots ,{\widehat{o}}_{t}\right)\right)}^{2}$$
(4)

Definition 6

(Average distance) Let X be a TMS software with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. Define the average distance of \({\mathbb{D}}\) in Eq. (5) as:

$${\mathfrak{D}}_{\mathcal{W}}\left({\mathbb{D}}\right)=\frac{1}{m}\sum_{j=1}^{m}{\mathfrak{D}}_{\mathcal{W}}\left({D}_{j},{\mathbb{D}}\right)$$
(5)

Definition 7

(Harmonic log set) Let X be a TMS software with operation set \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\). \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are \(m\) logs of X. \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is an order weight vector. \({\mathbb{D}}\) is called a harmonic log set if \(\forall i, 1\le i\le m\), \({\mathfrak{D}}_{\mathcal{W}}\left({D}_{i},{\mathbb{D}}\right)<2{\mathfrak{D}}_{\mathcal{W}}\left({\mathbb{D}}\right)\).

4.3 Formal description of the algorithm

Let X be a TMS with operation set of \(O=\left\{{o}^{1},{o}^{2},{o}^{3},\dots ,{o}^{k}\right\}\); \({\mathbb{D}}=\left({D}_{1},{D}_{2},\dots ,{D}_{m}\right)\) are logs of X, \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\) is order weight vector.

The initial step involves determining the comparison order and the weight vector. Our estimation is that calculating the first three orders should suffice for most systems. This implies that for the weight vector \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)\), \({w}_{t}=0\) for all \(t>3\). For the first three orders, we aim to select weights such that each order contributes similarly to the final distance. To accomplish this, for each pair of logs in \({\mathbb{D}}\), we first calculate their distribution distances at the three distinct orders. For instance, for logs \({D}_{i},{D}_{j}\in {\mathbb{D}}\), their distances for the first three orders are given by Eqs. (6)–(8):

$${\mathfrak{D}}_{1}\left({D}_{i},{D}_{j}\right)=\sum_{\widehat{o}\in O}{\left({\xi }_{{D}_{i}}\left(\widehat{o}\right)-{\xi }_{{D}_{j}}\left(\widehat{o}\right)\right)}^{2}$$
(6)
$${\mathfrak{D}}_{2}\left({D}_{i},{D}_{j}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)\in {O}^{2}}{\left({\xi }_{{D}_{i}}\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)-{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2}\right)\right)}^{2}$$
(7)
$${\mathfrak{D}}_{3}\left({D}_{i},{D}_{j}\right)=\sum_{\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)\in {O}^{3}}{\left({\xi }_{{D}_{i}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)-{\xi }_{{D}_{j}}\left({\widehat{o}}_{1},{\widehat{o}}_{2},{\widehat{o}}_{3}\right)\right)}^{2}$$
(8)

Then define the proportion of each order distance as follows:

$${P}_{t}\left({D}_{i},{D}_{j}\right)={\mathfrak{D}}_{t}\left({D}_{i},{D}_{j}\right)/\left({\mathfrak{D}}_{1}\left({D}_{i},{D}_{j}\right)+{\mathfrak{D}}_{2}\left({D}_{i},{D}_{j}\right)+{\mathfrak{D}}_{3}\left({D}_{i},{D}_{j}\right)\right), \quad t=1,2,3$$

We then calculate the average proportion of each order distance, denoted as \({P}_{1}\left({\mathbb{D}}\right)\), \({P}_{2}\left({\mathbb{D}}\right)\), and \({P}_{3}\left({\mathbb{D}}\right)\). We select \({w}_{1}=1\), \({w}_{2}={P}_{1}\left({\mathbb{D}}\right)/{P}_{2}\left({\mathbb{D}}\right)\), and \({w}_{3}={P}_{1}\left({\mathbb{D}}\right)/{P}_{3}\left({\mathbb{D}}\right)\). To simplify the calculation, \({w}_{2}\) and \({w}_{3}\) can be rounded to the nearest integers.

The second step involves creating a harmonic log set for X. This is accomplished by calculating the average distribution of \({\mathbb{D}}\), followed by its average distance. After this, we can evaluate whether \({\mathbb{D}}\) is a harmonic log set. If it is, the second step is complete. If it isn't, we compare the distances of individual logs in \({\mathbb{D}}\) to the average distribution of \({\mathbb{D}}\) and eliminate those logs whose distances to the average distribution exceed twice the average distance of \({\mathbb{D}}\). After removing these logs, we are left with a smaller log set. We then repeat the second step until a harmonic log set for X is obtained.
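A minimal sketch of this pruning loop, assuming the distribution of each log has already been computed (as a dictionary from chains to frequencies) and using illustrative names, might look as follows:

```python
def avg_distribution(dists):
    """Average distribution of a log set (Eq. 3): mean frequency per chain."""
    m = len(dists)
    avg = {}
    for xi in dists:
        for chain, f in xi.items():
            avg[chain] = avg.get(chain, 0.0) + f / m
    return avg

def distance(xi1, xi2, weights):
    """Weighted squared-difference distance between two distributions
    (Eqs. 2 and 4); weights[t - 1] is the order weight w_t."""
    d = 0.0
    for chain in set(xi1) | set(xi2):
        t = len(chain)
        if t <= len(weights):
            d += weights[t - 1] * (xi1.get(chain, 0.0) - xi2.get(chain, 0.0)) ** 2
    return d

def harmonic_subset(dists, weights):
    """Step 2 of LEOSD: repeatedly drop logs whose distance to the set's
    average distribution exceeds twice the average distance, until the
    remaining set is harmonic (Definition 7)."""
    dists = list(dists)
    while True:
        avg = avg_distribution(dists)
        d = [distance(xi, avg, weights) for xi in dists]
        avg_d = sum(d) / len(d)
        keep = [xi for xi, di in zip(dists, d) if avg_d == 0 or di < 2 * avg_d]
        if len(keep) == len(dists):
            return dists, avg, avg_d  # harmonic set, its distribution and distance
        dists = keep

# Hypothetical example: three similar distributions and one outlier.
normal = [{("a",): 0.5, ("b",): 0.5},
          {("a",): 0.55, ("b",): 0.45},
          {("a",): 0.45, ("b",): 0.55}]
bench, bench_xi, bench_d = harmonic_subset(normal + [{("a",): 1.0}], [1.0])
print(len(bench))  # the outlier is pruned, leaving 3
```

The `avg_d == 0` guard keeps the loop from discarding a set of identical logs, whose distances are all exactly zero.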

The third step involves creating benchmarks. After a harmonic log set for X is created, we define it as a benchmark set if it contains a reasonable number of reasonably long logs. The definition of what constitutes a reasonable number or length depends on the system's complexity. For small and simple systems, the benchmark set can be small, and the logs can be short. However, for larger and more complex systems, the benchmark set should be larger, and the logs should be longer. As such, determining what is a reasonable size and length requires experience.

Once the benchmark set is determined, we define the average operation chain distribution of the benchmark set as the benchmark distribution, and the average distance of the benchmark set as the benchmark distance.

Finally, the fourth step involves using the benchmark distribution and benchmark distance to evaluate whether a given log was created through normal operations. We define a threshold, with this paper suggesting four times the benchmark distance as a suitable value. However, this value can be adjusted for different systems. We first calculate the distribution distance between the log and the benchmark distribution; if it is less than or equal to the threshold, it is deemed to be a log file generated through normal operations. If it exceeds the threshold, it is likely a log file generated through careless mouse clicks.
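The decision rule of this fourth step reduces to a single comparison, sketched below with fabricated distance values (0.12 and 0.31 are illustrative, not measured):

```python
# Sketch of step 4: a log is deemed normal if its distribution distance to the
# benchmark distribution is within `factor` times the benchmark distance.
# The distance values in the demo calls are fabricated.

def is_normal_log(dist_to_benchmark, benchmark_distance, factor=4.0):
    """True if the log appears to be generated through normal operations."""
    return dist_to_benchmark <= factor * benchmark_distance

print(is_normal_log(0.12, 0.05))  # 0.12 <= 0.20 -> True
print(is_normal_log(0.31, 0.05))  # 0.31 >  0.20 -> False
```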

5 Experiment

We conducted an experiment by analysing and processing a genuine log file obtained from a recently developed TMS. Given the substantial size of the log file, we consider the results of this single experiment to be representative.

5.1 Log introduction

As the evaluation algorithm is based purely on the analysis of operation logs, we requested our industry partner to provide a real log file, without disclosing other information about the TMS system.

The log file is in Microsoft Excel format and contains 632,575 operation records from May 31, 2021, to December 6, 2021. The file structure is quite simple, with three columns: the first column denotes the user (operator) ID, the second represents the operation, and the last one is the time stamp of the operation. Figure 3 provides a screenshot of a portion of the log file.

Fig. 3
figure 3

A screenshot of part of the log file

As shown in Fig. 3, both the user ID and operation are given in Chinese. To process this log file, we need to convert both user ID and operation into an abstract form.

The first step involves processing users. We group all operations in the log file based on user ID, then assign each unique user a new abstract user ID in the form of “u0”, “u1”, “u2”, etc. The order of user IDs is sorted based on the number of operations recorded in the log. For clarity, we use UID in this paper to represent the newly assigned abstract user ID. Users with a higher number of operations are assigned smaller numbers in UID. The log file contains a total of 1261 unique users. The highest number of operations executed by a single user is 22,882, and there are 94 users who have executed more than 1000 operations.
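This user-abstraction step can be sketched as follows; the sample records are fabricated and the original IDs are placeholder names:

```python
# Sketch of the UID abstraction: group records by user, count operations,
# and assign "u0", "u1", ... in descending order of operation count.
# The sample records are fabricated.
from collections import Counter

records = [("张三", "login"), ("李四", "login"), ("张三", "search"),
           ("张三", "export"), ("李四", "search")]

counts = Counter(user for user, _ in records)
# Users with more operations get smaller UID numbers.
uid = {user: f"u{i}" for i, (user, _) in enumerate(counts.most_common())}
print(uid)  # {'张三': 'u0', '李四': 'u1'}
```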

Figure 4 shows a screenshot of the user table. The first column is the abstract UID, the second column is the original UID, the third column represents the number of operations executed by the user, and the last column shows the logarithm of the operation number. We took the logarithm of the operation number because its range is vast and challenging to present clearly in a linear form. Figure 5 shows the distribution of the operation number in logarithmic form. From this figure, we find that approximately 800 users have executed more than 100 operations (a logarithm value equals 2).

Fig. 4
figure 4

A screenshot of the users and the number of operations

Fig. 5
figure 5

The distribution of operation numbers on users

We adopt a similar approach to process the operations in the log file. We group all operations based on their names and assign each unique operation an Operation ID, referred to as OID, in the form of "o0", "o1", "o2", etc. The OID order is determined based on the frequency of the operation in the log file. Operations with a higher occurrence are assigned a smaller number in OID.

In total, the log file contains 275 unique operations. The operation with the highest frequency occurs 53,066 times, and there are 37 operations that occur more than 1000 times.

Figure 6 displays the first ten rows of the operation table. The first column signifies the Operation ID (OID), the second column indicates the frequency of the operation, the third column represents this frequency in logarithmic form, and the final column displays the name of the operation in Chinese.

Fig. 6
figure 6

The first 10 rows of the table of operations

Figure 7 showcases the distribution of operation occurrences expressed in logarithmic form. From this figure, we can discern that approximately 24 operations occur more than 10,000 times, while around 50 operations occur fewer than 10 times (with a logarithmic value less than 1).

Fig. 7
figure 7

The distribution of operation occurrences

The subsequent step involves generating logs, which are sequences of operations, for each user. A screenshot of these generated logs is depicted in Fig. 8. Each row represents a log that records the sequence of operations performed by a user. The first number denotes the sequence index, followed by the User ID (UID) within brackets. The length of the log comes next, and finally the sequence of operations. Given that the initial 10 operation sequences are considerably lengthy, the figure can only display the starting segment of each sequence.

Fig. 8
figure 8

A screenshot of user operation sequences
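The log-generation step described above (one time-ordered operation sequence per user) can be sketched with fabricated (UID, OID, timestamp) records:

```python
# Sketch of per-user log generation: sort records by timestamp and collect
# each user's operations into a sequence. The records are fabricated.
from collections import defaultdict

records = [("u0", "o1", 3), ("u1", "o0", 1), ("u0", "o0", 1), ("u0", "o2", 2)]

sequences = defaultdict(list)
for uid, oid, ts in sorted(records, key=lambda r: r[2]):  # time order
    sequences[uid].append(oid)

print(dict(sequences))  # {'u1': ['o0'], 'u0': ['o0', 'o2', 'o1']}
```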

5.2 Determine the weight vector

Following the pre-processing, we have generated 1261 unique operation logs based on different users. The next step involves determining the weight vector.

We commence by selecting the longest 30 logs as a potential benchmark set, under the assumption that users who carry out most operations are more likely to be genuine users, and their log results would be more fitting as a benchmark. We have calculated the proportion of distribution distances for the first three orders. The first order computes the occurrence of each operation, the second order calculates operation pairs, and the third order calculates sequences of three consecutive operations.

To ascertain the weight of each order, we first compute the pair-wise operation sequence distribution distances of the three different orders for the first 30 logs individually. We then calculate the proportion of each order distance among all three orders \({P}_{1}\left({D}_{i},{D}_{j}\right)\), \({P}_{2}\left({D}_{i},{D}_{j}\right)\), and \({P}_{3}\left({D}_{i},{D}_{j}\right)\). Some of these results are depicted in Table 1. In this table, the first column shows the compared pairs, with the proportions of the three distribution distances displayed in the following columns.

Table 1 The first three orders of operation sequence distribution distance between u0 and other 6 logs

For the 30 log files, we have 435 unique pairs. After calculating all the pair-wise distances among them, we find that the average proportion of the first-order distribution distance \({P}_{1}\left({\mathbb{D}}\right)\) is 0.7486. The proportion of the second-order distribution distance \({P}_{2}\left({\mathbb{D}}\right)\) is 0.1843, and the proportion of the third-order distribution distance \({P}_{3}\left({\mathbb{D}}\right)\) is 0.0673. To simplify the calculation and to ensure that all three orders contribute almost equally to the final weighted total distance, we have chosen the weights for the three orders as follows: \(\mathcal{W}=\left({w}_{1},{w}_{2},\dots \right)=(1, 4, 10)\).
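As a quick sanity check of this choice (a sketch, not part of the method itself), the reported proportions yield \({w}_{2}\approx 4.06\) and \({w}_{3}\approx 11.12\); rounding to 4 and 10 makes each order's contribution to the weighted total distance roughly equal:

```python
# Checking the reported average proportions against the chosen W = (1, 4, 10).
p1, p2, p3 = 0.7486, 0.1843, 0.0673

w2 = p1 / p2   # ~4.06, rounded to 4
w3 = p1 / p3   # ~11.12; 10 is chosen in the text for simplicity

# With W = (1, 4, 10), each order contributes a comparable share:
contributions = (1 * p1, 4 * p2, 10 * p3)
print(round(w2, 2), round(w3, 2))            # 4.06 11.12
print([round(c, 3) for c in contributions])  # [0.749, 0.737, 0.673]
```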

5.3 Get benchmarks

The next step is to establish the benchmark set. We initially consider the 30 longest operation logs as a potential benchmark set. We calculate the average distribution and the distance (using the weight vector \(\mathcal{W}=(1, 4, 10)\), which was derived from the previous step) between each of these 30 logs and the average distribution, as well as the overall average distance.

The average distance turns out to be 0.0976, while the maximum distance is 0.3336. The distances of all the 30 logs are depicted in Fig. 9. If we use twice the average distance as a threshold, there are three logs, namely u5, u9, and u16, whose distances to the average distribution exceed this threshold. As a result, we remove these three logs from the benchmark set and recalculate the average distribution using the remaining 27 logs.

Fig. 9
figure 9

The distribution distance between the first 30 user logs and the average distribution

After excluding the three logs from the potential benchmark set, we compute the distances within the updated set. The results of this calculation are presented in Fig. 10. The newly calculated average distance is 0.07677, and thus, the updated threshold is 0.1535. Using this new threshold, we further identify and remove two logs, u1 and u7, from the benchmark set.

Fig. 10
figure 10

The distribution distances between the 27 logs and their average distribution

We continue repeating the process outlined above until we eventually have a benchmark set consisting of 21 logs. These logs all have distribution distances to the average distribution that fall under the established threshold. Figure 11 illustrates these distances. The average distance is 0.05094 and the corresponding threshold is 0.1019. It's evident that all the distribution distances are under 0.1, thus affirming that the 21 logs constitute the final benchmark set. Consequently, the benchmark distance is set at 0.05094.

Fig. 11
figure 11

The distribution distances between the 21 logs and their average distribution

5.4 Evaluation of logs

Having established the benchmark set, benchmark distribution, and benchmark distance, we can now proceed to evaluate the remaining logs. We set a threshold of four times the average distance to decide whether a log has been generated through proper system usage.

Considering there are more than 1000 individual logs based on different users, we will present the results in distinct user ranges for clarity.

Figure 12 displays the normalized distance (the real distance divided by the benchmark distance) between the top 100 log files and the benchmark distribution. From this figure, we can observe that most of the logs align with proper system usage. However, there are 10 logs that could be generated by abnormal usage as their normalized distance exceeds the threshold of 4. Within this range of log files, the log with the fewest number of operations contains 979 entries.

Fig. 12
figure 12

The normalized distribution distance between u0 and u99
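The screening applied throughout this subsection can be sketched as a one-line filter. The benchmark distance is the value from Sect. 5.3; the per-log distances and user labels below are fabricated for illustration:

```python
# Sketch of the normalized-distance screening: divide each log's distance to
# the benchmark distribution by the benchmark distance, and flag logs whose
# normalized distance exceeds the threshold of 4. Distances are fabricated.
benchmark_distance = 0.05094          # from the benchmark set (Sect. 5.3)
distances = {"uA": 0.08, "uB": 0.31, "uC": 0.12}  # hypothetical logs

threshold = 4.0
flagged = [u for u, d in distances.items()
           if d / benchmark_distance > threshold]
print(flagged)  # ['uB'] -- suspected abnormal usage
```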

Similarly, Fig. 13 presents the normalized distance for the next 200 logs (ranging from u100 to u299). From this illustration, we observe that the majority of the logs indicate proper system usage. Nonetheless, there are 7 logs that could potentially stem from abnormal usage, as their normalized distance exceeds the threshold of 4. Within this range of log files, the log with the fewest number of operations contains 636 entries.

Fig. 13
figure 13

The normalized distance of u100–u299

Similarly, Fig. 14 displays the normalized distance for the next 300 logs, ranging from u300 to u599. This diagram again shows that the majority of these logs appear to be generated through proper system usage. Within this range of log files, the log with the fewest number of operations contains 350 entries.

Fig. 14
figure 14

The normalized distance of u300–u599

Figure 15 illustrates the normalized distance for the next 400 logs, spanning from u600 to u999. Notably, the pattern changes around log u800. Beyond this point, the normalized distances increase as the size of the logs decreases. Log u800 comprises 183 operations; based on this observation, we estimate that this evaluation method is applicable when a log contains more than 200 operations.

Fig. 15
figure 15

The normalized distribution distance between u600 and u999 to the benchmark distribution

6 Discussion

In this section, we discuss the advantages and limitations of the LEOSD approach. In particular, we examine how this method, which does not account for different roles within a TMS, could affect the effectiveness of identifying sloppy users. We then compare the approach with traditional methods.

6.1 Advantages of LEOSD

LEOSD represents a lightweight approach to data analysis that requires minimal semantic information from the log file. Consequently, TMS consumers have the option to code their log files such that sensitive commercial data or personally identifiable information isn't revealed when they submit their log files to a third party for analysis.

The simplicity of LEOSD not only facilitates analysis of log files from different TMSs but also allows easy comparison of results. As a system-independent approach, it's possible to establish a benchmark to evaluate and compare usage scenarios across different TMSs. This notion, although based on logical reasoning, warrants further validation through experimental data.

6.2 Limitations of LEOSD

The simplicity of LEOSD and its lack of reliance on semantic interpretation can be advantageous, but these features also serve as limitations. In terms of identifying sloppy users, this approach lends itself well to analyzing operation chain distributions, as legitimate usage is likely to follow certain patterns. However, the lack of semantic understanding of operations in the log files may hinder the detection of malicious usage, such as information theft or system damage. In such cases, introducing a more specific data analysis system might be necessary for effective log file auditing.

6.3 Role differentiation

Users of a TMS may assume different roles, with each role having its unique set of operations. Consequently, users from diverse roles could result in significantly different operation chain distributions.

LEOSD, in its current form, does not account for role differences, making it potentially inefficient for TMSs with several distinct roles. One possible solution could involve dividing users into role groups and subsequently calculating the average operation chain distribution for each group. This classification process could be executed manually by assigning each user a role based on information provided by the TMS consumer.

Alternatively, the classification could be automated if the TMS log file records each user's role or privileges. Moreover, unsupervised machine learning techniques [26] could be used to classify users based on their operation chain distribution, potentially identifying "sloppy users" as a distinct role.

6.4 Comparison with other work

The challenge of identifying sloppy users in a TMS is a novel issue faced by industry and represents a research gap in the field. To the best of our knowledge, no published work directly addresses this problem. Thus, in this paper, we can only compare our task and approach with some related research topics. These include identifying security attacks [27] from log files and recognizing abnormal outliers, such as malicious operations like shill bidding [28], amid the normal operations of genuine users [25].

To underscore the uniqueness of LEOSD, we employ a comparative table (Table 2) to highlight its differences from traditional approaches used in outlier detection.

Table 2 Comparison between LEOSD and traditional outlier detection approaches

7 Conclusion and future work

Current research on log analysis for outlier identification primarily focuses on identifying usage patterns for maintaining system security. While this is undeniably vital, we argue in this paper that log analysis should also be employed to identify outliers resulting from 'sloppy users'—users who manifest system resistance passively by being intentionally careless in data entry, contributing inaccurate and haphazard data. Considering that the quality of data greatly impacts system success, decision-making, and the operational efficiency of organizations, we propose that this form of system resistance is an emerging concern. This is particularly relevant as technology becomes increasingly integrated with industry and mandatory system use expands, a situation already evident in China's transport industry.

For this research, we made two assumptions: the majority of users are genuine, and the operation sequence distribution of sloppy users will differ from that of genuine users. We also hypothesized that these sloppy users could be identified through log file analysis.

To address these hypotheses, we proposed the LEOSD method, capable of identifying sloppy users and excluding their operations from system logs. Consequently, the refined log files offer a more reliable source for analysts to evaluate a TMS, ultimately aiding TMS providers in improving their products and developing superior solutions.

Our experiment, conducted on log files from a real-world TMS system, corroborates our hypotheses and fulfills our expectations. It demonstrates that LEOSD can effectively identify sloppy users within log files.

Looking towards future research, we believe that the LEOSD method warrants further exploration for additional applications, such as aiding TMS providers in enhancing product usability [8] and prioritizing new functional requirements [5].

Another prospective direction is standardizing TMS log files, possibly through the implementation of the Common Information Model (CIM) [29]. This could streamline the log analysis process.

Additionally, the integration of contemporary data science techniques, such as Artificial Neural Networks (ANN) [30], could be a promising approach to further enhance the identification of sloppy users.