1 Introduction

Mobile communication devices have become an integral and indispensable part of everyday communication. For example, the number of smartphone users in Germany has risen steadily since the introduction of the smartphone and currently exceeds 60 million [19], which corresponds to around 72% of Germany’s current population. This is accompanied by a steady increase in the use of these devices for planning, commissioning and carrying out criminal activities, and thus in the number of devices to be analyzed in the course of criminal investigations. The challenge is to search the huge amount of communication data on a device for the often small amount of case-relevant information. In the case of organized crime, entire networks of mobile communication devices usually have to be examined.

The forensic investigation of mobile communication devices includes, on the one hand, the physical and logical backup and reconstruction of data on mobile devices, such as smartphones or tablets, and, on the other hand, the content analysis of text, image, audio and video data. Together with available metadata, such as timestamps, log files, geo and contact data, apps used, etc., a variety of forensic questions can be answered and a kind of digital profile of the user can be generated.

While much of the existing work addresses the provisioning and recovery of data on mobile devices [10, 12, 15] or the exploration of database structures [1, 2, 5, 20], little work is dedicated to the analysis of their content.

In text-based communication, such as SMS or various messenger services, understanding the content of a conversation depends heavily on the existence of a chat history that is as complete as possible. This is often not the case, for many reasons. However, reconstruction is possible, especially in group chats, by linking different devices. A positive side effect is the drastic reduction of the analysis effort, since it is then sufficient to analyze the chat history of a single conversation participant. The subsequent detection of coherent conversations enables a more error-tolerant search while preserving the context. This is particularly necessary for the later assessment of meaning and significance by an investigator.

When assessing the case-specific relevance of parts of the communication, classic machine learning models fail due to the lack of annotated training data and the special characteristics of mobile communication in the forensic context. This can be remedied by a knowledge model that incorporates the investigator’s experience and case-specific knowledge. The typically challenging and time-consuming creation of such a model can in turn be supported by a term recommendation system.

In this paper, we present MoNA, a prototype application that implements the aforementioned problem-solving approaches.

2 MoNA’s Analysis Process

The focus of the analysis process integrated in MoNA is on linking communication data and reducing it to case-relevant data. As can be seen in Fig. 1, the entire process can be divided into several steps. Most of these steps are performed automatically, but some of them actively involve the user.

Fig. 1. Interactive process for intelligent analysis of communication data. After the extraction and recovery of chat histories, identical chats from different devices are merged. This is followed by the detection of conversations for each chat, each of which contains contiguous messages. Finally, a knowledge model is used to classify conversations that are relevant to the process and return them to the user for evaluation.

After the initial extraction of chat histories from the respective databases on each device, deleted content is recovered using the Forensic SQLite Data Recovery Tool [14] to fill in gaps. By comparing all entries in the backups with the current main database, deleted messages or entire chat histories can be recognized and restored. If every backup has a time stamp, the period in which the respective data was deleted can also be determined.
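The comparison of backups with the main database can be illustrated with a minimal sketch. It is not the procedure of the Forensic SQLite Data Recovery Tool [14]; the function name `find_deleted`, the representation of each database as a set of message IDs, and the sample timestamps are assumptions made purely for illustration.

```python
from datetime import datetime


def find_deleted(main_ids, backups):
    """Map each deleted message ID to its deletion window.

    main_ids: set of message IDs in the current main database.
    backups:  list of (timestamp, set_of_message_ids) pairs, in any order.
    Returns {message_id: (last_seen, first_missing)}, where first_missing is
    None if the message was still present in the most recent backup.
    """
    ordered = sorted(backups, key=lambda backup: backup[0])

    # Remember the latest backup in which each missing message still occurs.
    last_seen = {}
    for stamp, ids in ordered:
        for message_id in ids - main_ids:
            last_seen[message_id] = stamp

    # The deletion happened between that backup and the next later one.
    windows = {}
    for message_id, seen in last_seen.items():
        later = [stamp for stamp, _ in ordered if stamp > seen]
        windows[message_id] = (seen, later[0] if later else None)
    return windows


# Hypothetical sample data: three timestamped backups and the main database.
main = {1, 2}
backups = [
    (datetime(2021, 1, 2), {1, 2, 3}),
    (datetime(2021, 1, 1), {1, 3, 4}),
    (datetime(2021, 1, 3), {1, 2}),
]
windows = find_deleted(main, backups)
```

In this sample, message 4 is last seen in the backup of 1 January and message 3 in the backup of 2 January, so each deletion can be narrowed down to the interval up to the following backup.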

Subsequently, identical chats from different devices are each merged into a single common chat, which represents the maximum amount of available information.

This is followed by conversation detection, which divides a chat into several coherent conversations. A conversation consists of a set of related messages, which are in a common temporal context. It is precisely these conversations, and not the individual messages, that are ultimately used to determine relevance [16,17,18].

In order to correctly classify conversations as case relevant, a knowledge model is required that enables arbitrarily complex search queries that go far beyond classic text searches. In MoNA, this model is called a term tree. When creating the term tree, the user’s knowledge can be included, e.g., specific case knowledge. After applying the term tree, the system automatically returns all conversations that contain at least one message classified as case relevant. As mentioned at the beginning, the user only has to assess a usually significantly reduced number of conversations compared to the total amount of data.

2.1 Reconstruction of Chat Histories

It is entirely possible that certain data cannot be restored on a mobile device. Nevertheless, incomplete or deleted chats can be reconstructed, provided that not one but several devices are present in the context of the forensic examination. For this purpose, MoNA searches for identical chats on all of these devices.

Fig. 2. Complete reconstruction of a group chat using three devices. Devices A, B and C each contain their own versions of the common group chat, which have gaps at different points in time. A gap marked as dashed and with “X” represents a deleted or not received message. Merging the different versions leads to the reconstruction of the complete chat history.

Subsequently, all messages of the same chats are compared. The basic idea is that messages that were deleted on one mobile device can still be present in the same chat on other devices. Therefore, if a chat exists on two devices, but certain messages of the chat are only on one device, these messages were certainly deleted or not received on the other device. To check whether two messages are identical, a comparison of the message ID or the message content in combination with the transmission time is suitable, depending on the circumstances. If deleted or not received messages were found, MoNA automatically inserts them into the respective gapped chats. The chance of being able to completely reconstruct as many chats as possible is higher the more devices are included in the analysis process. Since the content of identical chats is guaranteed to be the same at the end of the process step, both the user and MoNA only need to analyze these chats in a single, complete version. For demonstration purposes, Fig. 2 shows an example of reconstructing a group chat using three devices, each of which contains parts of the entire chat history.
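The merging step can be sketched as follows. This is an illustrative simplification and not MoNA's actual implementation: the `Message` fields, the helper names `message_key` and `merge_chat`, and the sample data are assumptions. Identity is decided by the message ID where one is available, and otherwise by content in combination with transmission time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    msg_id: Optional[str]  # messenger-assigned ID, if available (assumed field)
    sent_at: int           # transmission time, e.g. Unix seconds
    text: str


def message_key(message):
    """Identity of a message: its ID, or content plus transmission time."""
    if message.msg_id is not None:
        return ("id", message.msg_id)
    return ("content", message.sent_at, message.text)


def merge_chat(versions):
    """Merge the per-device versions of one chat into a single history."""
    merged = {}
    for messages in versions:
        for message in messages:
            # Keep the first copy seen; later identical copies are duplicates.
            merged.setdefault(message_key(message), message)
    return sorted(merged.values(), key=lambda message: message.sent_at)


# Hypothetical versions of one group chat on devices A, B and C,
# each with a gap at a different position.
device_a = [Message("1", 100, "hi"), Message("3", 300, "ok")]
device_b = [Message("2", 200, "where?"), Message("3", 300, "ok")]
device_c = [Message("1", 100, "hi"), Message("2", 200, "where?")]
history = merge_chat([device_a, device_b, device_c])
```

Because every device can afterwards be given the same merged history, only one version of the chat has to be analyzed.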

2.2 Conversation Detection

After the recovery and reconstruction of individual chat messages \(m \in M\) described in the previous sections, the messages are grouped into individual conversations as shown in Equation 1.

$$\begin{aligned} c = (m_1, \ldots , m_n \mid t^m_{i+1} - t^m_i \le \epsilon , \ \forall i = 1, \ldots , n-1) \end{aligned}$$
(1)

Each conversation thus consists of the messages that were exchanged consecutively at times \(t^m_i\) and \(t^m_{i+1}\) without exceeding an individually determined maximum response time \(\epsilon\), as detailed in [16]. This grouping makes the use of conservative word matching algorithms more promising, since the larger number of words in a conversation increases the probability of a search hit compared to individual messages.
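The grouping by maximum response time can be sketched in a few lines. The sketch assumes chronologically sorted timestamps and a single fixed \(\epsilon\) for the whole chat, whereas [16] determines \(\epsilon\) individually; the function name `detect_conversations` and the sample values are illustrative.

```python
def detect_conversations(timestamps, epsilon):
    """Split a chat into conversations based on the gap between messages.

    timestamps: chronologically sorted transmission times of the messages.
    epsilon:    maximum response time, in the same unit as the timestamps.
    Returns a list of conversations, each a list of message indices.
    """
    conversations = []
    current = []
    for index, stamp in enumerate(timestamps):
        # A gap larger than epsilon closes the current conversation.
        if current and stamp - timestamps[current[-1]] > epsilon:
            conversations.append(current)
            current = []
        current.append(index)
    if current:
        conversations.append(current)
    return conversations


# Hypothetical timestamps in seconds, with a maximum response time of 60 s.
groups = detect_conversations([0, 30, 50, 500, 520, 2000], epsilon=60)
```

Here the six messages fall into three conversations, because the gaps of 450 s and 1480 s exceed the assumed response time.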

2.3 Term Tree

After determining the conversations \(C = \{c_1, ..., c_n\}\), the next step is to find out which of these conversations contain messages that are relevant to the crime under investigation. Determining the conversations relevant to a specific case requires the inclusion of investigator knowledge for several reasons, as detailed in [16].

Taking into account the fact that a relevance decision is based to a not insignificant extent on empirical knowledge, MoNA relies on a rule-based approach that allows complex systems of syntagms to be described in a tree structure. This knowledge structure, called a term tree, is represented in Fig. 3. A syntagma is a set of linguistic elements (here terms) that occur together in a local context (here message). A term is represented not only by a word, but by a vector \(\mathbf {t}=(w_0,...,w_n,p_0,...,p_k)\), where \(w_i\) denotes a set of linguistic variations (word variants, synonyms, group-specific expressions, etc.) and \(p_i\) denotes a set of pattern definitions (here regular expressions).

Fig. 3. Term tree as a central classification element for deciding relevance of individual conversations.

A syntagma syn is then the obligatory combination of different terms \(t_i\) in a message \(m_j\) in the sense of a conjunction, i.e. \(syn=t_0 \wedge t_1 \wedge ...\). A term tree \(\xi =syn_0 \vee syn_1 \vee ...\) is then the disjunction of different syntagms. This principle can be applied recursively, i.e. \(\xi _{total}=\xi _0\vee \xi _1\vee ...\), which allows the reuse of cross-case or case-independent knowledge encoded in this way and successively limits the generation effort to term trees for case-specific knowledge.

If at least one syntagma matches a conversation, it is classified as case-relevant and highlighted accordingly.
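The evaluation of a term tree can be illustrated with a minimal sketch. It simplifies the term vector to one set of word variants and one list of regular expressions; the class `Term`, the function names, and the sample terms and messages are assumptions for illustration and do not reflect MoNA's internal data structures.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Term:
    variants: set = field(default_factory=set)    # word variants, synonyms
    patterns: list = field(default_factory=list)  # regular expressions

    def matches(self, message):
        """A term matches if any variant or any pattern occurs in the message."""
        words = set(re.findall(r"\w+", message.lower()))
        if words & {variant.lower() for variant in self.variants}:
            return True
        return any(re.search(p, message, re.IGNORECASE) for p in self.patterns)


def syntagma_matches(syntagma, message):
    """Conjunction: every term of the syntagma must occur in the same message."""
    return all(term.matches(message) for term in syntagma)


def tree_matches(tree, message):
    """Disjunction: at least one syntagma of the tree must match."""
    return any(syntagma_matches(syntagma, message) for syntagma in tree)


def relevant_conversations(tree, conversations):
    """Return the conversations containing at least one matching message."""
    return [c for c in conversations if any(tree_matches(tree, m) for m in c)]


# Hypothetical term tree: (money AND time of day) OR weapon.
money = Term({"money", "cash"})
clock = Term(patterns=[r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b"])
weapon = Term({"gun", "pistol"})
tree = [[money, clock], [weapon]]

conversations = [
    ["hi there", "bring the cash at 9 pm"],
    ["see you at 9 pm", "ok"],
    ["he has a pistol"],
]
hits = relevant_conversations(tree, conversations)
```

Since a tree is represented here as a plain list of syntagms, the recursive combination \(\xi _{total}=\xi _0\vee \xi _1\vee ...\) corresponds to concatenating the lists of several trees.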

3 Term Recommendations

Probably the biggest challenge in creating the knowledge base is finding appropriate terms to generate the syntagms. The approach of a term recommendation engine as implemented in MoNA is shown schematically in Fig. 4.

Fig. 4. Structure of the Term Tree Recommender. The Recommender can provide recommendations with or without prior knowledge. The algorithm is based on an LDA over the entire communications network.

Here, topic modeling based on a term co-occurrence matrix using Latent Dirichlet Allocation (LDA) [4] forms the basis for the recommendation of terms. The optimal number of topics can be set arbitrarily or determined by probabilistic coherence as shown in [8]. If the term tree does not contain any terms, i.e., there is no prior knowledge, the investigator can choose the topic that is likely to best answer the current forensic question. Subsequently, the 50 terms that are most likely to be represented in the chosen topic are suggested. The term tree can then be supplemented with case-relevant terms from this selection.
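The recommendation without prior knowledge amounts to ranking the vocabulary by its probability in the chosen topic. The following sketch assumes that the topic-word distributions of an already fitted LDA model are available as a list with one dictionary per topic; the function name `top_terms` and the toy probabilities are illustrative and not taken from MoNA.

```python
def top_terms(beta, topic, k=50):
    """Return the k words with the highest probability in the given topic.

    beta: list with one dict per topic, mapping word -> probability.
    """
    distribution = beta[topic]
    return sorted(distribution, key=distribution.get, reverse=True)[:k]


# Hypothetical topic-word distributions of a two-topic model.
beta = [
    {"cash": 0.4, "money": 0.3, "meet": 0.2, "pizza": 0.1},
    {"pizza": 0.5, "dinner": 0.3, "meet": 0.15, "cash": 0.05},
]
suggestions = top_terms(beta, topic=0, k=2)
```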

Much more interesting is the case that some words or syntagms are already contained in the term tree. These entries can, for example, be based on knowledge from the interrogations of witnesses or suspects, but also on recommendations without prior knowledge, as just discussed. Griffiths et al. [7] explain how the LDA can be used to predict whether \(w_{n + 1}\) is the next word in a sequence \(s = w_1,w_2...w_n\) of words. However, they point out that the words in s do not necessarily have to follow each other in the text, but that it can also be an unordered set of terms [6, 7].

In this work, s consists of the already known, relevant words in the term tree. For another word \(w_{n+1}\), a prediction is made whether it is associated with the existing important words in s or whether it is thematically similar to them. The only requirement imposed on the set of words in s is that the contained terms have a high probability in the same topic [7]. This applies to the existing words in the term tree at least if they were determined from one of the case-relevant topics. The probability that \(w_{n+1}\) is associated with s can be formulated as a conditional probability \(P(w_{n+1} \mid s)\) [6] and calculated using Equation 2.

$$\begin{aligned} P(w_{n+1} \mid s; \beta ) = \frac{\sum _t \prod _{i=1}^{n+1} \beta _{w_i}^{(t)}}{\sum _t \prod _{i=1}^{n} \beta _{w_i}^{(t)}} \end{aligned}$$
(2)

The \(\beta\)-matrix computed during LDA is now needed for prediction because it contains the \(\beta _{w_i}^{(t)}\), i.e., the probability of the word \(w_i\) in topic \(t\) [7]. If \(w_{n+1}\) is thematically similar to the words in s, it has a high probability in the same topics as the words in s, which leads to a higher value for \(P(w_{n+1} \mid s; \beta )\) [7]. Similar to the recommendations without consideration of prior knowledge, the 50 terms with the highest thematic similarity to prior knowledge are suggested to the investigator. Both approaches generally lead to similar recommendations, but incorporating prior knowledge provides more specific terms, at least in part.
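Equation 2 can be evaluated directly from the \(\beta\)-matrix. The sketch below assumes that the matrix is given as a list with one dictionary per topic, that all topics share the same vocabulary, and that every known word has a non-zero probability in at least one topic; the function names `association` and `recommend` and the toy values are illustrative assumptions.

```python
from math import prod


def association(candidate, known, beta):
    """Evaluate Equation 2: P(w_{n+1} | s; beta).

    candidate: the word w_{n+1} to be scored.
    known:     the words s already contained in the term tree.
    beta:      list with one dict per topic, mapping word -> probability.
    """
    # Product of the known-word probabilities, computed once per topic.
    per_topic = [prod(topic[word] for word in known) for topic in beta]
    numerator = sum(p * topic[candidate] for p, topic in zip(per_topic, beta))
    return numerator / sum(per_topic)


def recommend(known, beta, k=50):
    """Suggest the k unseen words most strongly associated with the known words."""
    vocabulary = set(beta[0]) - set(known)
    scores = {word: association(word, known, beta) for word in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical beta-matrix with two topics over a four-word vocabulary.
beta = [
    {"cash": 0.4, "money": 0.4, "pizza": 0.1, "dinner": 0.1},
    {"cash": 0.1, "money": 0.1, "pizza": 0.4, "dinner": 0.4},
]
score_money = association("money", ["cash"], beta)
score_pizza = association("pizza", ["cash"], beta)
best = recommend(["cash"], beta, k=1)
```

With "cash" as prior knowledge, "money" receives a higher score than "pizza" because it has a high probability in the same topic as the known word.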

4 Joint Semantic Analysis

Up to this point, the communication of a single communication channel, e.g., a service like WhatsApp or Telegram, across all participating devices, was considered. But communication in real life is not limited to one channel and especially not to one modality. The term modality is used to refer to a communication medium, such as text, image, audio or video. We have to take into account that people do not communicate by just writing texts. Rather, texts are interspersed with images, videos and voice messages. These different modalities can add important new information or repeat existing information. Especially in the first case, a common understanding of all the modalities used is necessary in order to correctly understand the communication and the information transmitted in it. Details of the approach described below have already been discussed by the authors in [21].

Aiming at explaining the coherent semantic content and hidden connections in a mobile communication consistently, we formally formulate the joint semantic analysis as follows:

$$\begin{aligned} \tilde{e} = {\text {argmax}}_{\theta } \tilde{P}(e|d_{cm}; \theta ) \end{aligned}$$
(3)

where e is the semantic context in the conversation data D, which is mostly represented by a topic and possibly connected to a concrete crime, \(d_{cm} \in D\) denotes a single message spread via the communication channel \(c \in D_c=\) {WhatsApp, Telegram, email...} and represented in the modality \(m \in D_m=\) {Text, Image, Audio, Video ...}. D is temporally and semantically coherent and chronologically structured. We use \(\theta\) to denote the set of parameters inferred during topic modeling that captures the latent semantics in the data.

The crucial task is to find an intermodal relationship that implies a semantic concept between different modalities and channels. Therefore, a textual representation first has to be determined for all non-textual modalities, i.e., the content of all multimedia data is mapped into a textual semantic space in order to extract topics. In this way, the entire communication space becomes searchable. Subsequently, the semantic linking (intermodal correspondence) can be determined by considering the entire context of the communication.

For image data, the traditional classification approach [23] or image captioning [11] can be used. The former delivers only discrete labels such as people or car, while the latter describes the coherent information of the image as a whole scene with a natural sentence, e.g., a man is holding a gun in a bank. Instead of focusing on describing the semantic content of an image, the semantic interpretations and the relations between image and text can be determined as shown in [13]. In future research, a scene graph will be extracted to determine how it contributes to the understanding of a conversation [22]. Similarly, a video can be translated to a textual representation, i.e., a natural sentence describing its content [9]. Audio data can be transcribed into text by means of Automatic Speech Recognition (ASR) [3]. Once the semantic textual representation of the multimedia data is available, the coherent semantic topics of the data can be extracted by using LDA.
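The mapping of all modalities into a common textual space can be sketched as a simple dispatch over modality-specific converters. The converters themselves (captioning, video description, ASR) are deliberately left as caller-supplied functions; the stand-in functions, file names and function names below are hypothetical placeholders and do not represent any particular model or MoNA component.

```python
def to_text(modality, payload, converters):
    """Map one message to text; text messages are passed through unchanged."""
    if modality == "text":
        return payload
    if modality not in converters:
        raise ValueError(f"no converter registered for modality {modality!r}")
    return converters[modality](payload)


def build_corpus(messages, converters):
    """Turn a chronological list of (modality, payload) pairs into texts."""
    return [to_text(modality, payload, converters) for modality, payload in messages]


# Placeholder converters standing in for captioning and ASR models.
converters = {
    "image": lambda path: "caption for " + path,
    "audio": lambda path: "transcript of " + path,
}
messages = [("text", "hello"), ("image", "img_001.jpg"), ("audio", "voice_001.ogg")]
corpus = build_corpus(messages, converters)
```

The resulting list of texts preserves the chronological order of the conversation and could then serve as input to the topic extraction.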

5 Conclusion and Future Work

Analyzing communication data from mobile devices is a time-consuming and error-prone task. Current analysis applications support this process, but do little to reduce the effort. In this work, therefore, a process chain was presented that significantly reduces the analysis effort in three steps after extraction and recovery of the communication data.

In a first step, the messenger data from different devices is linked together, allowing the communication history to be reconstructed almost completely. As a positive side effect, deduplication significantly reduces the reading effort. The next step is to divide the chat history messages into temporally related conversations, which preserves the context of a message and reduces the need to match each relevant message. At the same time, the associated context preservation simplifies the interpretation of the results by investigators. In a final step, process-relevant conversations are filtered with the help of a semantic rule-based knowledge base, the so-called term tree, which again drastically reduces the search effort.

To reduce the effort of knowledge base creation, a two-stage recommendation system based on extracted topics and semantic coherence can be used. In particular, semantically and thematically similar terms are suggested for already existing case-relevant knowledge of the investigator. Since real communication usually spans different channels and modalities, all media data are mapped into a common text-based semantic space and in this way jointly included in the determination of case-relevant communication.

The proposed process chain has been implemented in a prototype application, the Mobile Network Analyzer (MoNA), and is currently being evaluated by various investigative agencies (see Fig. 5). A time-limited trial version is available for download (see footnote 1), along with instructions for obtaining unrestricted usage rights. Nevertheless, the current implementation is limited to the messengers WhatsApp, Facebook and Telegram. Future research will address the reverse engineering of further communication services. The results will be integrated into upcoming versions of MoNA at irregular intervals. Furthermore, the joint semantic analysis of different modalities currently places high demands on the hardware and is therefore not included in the trial version.

Fig. 5. Screenshot of the current MoNA version in which the presented concepts are implemented.

Future work needs to address empirical evaluation to highlight the superiority of the presented approach compared to existing solutions. Opportunities for further development exist primarily in the formation of user profiles through stylistic analyses across different communication channels.