1 Introduction

Over the years, the volume of data published on the web has grown steadily, driven by the many applications that generate and consume it. However, interoperating among these data has always been a challenge, and several publication patterns have been proposed (Algemili 2016; Neves et al. 2020; Yi 2019) as the foundation of what is called the semantic web. Today, motivated by search engines and other tools that benefit from better-described data, semantic web principles are increasingly being adopted by major content producers, such as news websites. The interest in sharing, consuming, and understanding web data has motivated the availability of datasets on the web, and dataset repositories have been created.

According to Hassan and Twinomurinzi (2018), the publication of open government data has received significant attention in the last few years. In the public sector, governments have shown great interest in publishing their information openly, following the example of the United Kingdom (Shadbolt et al. 2012), where the website data.govFootnote 1 gathers information of public interest, such as government transparency data. As another example, the European CommissionFootnote 2 published various datasets openly on its data portal to allow European institutions to access an array of open datasets and apply these data in new and innovative ways, thereby unlocking their economic potential. In other countries, transparency laws allow access to information. For instance, in Brazil, the information released by the government is accessed through the “Portal da Transparência”.Footnote 3 One of the main areas of government interest is public security, which lacks the specialized manpower and adequate investments needed to produce trusted national statistics (Cavalcanti 2017).

In this scenario, it is difficult to find quality data in public security, as in the case of missing persons. A survey carried out by Unit (2016) shows that approximately 400,000 incidents of missing persons were reported to the police in the UK between 2016 and 2017. The same occurs in Brazil, where another survey shows that eight people disappear every hour in the country (Caraffi 2017). That study, conducted throughout the Brazilian territory, evaluated the scenario of disappearance from 2007 to 2016. A total of 694,007 disappearances were registered across all states during those 10 years, that is, approximately 69,400 people disappear per year. Being a recent study, it emphasizes the frequency of missing person cases in the country, showing that this is an important phenomenon to be studied. It is also a challenging problem for countries with a large population, such as the United States, Brazil, or China.

Due to the high rate of disappearances and their different causes, adequate monitoring of missing persons is necessary, through documentation and registration of each case. The purpose of these registries is to serve as a tool to collect information that contributes to the search for and dissemination of missing persons. In Brazil, these registries are usually created through police reports. There are some initiatives of the Brazilian government to support the localization of missing persons through an online disappearance registry, such as that of the Ministry of Justice.Footnote 4 However, adherence is low, totaling, up to the writing of this article, 1206 registrations of missing and found persons. In general, the task of collecting and disseminating this information to society as a means of assisting in the solution of each case is delegated to entities such as Non-Governmental Organizations (NGOs), e.g. “Desaparecidos do Brasil”,Footnote 5 and regional police agencies, such as the Police Station of the National Register of Missing Persons.Footnote 6 However, in most cases, those organizations act separately, without integrating their data.

The main information disclosed includes gender, age, date of disappearance, region, and additional details provided by family members or acquaintances. However, when analyzing the information registered online, both on NGO and government websites, there is a great lack of data, including the absence of basic information such as the ethnicity or even the name of the disappeared person (Oliveira 2007). An example is the police records, one of the few tools available to help in missing person cases, which are subject to countless tabulation and filling problems that can lead to missing data (Oliveira 2007). Another reason is that no institution centralizes and standardizes the information (Claudiano 2014). Since each entity has its own set of rules to deal with the problem, information heterogeneity hinders the interoperability process between the data. Therefore, given the many cases of missing persons and the dissemination issues, it is evident that the lack of integrity in the records of missing persons compromises the understanding and disclosure of these cases. This highlights the need for better approaches to indexing and understanding the information of missing person cases from the distributed repositories on the web.

The proposal to integrate heterogeneous data repositories in a structured and unified manner is a promising standardization alternative, since it allows a faster and more comprehensive exchange of information. After this process, it is easier to understand how each case occurs and to extract insights into dealing with the missing person problem. Therefore, this work presents a framework for extraction, validation, grouping, and metadata attribution, following Semantic Web standards, to address the lack of quality of these data. The framework was used to gather Brazilian missing person data from NGO and governmental websites. To improve the understanding of the collected missing person cases, we submit the standardized data to machine learning algorithms to obtain new insights into the problem. In this way, we intend to help the entities responsible for solving missing person cases obtain new relevant information. Besides that, they can improve the quality of their repositories, enabling the development of other, more innovative applications that explore the semantics behind the data more efficiently.

This paper is organized as follows. Section 2 presents the related work. Section 3 describes the framework operation. Section 4 discusses the application of the framework to an actual missing person dataset and the results of this application. Finally, Sect. 5 concludes the work with the final considerations, and Sect. 6 outlines future work.

2 Related work

In Beno et al. (2009), the authors present a framework for extracting data from web pages and applying semantics to them. AgentMat runs agents, which are programs designed specifically to extract data from a particular website. The architecture has components defined via XML. Each component uses regular expressions to parse the pages, and it is also possible to assign categories to the extracted content. Like the present work, AgentMat also faces the problem of heterogeneity in the way content is published on the Web. The component-based approach proposes a standard for writing the extraction components and an execution sequence, encouraging users to work collaboratively. The categorization of information also resembles one of the processes proposed by this work. While AgentMat uses the metadata of the pages to generate the classification, this work proposes using the extracted data itself to assign new metadata, increasing the value of the information.

On the other hand, in Chaulagain et al. (2017) the authors propose a cloud-based web extractor for Big Data applications, which makes use of Amazon solutionsFootnote 7 so that the system can run in the cloud. The approach uses a queue to receive URLs to be processed and allows scraping engines to be allocated according to processing demand. Scraping engines use an HTML parser library to turn the page into a document object model (DOM), and element selectors are then applied to extract the required content from the DOM. Content is formatted according to configuration, filtered, and stored in a database. The DOM representation allows the developer to navigate through the tags and collect only the relevant information.

In Yang et al. (2010), the authors also deal with the collection and structuring of data for its application. Their work presents a climate data system designed to collect data automatically from different sources, concentrating the information in a central repository and delivering it through a web interface. The system architecture is divided into modules for processing predefined pages and collecting data that is generally accessible via FTP within the selected set of sources. A fundamental part of the work is knowing the structure of the reference website so that the data can be queried. In Gogar et al. (2016), the authors automatically collect data from web pages and create a model combining textual and layout information, avoiding the need to build a new extraction wrapper for each new web page. To handle the heterogeneity of all those pages, the authors use deep neural networks to learn a wrapper that can extract information from previously unseen pages. Because the model requires training, this approach is applicable to specific domains. On the other hand, the variety of data formats and the poor quality of the data available on NGO and governmental agency websites make it difficult to use automated techniques.

The creation of solutions to government data problems has already been explored in the literature. For example, in Tayal et al. (2015) the authors used data mining techniques to understand and detect crimes and criminals in India. Also, in Chi et al. (2020) the authors explored semantic web networks for an ethnic understanding of Taiwanese indigenous peoples. For the scenario of missing persons, Jadhav et al. (2017) present an IoT solution for the detection of missing persons. Unlike these approaches, in this paper we explore the use of classification models in the missing person scenario. Although such models are usually employed for outcome prediction, here they are used to obtain new insights into the data.

In the context of this work, the publication of structured content in a free and accessible way is fundamental for intelligent applications to make use of this content. One initiative to encourage the growth of free data sharing is Linking Open Data (LOD). The objective of the project is to promote the development of the web of data by providing datasets that are available under open licenses, converting them to RDF according to the principles of linked data, and publishing them on the web (Bizer et al. 2009). In this direction, Assaf et al. (2015) propose a scalable automatic approach for extracting, validating, correcting, and generating descriptive linked dataset profiles. The system includes indexing the dataset, recommending domain vocabularies for the metadata, identifying the knowledge domain to which the dataset belongs, and generating the profile of the dataset composed of descriptive and structural metadata. In the present work, we use a set of techniques to enrich the data, and the framework used here can be helpful in other domains.

3 Proposed data collection framework

This section presents a framework for extraction, validation, grouping, and metadata attribution that simplifies the extraction and unification of data into domain-specific repositories. The framework consists of a set of tasks for structuring data, semantic attribution, and data normalization using the mediator/wrapper approach (Ozsu and Valduriez 2011).

The mediator acts as a specific collector that manages the whole process of querying external data. It decides which tasks (wrappers) must be assigned to each of the input data. The wrappers have the role of translating requests made by the mediator, processing and collecting each result of the request in the selected local database. Thus, each collector works as a process manager and each wrapper as a parallel production activity. This process, divided into three steps, is illustrated in Fig. 1.

Fig. 1 Overview of the data collection, normalization, and metadata attribution processes

The extraction process is performed using data scraping. For each website, a specialized extractor is defined that retrieves information through XPath queries or regular expressions. Extractors must inherit from an abstract class with specific methods provided by the solution. When defining an extractor, the developer specifies which data the extractor will generate and which tasks that data must pass through. The definition of each collector is made using an XML document, which is read and instantiated by the framework.

The data recovered by the extractor passes through a set of tasks defined by the developer. A task's purpose is to extract information, enrich the data, or perform pre-processing (e.g., stemming, unit conversion). The framework has a set of predefined tasks that can be used by any collector, such as the normalizer and the semantic annotator. Figure 2 shows different tasks defined for distinct data within a collector.

Fig. 2 Collector process

Each task performs an atomic action on the data, and the user can queue up different normalization tasks in their collector. Normalization tasks perform syntactic changes in the data, such as capitalization, data encoding processing, synonym substitution, abbreviation resolution, etc. In addition to normalization tasks, enrichment tasks can be used. By using metadata generation methods, one can infer data that is not necessarily explicit. In this context, natural language processing (NLP) methods, optical character recognition (OCR), or any other semantic aggregation tool or technique can be applied, as the user's application of the framework requires. Pre-programmed enrichment tasks are available to the user for OCR, which makes use of Tesseract,Footnote 8 for NLP tasks such as named entity recognition and POS tagging, using Apache OpenNLP,Footnote 9 and for semantic annotation, which can use DBpedia Spotlight.Footnote 10 In addition to the available tasks, the user can write custom tasks in PHP by creating a class that implements the task interface, where a task receives a collection of texts and must also produce a collection of texts.

Finally, the indexed information is stored in RDF format. The index can be used to search and access information sources or to collect additional information. In this step, the data already has a semantic representation of the attributes using vocabularies of the user’s choice. The default framework storage is AllegroGraph,Footnote 11 but new connectors for other RDF-based (or other NoSQL) databases can be included in the framework.

When working with large volumes of data, there is a high probability of finding duplicated data. Therefore, it is necessary to treat and verify these duplicate elements before inserting them into the database. To avoid duplicates, the user can define, for each extractor, the set of properties that makes up the identifier of a record. Before inserting a new record into the database, its existence is queried using the properties defined as the identifier (a sketch of this upsert logic follows the list below), where:

  • If there is no record with the same identifier, a new object is created, and all the metadata related to it is persisted.

  • Otherwise, the new metadata to be inserted is checked.

  • If they already exist in the database, the duplicates are not inserted.

  • If not, the new data is persisted and referenced to the existing registry object, increasing the metadata for that registry.
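A minimal sketch of this upsert logic is given below in Python against a generic store client. The `store` interface, its method names, and the `id_properties` argument are illustrative assumptions; the framework itself implements this step in PHP over its RDF storage.

```python
def upsert_record(store, record, id_properties):
    """Insert a record, merging metadata when the identifier already exists."""
    # Build the identifier from the properties chosen for this extractor,
    # e.g. ("name", "disappearanceDate", "disappearancePlace").
    identifier = tuple(record[p] for p in id_properties)

    existing = store.find_by_identifier(identifier)  # hypothetical lookup call
    if existing is None:
        # No record with the same identifier: create a new object
        # and persist all the metadata related to it.
        store.create(identifier, record)
        return

    # Otherwise, check each new metadata value before inserting it.
    for prop, value in record.items():
        if value in existing.values_of(prop):   # duplicate metadata: not inserted
            continue
        # New data: persist it and reference the existing registry object,
        # increasing the metadata for that registry.
        store.add_value(existing, prop, value)
```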

3.1 Unified data repository of missing persons

One difficulty in finding missing persons is locating channels that disseminate information about them. In general, these data are made available by NGOs and public security agencies that attempt to share this information with the public. However, poor data quality, missing information, website formats, and low popularity, among other problems, make it difficult for the data to be viewed and shared by users. Part of the problem may be the low budget of these organizations, which cannot always afford better-quality computational solutions.

In this case, there are problems of duplication and completeness of data, where more than one source reproduces the same information and data that could be complementary across different websites are disclosed separately. Another problem is how the HTML of the outreach websites is structured. Usually, information is presented in single blocks without identifying which data is being presented, or is presented as images and PDF documents, making it difficult to collect and identify items. To create the unified index, an extensive web search was conducted for websites that contain data on missing Brazilian persons.

Initially, the structure of each website and the types of data offered were studied to make it possible to standardize the data collection. After this analysis, 20 different properties were identified in the registration of disappearance occurrences: name, age, date of birth, date of disappearance, status (whether the missing person has already been found or not), gender, height, weight, skin tone, nickname, disappearance status, disappearance city, hair color, eye color, disappearance place, disappearance circumstance, localization date (for missing persons who reappeared), missing person photo, disappearance feature, and additional data about the disappearance. The source from which the data was extracted was also recorded to preserve the origin of these data. For each source, a collector was instantiated in the framework, which generates data linked to properties of well-known ontologies, such as FOAFFootnote 12 and the DBpedia Ontology.Footnote 13

In the scenario of missing persons, the data are disseminated heterogeneously across the websites. In the selected sources, the information is presented either as natural language text without metadata for easy identification or in a key-value-like format, also without the use of metadata. In some cases, the information is presented as images or PDF documents. In general, the information is presented in such disparate formats that automated data identification solutions, such as Gogar et al. (2016), are not applicable or would require building a training base so large that they become unworkable. To address these issues, the following tasks were used:

Regular expression (REGEX): As we are working in a scenario with a lot of natural language text without any tabulation patterns, it is common to find different types of properties mixed in a single text field (Bui and Zeng-Treitler 2014). To minimize this problem, we created REGEX rules to extract as much as possible of the information mixed in these fields. With this process, we were able to find height and weight in the same field or extract information about hair and eye colors and skin tone, as illustrated below.
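As an illustration of such rules, the sketch below uses Python regular expressions to pull height (in metres) and weight (in kilograms) out of a Portuguese free-text description; the patterns are simplified assumptions, not the exact rules used by the collectors.

```python
import re

# Simplified patterns: heights such as "1,70 m" or "170 cm", weights such as "65 kg".
HEIGHT_RE = re.compile(r"(\d(?:[.,]\d{2})?)\s*m\b|(\d{3})\s*cm\b", re.IGNORECASE)
WEIGHT_RE = re.compile(r"(\d{2,3})\s*kg\b", re.IGNORECASE)

def extract_height_weight(text):
    """Extract height (metres) and weight (kg) mixed into a single text field."""
    height = weight = None
    h = HEIGHT_RE.search(text)
    if h:
        height = (float(h.group(1).replace(",", "."))
                  if h.group(1) else int(h.group(2)) / 100.0)
    w = WEIGHT_RE.search(text)
    if w:
        weight = float(w.group(1))
    return height, weight

print(extract_height_weight("Aproximadamente 1,70 m e 65 kg, cabelos castanhos"))
# -> (1.7, 65.0)
```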

Semantic annotation: In the analyzed sources, some records of disappearances had descriptive fields with additional reports on the timing and form of the disappearance. This information was described in natural language, without any metadata associated with the terms throughout the text. For example, instead of disclosing the place of disappearance as a field of the record, this information was agglomerated with several others, reducing the value of this data. To retrieve it, the collectors of these sources were instantiated to use the semantic annotation task. The task available in the framework uses DBpedia Spotlight, parameterized to process the text and return only place names. All names returned by the tool are related to DBpedia resource identifiers.
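The sketch below shows one way to call the public DBpedia Spotlight REST service from Python and keep only place annotations; the endpoint URL, the `types` filter value, and the confidence threshold are assumptions for illustration, not the exact parameterization used in the framework task.

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/pt/annotate"  # public endpoint (assumed)

def annotate_places(text, confidence=0.5):
    """Return (surface form, DBpedia URI) pairs for places mentioned in a report."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence, "types": "DBpedia:Place"},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

# e.g. annotate_places("Desapareceu no centro de Belo Horizonte, Minas Gerais")
```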

OCR/PDF parser: Because information needs to be easily inserted into pages and to be visually well presented, it is common for data to be released in compressed, easy-to-view formats such as PNG, JPEG, and PDF. This also happens with the websites of public agencies, as is the case of the www.policiacivil.pe.gov.br source. For this source, all data generated by the collector passed through the OCR task, which retrieved the data from the disappearance record. Regular expressions were then used to filter the correct data.
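A minimal sketch of such an OCR task, assuming the Tesseract engine with its Portuguese language pack and the `pytesseract` binding, is shown below; the field patterns are illustrative, since each source needs its own regular expressions.

```python
import re

import pytesseract
from PIL import Image

def ocr_missing_person_poster(image_path):
    """Run Tesseract over a poster image and pull a few fields out with regex."""
    text = pytesseract.image_to_string(Image.open(image_path), lang="por")

    # Illustrative patterns only; the collectors use source-specific rules.
    name = re.search(r"Nome:\s*(.+)", text)
    date = re.search(r"\d{2}/\d{2}/\d{4}", text)
    return {
        "raw_text": text,
        "name": name.group(1).strip() if name else None,
        "disappearance_date": date.group(0) if date else None,
    }
```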

Gender inference: One of the properties extracted by the collectors is the gender of the missing person. In some cases, this information is not provided by the source. To solve this problem, a task for gender inference was created. It uses the name of the disappeared person already collected and, through the 2010 Demographic Census, attempts to infer the gender. The official Brazilian census,Footnote 14 organized by IBGE,Footnote 15 provides the number of people that have a given name according to their gender. IBGE also provides an APIFootnote 16 for direct queries to the application database. With these data, it is possible to infer the gender of a missing person in a probabilistic way. A threshold of 0.9 was used: a person with name X is considered of gender Y whenever the number of people named X of gender Y divided by the number of people named X of all genders is greater than 90%. If this does not happen, or if the name does not appear in the IBGE database, no gender is inferred for that name.
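A sketch of this inference rule is given below, assuming the IBGE names API (the endpoint path and JSON field names are assumptions based on the public v2 service) and the 90% threshold described above.

```python
import requests

IBGE_NAMES_API = "https://servicodados.ibge.gov.br/api/v2/censos/nomes/{name}"  # assumed endpoint

def name_frequency(name, sex):
    """Total census frequency of a first name for a given sex ('M' or 'F')."""
    resp = requests.get(IBGE_NAMES_API.format(name=name), params={"sexo": sex}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if not data:
        return 0
    # Each entry carries per-period frequencies; sum them (field names assumed).
    return sum(period["frequencia"] for period in data[0]["res"])

def infer_gender(first_name, threshold=0.9):
    male = name_frequency(first_name, "M")
    female = name_frequency(first_name, "F")
    total = male + female
    if total == 0:
        return None            # name not found in the IBGE database: infer nothing
    if male / total > threshold:
        return "male"
    if female / total > threshold:
        return "female"
    return None                # below the 90% threshold: no gender is inferred
```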

Finally, we have to normalize the data extracted from the web. The normalization task provides functionalities for data standardization. This is the pre-storage step, in which all information is verified before being inserted into the database. All collectors use two simple normalization tasks in the scenario presented: (1) the characters are transformed to lowercase, and (2) all white space is removed at the beginning and end of the extracted data. Besides, in this scenario, a single missing person may appear on different websites. Thus, when entering the data of this same person, precautions must be taken to reduce duplicate data. Since these data lack a primary key, the properties name, disappearance date, and disappearance place of the individual were used as identifiers in the collectors.

3.2 Results of the proposed framework for data integration

After the extraction process, 11,751 records of missing persons were collected from the 15 selected repositories, a distribution of almost 800 disappearances per website. Given the statistics of missing persons in Brazil (Sect. 1), this analysis exemplifies how the entities that disseminate information on the Web cannot keep up with the intensity with which cases of disappearance take place in the country. Due to the lack of interaction between the sources, we found 743 duplicate records, which were treated by the framework. Among these duplicates, 163 could be used to complement registry information that was incomplete in the sources used. Table 1 shows the initial composition of the base after normalization and duplicate filtering. The dataset before normalization was collected in 2018 and is publicly available in XML formatFootnote 17 (Gomes and Souza 2020).

Table 1 Data before and after semantic enrichment

There are 11,008 unique missing persons. Of these missing person registries, 20 were recovered through specialized collectors that extract information from images. However, even with the collection of information from different sources, the lack of homogenization of the base is evident. To reduce this problem, metadata attribution processes were applied. To exemplify the improvement, before enrichment more than 70% of the records lacked gender data; these processes almost completely eliminated this gap. Also, for the age of the missing persons, it was possible to complete most of the data as the difference in years between the birth date and the disappearance date.

More information regarding cities, states, skin tone, weight, height, hair, and eye colors of missing persons can be extracted from natural language texts. Such information is also available in fields such as moreCharacteristics and additionalData. In the sources used, the reports of the disappearances are described in these fields, and they contain much important information in a single text. Thus, we use named entity recognition to extract only words that refer to places in order to infer where the missing persons disappeared. Regular expressions were also used to detect patterns in text. For example, “x.xx m” was adopted for height and “x.xx kg” for weight. Also, simple calculations were used to fill in missing birth dates and ages. Table 1 presents the final composition of the base. The growth of the base after the application of the metadata assignment tasks proposed in the framework of Fig. 1 is highlighted in boldface.

From the results presented in Table 1, the data grew by approximately 54% for the birth date property, 36% for the age property, 195% for gender, 0.6% for states, 28% for cities, 5% for hair color, 5% for eye color, 8% for height, and 4% for weight. With these new data volumes, the overall growth of the data was over 11.33%. This made the base more homogeneous, with consistent, correlated, and centralized information, improving the way disappearance cases are viewed, disseminated, and understood. Finally, the data generated by the framework application is made available for use by other applications through a SPARQL endpoint and for browsing on a website.Footnote 18 SPARQL queries can be issued against the endpoint to understand how the missing person data are distributed in the unified index.
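For example, a query such as the one below counts registered cases per city of disappearance. It is a sketch only: the endpoint URL is a placeholder and the exact property URIs depend on the FOAF/DBpedia terms chosen during collection.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/missing-persons/sparql")  # placeholder endpoint
sparql.setQuery("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT ?city (COUNT(?person) AS ?cases)
    WHERE {
        ?person a foaf:Person ;
                dbo:city ?city .
    }
    GROUP BY ?city
    ORDER BY DESC(?cases)
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["city"]["value"], row["cases"]["value"])
```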

4 Data mining analysis on missing person cases

The scenario of missing persons still needs a better understanding of each case. Currently, several research areas work with data visualization and knowledge discovery. In the data mining and data warehouse areas, techniques and concepts are developed to understand massive sets of data, exploring their behavior to create useful knowledge and, if needed, to predict the behavior of these data in the future (Han et al. 2011). In this paper, we explored classification models in the missing person scenario. Even though these models can be used for outcome prediction, the main goal here is to extract information that can help close and understand disappearance cases, not to classify whether a person will be found or not. With the results of applying these techniques in the missing person scenario, it is possible to identify the main trends, making it viable to take strategic decisions based on the knowledge generated.

After the unification and semantic enrichment of the missing person cases extracted by the framework, we used interpretable classification models to mine the information contained in the dataset and to understand how the cases are divided. The main reason for this is the discovery of patterns among the disappearances, to try to help public agencies and NGOs solve these cases. To do so, we created a data mining pipeline, presented in Fig. 3, divided into three main steps: (1) dataset preprocessing, (2) model training and selection, and (3) results analysis. The following sections describe each step of the process.

Fig. 3 Data mining pipeline

4.1 Data preprocessing

The purpose of data preprocessing is to complete and prepare the data for the application of knowledge discovery in databases (KDD) techniques (Fayyad et al. 1996). The dataset used in all the data mining processes is the linked database resulting from the framework process (Table 1). Besides the tasks already performed by the framework, in the preprocessing step some additional tasks were performed, such as outlier detection, treatment of missing data, and removal of correlated attributes.

First, we analyzed all the sources and the features that each one contains. Since in the following steps we complete the missing data using the entire dataset, and since we are working with natural language texts, we removed all sources with too many missing features to prevent future noisy data. One source had to be removed from the KDD process due to the lack of information extracted from this website (most of its features were not informed).

For the attributes gender, skin tone, state, hair color, and eye color the textual data treatment was carried out to set them as categorical features. For all the instances that did not have one of these features informed, we created an additional category named NF (not informed).

The status feature was transformed into a binary attribute. So, the instances that have the value of the status feature equal to 0 are people that are still missing and the cases that have the value of the status feature equal to 1 are people that were found.

For height and weight, it was noted that the outliers were typos made during the registration of the field. Because of that, they were treated through the standardization of the values in the International System of Units (Thompson and Taylor 2008). For example, height values can be displayed in centimeters (e.g., 170 cm) or in meters (e.g., 1.7 m), depending on the website from which the information was extracted.
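A minimal sketch of this standardization is shown below; the heuristic that readings above 3 are in centimetres and the plausible weight range are assumptions, since the exact outlier rules are not detailed here.

```python
def standardize_height(value):
    """Convert a raw height reading to metres (SI); values above 3 are assumed to be cm."""
    # e.g. 170 -> 1.70 ; 1.7 -> 1.7
    return value / 100.0 if value > 3 else value

def standardize_weight(value):
    """Keep weights in kg; readings outside a plausible range are treated as typos."""
    # Assumed plausibility range; implausible values are dropped and later imputed.
    return value if 2.0 <= value <= 300.0 else None
```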

To complete the remaining missing numerical values, we used the k-nearest neighbors (KNN) algorithm to impute them (with K = 5). After that, all the height, weight, and age values were complete. All instances with non-null height, weight, and age data were used to analyze the imputation procedure. We performed a cross-validation with five folds. The average mean absolute error (MAE) was 12.1, 29.3, and 8.2 for age, height, and weight, respectively. It is worth noting that data quality is a problem in Brazilian missing person records, and the original data presumably contains wrong values. For instance, one instance shows a 12-year-old person with 158 kg and 54 cm. The predicted value for weight was 58.6 kg, which seems more reasonable.
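The sketch below shows one way to reproduce this step with scikit-learn: a KNN imputer with K = 5 and a 5-fold check that hides known values of one numeric column and measures the mean absolute error of their reconstruction. The exact protocol of the original experiments may differ.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def impute_numeric(X):
    """Fill missing age/height/weight values with a 5-nearest-neighbour imputer."""
    return KNNImputer(n_neighbors=5).fit_transform(X)

def evaluate_imputation(X_complete, column, n_splits=5, seed=42):
    """Hide one numeric column on held-out rows and measure the imputation MAE."""
    maes = []
    for _, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X_complete):
        X_masked = X_complete.copy()
        X_masked[test_idx, column] = np.nan          # hide the true values
        X_imputed = impute_numeric(X_masked)
        maes.append(mean_absolute_error(X_complete[test_idx, column],
                                        X_imputed[test_idx, column]))
    return float(np.mean(maes))
```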

Finally, the last normalization step consisted of one-hot encoding of the categorical attributes and re-scaling of the numerical data between 0 and 1 with the min–max scaling algorithm, to prevent any problem with feature scaling.
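A compact way to express this step with scikit-learn is sketched below; the column names are illustrative stand-ins for the features described above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["gender", "skinTone", "state", "hairColor", "eyeColor"]  # illustrative names
numerical = ["age", "height", "weight"]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("minmax", MinMaxScaler(feature_range=(0, 1)), numerical),
])
# X_encoded = preprocess.fit_transform(df[categorical + numerical])
```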

All the other attributes that were not previously normalized were discarded from the KDD process. After that, the final dataset was composed of 54 features and 7949 missing person cases, where 756 were found or reappeared (status = 1) and 7193 were still missing (status = 0). In the following sections, we use REAP to refer to the class of people that were found or reappeared and MISS to refer to the class of people that are still missing.

4.2 Model selection

We extended the pipeline presented in Fig. 3 for better visualization, as the model selection flow is not simple to understand. Figure 4 shows the flowchart of the entire process of data separation, training, hyper-parameter tuning, and model evaluation. To help in the scenario of missing persons, logistic regression (LR), decision tree (DT), and random forest (RF) models were selected to be evaluated in this pipeline. They were selected because these models allow a greater degree of interpretation of the cases.

Fig. 4 Model selection workflow. Step 1: dataset separation; steps 2 and 3: stratified cross-validation; step 4: grid search; step 5: oversampling; step 6: selection of the best hyper-parameter configuration; step 7: test evaluation; step 8: best model selection

We first separate the training and testing sets (step 1). The preprocessed base was divided into two subsets to create the training and test sets. One subset was composed of all the instances that have some numerical feature imputed by KNN in the normalization process or any NF feature, and the other contains all the instances that have no imputed data or NF features. This separation was done to ensure that models trained on imputed data can predict data without any modification and that there is no bias during the model training process due to the imputed features. The subset containing only data without any imputation was composed of 1672 instances (21% of the total dataset), 159 from the REAP class and 1513 from the MISS class (we use the term “Not Imputed dataset” to refer to it). The remaining 6277 instances (79% of the total dataset) composed the second subset (the “Imputed dataset”).

We then performed a cross-validation step to split the data (steps 2 and 3). To keep an equal proportion between the target classes, we performed stratified splits. Two stratified K-folds (SKF) with 10 splits each (K = 10) were used, since we have two datasets, one for the Imputed dataset and the other for the Not Imputed dataset. In the end, the imputed SKF and the not imputed SKF were merged, so that 100 fold executions were performed at the end of the process.

As the imputed data is large, we discarded part of the Imputed SKF. This reduces the difference in the proportion of training data and, as a consequence, the bias caused by the imputed data, while also generating variability in the training data. Finally, to ensure that the training step has no bias due to the imputed data, the testing set was composed only of not imputed data.

As we are working in an imbalanced scenario, the metrics must be chosen carefully so that the imbalance of the target classes does not mask any results.

As model evaluation metrics, we selected the F-measure, F2-score, recall, balanced accuracy, area under the curve (AUC), and Matthews correlation coefficient. To select the best classifier configuration, we performed a hyper-parameter optimization of the decision tree and random forest classifiers with a grid search (GS) approach (step 4). The training set was divided into two subsets to perform the cross-validation during the GS: the GS training set and the GS validation set.

To balance the target classes, an oversampling process was carried out on the GS training dataset (step 5). To do this, the minority class (REAP) was synthetically replicated using the synthetic minority over-sampling technique (SMOTE). SMOTE is a technique to deal with imbalanced datasets by drawing a line between instances of the minority class and introducing synthetic examples along those line segments (Chawla et al. 2002). After this process, the ratio between the two target classes in the GS training set was established at 50/50.
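A minimal sketch of this balancing step with the imbalanced-learn implementation of SMOTE is shown below; the variable names for the grid-search training split are placeholders.

```python
from imblearn.over_sampling import SMOTE

# Oversample only the GS training split, never the validation or test data,
# so that synthetic REAP examples cannot leak into the evaluation.
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X_gs_train, y_gs_train)
# After this call, the REAP/MISS ratio in the GS training set is 50/50.
```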

During the hyper-parameter optimization, we varied the criterion and min samples leaf parameters of the DT and the bootstrap, max depth, and n estimators parameters of the RF over a range of values to select the best model of each grid search step (see Table 2). The best classifier parameters (see Table 3) were selected based on the F2-score during the grid search (step 6). We used the F2-score to give greater weight to the recall score.
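The sketch below illustrates this selection for the random forest with scikit-learn, using an F2 scorer built from `fbeta_score`; the parameter ranges shown are placeholders for the values actually listed in Table 2.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

f2_scorer = make_scorer(fbeta_score, beta=2)  # gives greater weight to recall

param_grid = {                     # placeholder ranges; see Table 2 for the real ones
    "bootstrap": [True, False],
    "max_depth": [5, 10, None],
    "n_estimators": [100, 300],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring=f2_scorer, cv=5)
# search.fit(X_balanced, y_balanced); the selected model is search.best_estimator_
```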

Table 2 Configuration used here for Grid Search
Table 3 Hyperparameters selected by GS

Finally, we evaluated the model selected in the Grid Search in an unseen test set to validate the best model using the previously presented metrics (step 7). This unseen test set is only composed of data that did not have any imputed value, that is, data on disappearance without any changes. The best models were obtained (step 8) and evaluated statistically and interpretively (Sect. 4.3).

4.3 Analysis of the predictive results

This section presents the results obtained by the data mining process. We separated the results into two parts: model performance and interpretive evaluation. In the first, we present a numerical analysis of the models concerning their predictive performance and, in the second, we present the interpretation and discussion regarding the main features observed in the missing person cases.

4.3.1 Model performance

The performance evaluation was carried out to assess the predictive ability of each model. Figure 5 shows the boxplot of the 100 tests performed during the cross-validation step. All models achieved results above 0.5 for all the metrics, and the boxplots present a small standard deviation. However, the decision tree presented some outliers, showing that its generalization ability is lower than that of the other classifiers.

Fig. 5 Boxplot of the results obtained by DT, LR, and RF concerning the six performance metrics considered here

The balanced accuracy and F2-score show that the models can predict both classes in similar proportions. This is also confirmed by the confusion matrices, where all the models have low proportions of false negatives and false positives (Fig. 6). It is necessary to emphasize that the high recall values are also relevant for the problem of missing persons, as they show that the models have a high capacity to retrieve relevant cases.

Fig. 6 Confusion matrices of the three models

A statistical evaluation was performed to analyze the significance of the results. We used the SciPyFootnote 19 package to perform this step. For all the statistical evaluations, we adopted a p value < 0.05 to reject the null hypothesis. Initially, we evaluated whether the populations are normally distributed. As a result, none of the metrics presented a normal distribution, and thus non-parametric tests were performed. We computed the Kruskal–Wallis H test to evaluate whether the medians of all groups are equal. The null hypothesis was rejected (p values ≤ 1.519e−09) for all the metrics, and we applied the Dunn post-hoc test to detect in which cases the differences occur. Figure 7 presents a heat map of the Dunn test for pairwise comparisons. Only the decision tree and random forest classifiers showed no difference between their populations, as the p value is above 0.05 for the balanced accuracy, recall, and F2-score metrics.
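A sketch of this statistical pipeline is given below. The normality and Kruskal–Wallis tests come from SciPy, as in the paper; the Dunn post-hoc test is taken here from the `scikit-posthocs` package, which is an assumption, since the paper does not name the implementation used.

```python
import numpy as np
import scikit_posthocs as sp          # Dunn post-hoc test (not part of SciPy itself)
from scipy import stats

def compare_models(scores_dt, scores_lr, scores_rf, alpha=0.05):
    """Normality check, Kruskal-Wallis H test, and Dunn post-hoc over fold scores."""
    groups = [np.asarray(scores_dt), np.asarray(scores_lr), np.asarray(scores_rf)]

    normal = all(stats.shapiro(g).pvalue >= alpha for g in groups)
    h_stat, p_value = stats.kruskal(*groups)

    dunn = None
    if p_value < alpha:                 # at least one median differs
        dunn = sp.posthoc_dunn(groups)  # pairwise p values as a DataFrame
    return {"normal": normal, "kruskal_p": p_value, "dunn": dunn}
```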

Fig. 7 Dunn-test results

All three classifiers achieved good results in the performance evaluation. However, the logistic regression had the lowest performance with respect to the six metrics considered here (Fig. 5). The confusion matrices (Fig. 6) also showed that the best decision tree achieved good prediction scores for both target classes. Furthermore, according to the statistical evaluation, the decision tree and random forest obtained statistically similar results.

4.3.2 Interpretive evaluation

According to Caraffi (2017), about eight people disappear in Brazil every hour. Due to the high number of missing persons in this country, public agencies and NGOs have been working to publicize missing person occurrences on the Web to find a faster solution to such cases. However, the lack of standards and structure, the poor quality of the data disclosed, and the low visibility of these repositories prevent this information from being widely consumed by the population. Also, it is difficult for public agencies to manage and understand the characteristics of missing person cases and how to minimize this problem. This emphasizes the need for works that present a more anthropological study of missing person cases (Calmon 2019).

This study performed a data mining process to fill the mentioned gap. The selected classification models support an interpretive evaluation and produce importance values for each feature. In this sense, we selected the top 15 features of the models, and Table 4 presents the feature importance values grouped by technique. We also analyzed the main characteristics of the decision tree after pruning it. The final tree is presented in the Appendix.

Table 4 Top 15 feature importance for the best model of each classifier

Age is the most important feature for all the methods tested here. The disappearance of a person can be related to several age-dependent factors. According to Biehal et al. (2003), Bureau (2017), and Oliveira and Vieira (2017), in the case of children, the main occurrences involve child trafficking, organ sales, slave labor, prostitution, pedophilia, and illegal adoption. In adults, factors such as chemical dependency and abstention from daily responsibilities, such as involvement with debts and troubled relationships, contribute to the disappearance. In Bureau (2017), Ferreira (2013), and Sanford and Ibrahim (2012), the authors discuss the disappearance of the elderly due to memory loss (related to factors such as Alzheimer’s disease), the flight of adolescents and children from home because of mistreatment or personal dissatisfaction, and also cases of accident victims whose bodies are not identified (Silva et al. 2009). However, the Public Ministry of Rio de JaneiroFootnote 20 indicates that most localization occurrences result from the voluntary returnFootnote 21 of the missing person.

It is important to notice that the decision tree showed that persons who disappear at 18 years of age or younger have a higher chance of being found. The same information is present in the National Register of Missing Children and Adolescents,Footnote 22 a Brazilian government site. Also, age, height, and weight are strongly related to people’s growth and the changes they undergo as they get older.

The state of residence of a missing person is another important feature. This is related to the public policies of each state to help with the missing person problem. For example, some Brazilian states such as Rio de Janeiro (RJ), São Paulo (SP), and Minas Gerais (MG) have more governmental and NGO initiatives to collect and spread information about missing persons. SOS missing children, Missing persons MG,Footnote 23 and the Civil Police of São PauloFootnote 24 are examples of these initiatives. Also, some states such as São Paulo have their own initiatives to understand the profiles of missing persons in their regions (Poliano et al. 2016). Social networks, which are used by some of these initiatives, also play an important role in the process of disseminating information and finding missing persons (House 2019). However, this potential can be lost, as the missing person scenario has no system that selects the best candidates to disseminate the information, as in other scenarios (Hu et al. 2018; Lee et al. 2015).

Skin tone and hair color are other important features. People with black and brown skin tones are more likely to be found or to reappear than other people. However, the Public Ministry of Rio de Janeiro indicates that almost twice as many people with these characteristics disappeared between January 2013 and February 2018. Finally, it is necessary to remember that gender is another important piece of information for trying to predict whether a person could reappear, considering the people who were found in the base.

5 Concluding remarks

Due to the high rate of disappearances and their different causes, adequate monitoring of missing persons is necessary, through documentation and registration of each case. It is difficult to find quality data in public security, as in the case of missing persons. In general, the task of collecting and disseminating this information to society to assist in the solution of each case is delegated to entities such as non-governmental organizations (NGOs) and regional police agencies. However, in most cases, those organizations act separately and without integrating this data. This highlights the need for better approaches to indexing and understanding the information of missing person cases from the distributed repositories on the web.

This work presented a framework to support the collection, structuring, standardization, and semantic enrichment of web data to create domain-specific repositories. As proposed, the framework allows the user to use existing tasks to improve data quality. In addition, the framework can be extended with new tasks specific to the user’s domain. To enhance the understanding of the collected data, we submitted the standardized data to data mining approaches.

The proposed framework was used in a real-world application with missing persons from Brazil. Fifteen websites were used as data sources, which contained 11,751 instances. The enrichment process increased the dataset size by about 11%. The collected data are publicly available to other users and intelligent applications, with the hope of assisting the competent authorities in developing public policies and staying alert to new civil disappearances.

The mining process showed that age has great importance. The disappearance of a person may be related to several age-dependent factors, such as slave labor, chemical dependency, involvement with debts, and memory loss (due to factors such as Alzheimer’s disease). Also, the Brazilian states with more governmental and NGO initiatives to collect and spread information about missing persons deal better with this problem. Skin tone and hair color are also important features according to our machine learning classifiers. It is worth mentioning that the most important features according to the classifiers are not conclusive, as the dataset lacks information that would allow us to carry out an in-depth investigation of data correlations. Therefore, those features may represent only chance correlations, and a larger and more detailed dataset is needed for a better analysis.

The tool developed in this work is availableFootnote 25 to the community. The main contribution of this work lies in the application of the techniques in a real and current scenario. In addition, it is envisaged that the free software community can contribute to improving the proposed framework.

6 Future works

We highlight some improvements that we intend to carry out as future work. For instance, the insertion of data deduplication techniques to help identify entities is important, as this is one of the major problems when integrating data from the web. Also, techniques for the inference of regular expressions can assist in the data scraping process.

Studies show that social networks play an essential role in visualizing data on missing persons. Thus, the intention is to develop an additional module that will act as a system for disseminating this information in social networks. Furthermore, we aim to explore methods that increase the visualization of the structured data generated by the framework.