1 Introduction

Open sources have existed for many years, but the explosive growth of the Internet and the World Wide Web (WWW) has motivated many cyber security professionals and researchers to publish journals and articles on cyber threats, cyber-criminal profiling, and information gathering (Amaro et al. 2018). Applying Artificial Intelligence (AI)/Machine Learning (ML) analysis to open-source data remains challenging. To use publicly available unstructured data effectively in cyber crime investigation, researchers are developing methods for identifying, gathering, and organizing it. The “OS” in OSINT stands for “Open Source”: the publicly accessible sources from which a user obtains information for intelligence purposes. Freely available information is the crucial component of OSINT. You do not need to be a hacker to use OSINT in your daily life, and you may already have used it without realizing it. All internet users apply OSINT tactics in some way, for example when searching online for a firm, school, university, or individual. It does not matter where the information is stored or whether it is obtained from social networks, photographs, videos, blogs, newspapers, or tweets, as long as it is public, accessible, and legitimate. Knowledge gained through OSINT can provide significant advantages in tasks such as profiling criminals, tracking employees of companies, and tracing organized crime. The availability of information and the methods for gathering it change continuously over time.

Previously, open-source collection focused on social events, open addresses, and interviews. Today, however, information is on the Web, and the methods for retrieving it are becoming far more sophisticated, innovative, and open to everybody. Owing to the proliferation of social networks and real-time information exchange, actionable intelligence can be gathered from public and unclassified sources. The growing significance of OSINT has intensified a debate among the military, the government, and the commercial sector about how intelligence should be gathered from multiple data sources. Challenges include acquiring, exploiting, and disseminating relevant data to address specific intelligence requirements (Nouh et al. 2019).

In recent years, several journals have published work on cybercrime, cyber security vulnerabilities, and the potential threats posed to individuals, businesses, and countries. Open source data has brought challenges in acquiring and maintaining the skills needed to use the heterogeneous information and electronic media available. Several publicly available tools and techniques can help cyber investigators examine cyber threats using OSINT (Revell et al. 2016). Various characteristics of cyber criminals can be collected using OSINT tools and techniques, and these characteristics help to profile the type of criminal (Edwards et al. 2022). OSINT is becoming an increasingly important discipline in criminal investigations as a technique for resolving several types of cases. Law enforcement agencies (LEAs) and investigation agencies rely primarily on open data for information verification, evidence gathering, and intelligence generation in various investigations, including money laundering, fraud detection, and human and arms trafficking.

A variety of government bodies, including military organizations and intelligence agencies, rely on the huge and up-to-date nature of open data to facilitate crucial decision-making and security measures such as sentiment analysis and riot control, disaster response, counter-extremism, misinformation measures, threat intelligence, incident response, blockchain and dark web monitoring, data leak detection, and social engineering countermeasures.

The remainder of the paper is structured as follows. The need for an OSINT review is described in Sect. 1.1. Where OSINT can be used is introduced in Sect. 1.2. The most recent OSINT applications are introduced in Sect. 2. The OSINT workflow is covered in Sect. 3. A thorough examination of the various OSINT tools and methodologies, as well as suggestions on how to select the right option for a particular task, is given in Sect. 4. Section 5 discusses present practices. The work is concluded in Sect. 6, which covers the challenges and future research directions. The contribution of the paper is summarised graphically in Fig. 1.

Fig. 1 Graphical overview of the paper

1.1 Need for OSINT review

The amount of information available online can be helpful to individuals and businesses. However, a lack of awareness and knowledge can cause harm when people form false beliefs based on unreliable information, which raises information security assurance issues. Owing to internet access and security vulnerabilities, the number of published studies commonly expands rapidly before they are assessed. Some studies give unclear, confusing, or conflicting results, or are available only in other languages. Based on these observations, we identified the following research gaps: a lack of clarity in generalizing the system security of other countries; a lack of OSINT-based solutions to cyber security and cyber defense problems; a lack of ways to use OSINT in robust and automated models; a lack of knowledge about the proper selection of tools, techniques, and processes based on data availability and target; and a lack of automatic, self-driven models for investigating cyber crime and cyber threats according to requirements. A systematic review of OSINT applications from a cyber security perspective, providing correct and unbiased information, is therefore necessary.

1.2 Where can open source intelligence be used?

Ethical hackers, penetration testers, and security experts use open-source tools and techniques to identify potential weaknesses in a system. Commonly found weaknesses include information exposed through social media, open ports or unsecured internet-connected devices, unpatched software (e.g., websites running old versions of common CMS components), and leaked or exposed assets such as sensitive code or data pasted unintentionally to public paste sites. Some of the significant fields where OSINT is used are law enforcement agencies (LEAs), governmental bodies, and corporate and cyber security.

1.2.1 LEAs

Traditional HUMINT (Human Intelligence) activities, for example inspectors walking the streets and knocking on people’s doors, have declined. OSINT use has increased massively because individuals spend so much of their lives online. This is especially helpful for criminal investigations ranging from human trafficking to money laundering.

1.2.2 Governmental bodies

Military groups use open data to carry out efficient investigations and for counter-espionage. National security agencies use OSINT methods to detect threat groups such as terror cells, maintain efficient incident response for riots and natural disasters, perform sentiment analysis, disprove rumors, and carry out many other civic responsibilities.

1.2.3 Corporate and cyber security

Companies are giving higher priority to information security because a single ransomware attack or compromise of corporate data can cause significant financial losses. OSINT frameworks play a crucial role in threat intelligence, including penetration testing and incident response, and can also be used to counter persistent issues such as social engineering-based attacks or simply careless data handling. Although OSINT strategies vary from case to case, there are fundamental methods for producing valuable intelligence throughout the data collection and processing phases. OSINT helps security experts tackle cyber threats, for example by identifying which vulnerabilities can be effectively exploited and by blocking imminent attacks. In cyber investigations, investigators typically need to identify and connect many data points to recognize cyber threats. For example, a single threatening tweet may not be a cause for concern, but if it is linked to a threat group known to be active in a particular industry, the tweet may be viewed with more suspicion. One of the most important things to know about OSINT is that it is used in conjunction with other intelligence subtypes. Information extracted from closed sources, such as internal telemetry, dark Web networks, and shared external intelligence networks, is typically used to verify and validate OSINT. The various applications of OSINT are shown in Fig. 2.

Fig. 2 Applications of OSINT (Khera 2021)

2 State of the art

This paper has tried to incorporate the findings from previous research related to OSINT (open-source intelligence) tools and techniques. To further understand and enhance progress in OSINT research, we have formulated five Research Questions (RQs) based on a comprehensive review of OSINT tools, techniques, and their applications. The main objective of these questions is to enhance our understanding of OSINT and its utilization.

  1. RQ1 What are the various categories of OSINT, and how can we utilize them? In previous studies, OSINT was used only for translation, exploitation, analysis, and dissemination, but with the advent of new technologies it can be used in various applications. OSINT is categorized according to application, such as Geospatial Intelligence, Signal Intelligence, Imagery Intelligence, Human Intelligence, and Social Media Intelligence. The details of these categories are given in Table 1.

  2. RQ2 What are the main strengths and weaknesses of different OSINT tools and techniques? This question aims to identify the strengths and weaknesses of various OSINT tools and techniques in terms of public availability, application, processing time, input type, and reliability score. The details of the tools and techniques are described in the tables of Sect. 4.

  3. RQ3 What benchmark research works are available that integrate OSINT with different fields to maximize the utilization of OSINT? This question focuses on the sources of published research, how closely they relate to OSINT, and whether they are regularly cited and preferred in previous studies. The details are described in Sect. 2.

  4. RQ4 Is there any mechanism or workflow for using OSINT to exploit publicly available data? Some research works, government guidelines, and reports have explored the OSINT workflow. Such a workflow is helpful for applying OSINT tools and techniques according to requirements. Our aim is to explore the relevant workflows on the basis of their characteristics and applications. The general workflow is shown in Fig. 5.

  5. RQ5 How can we utilize OSINT tools and techniques in social network analysis, opinion extraction, cyber security, counter-terrorism, cyber defense, cybercrime investigation, criminal profiling, surveillance, etc.? Several research articles, letters, magazines, and blogs have made extensive use of OSINT tools and techniques. They indicate that the applications of OSINT in various domains are useful and suitable for real-life scenarios (such as tracing criminals, counter-terrorism, tracking of ships/flights, and surveillance). The main aim of this question is therefore to identify frameworks that integrate OSINT with Machine Learning/Deep Learning/Artificial Intelligence and can give better results. The question also considers model results obtained without OSINT so that they can be compared with results obtained with OSINT. We address all of the above points in Sects. 2 and 4.

2.1 Research process

For the state-of-the-art review we used keywords such as OSINT, OSINT tools and techniques, cybercrime and organized crime detection using OSINT, countering cyber criminals using OSINT, threat intelligence using OSINT, application of OSINT in cyber defense, security intelligence, disaster management, social opinion extraction, sentiment analysis, malware analysis, vulnerability assessment, national security surveillance, and countering misinformation. Using these keywords, we searched the following electronic databases: Google Scholar, ACM Digital Library, Web of Science, ScienceDirect, and IEEE Xplore. We also searched for OSINT tools and techniques and for OSINT-related magazines, blogs, and newsletters. We included works related to OSINT, applications of OSINT, and possible domains in which to integrate OSINT. Other inclusion criteria were citation score, reputed journals/publications, and real-life scenarios. Figure 3 shows the analysis of the resources accessed and used to study OSINT trends in cyber security using various AI/ML/DL techniques.

Fig. 3 Overview of resources used for the analysis

2.2 Extracting social opinion and emotions

Aye and Aung (2020) proposed a model that uses the adjectives, negations, and intensifiers in a text to determine the user’s opinion. Their model works only for the Myanmar language. Kandias et al. (2017) conducted an experiment on 450 Facebook users to determine their level of stress. Yadu and Shukla (2020) used five different classification techniques to analyze the emotions contained in tweets from the Indian Air Asia Service. Prabhakar et al. (2019) used AdaBoost (an ensemble method), which integrates other classifiers to build a robust classifier; the precision of the model is 84.5%. According to Wadawadagi and Pagi (2020), the use of deep neural networks can be extended to other sentiment analysis tasks. These tasks include partnership categorization, machine translation, question answering, and subject recognition. Other tasks such as polarity detection and opinion symbols are also included in sentiment analysis. Hashida et al. (2018) proposed a model based on distributed multichannel representation to enable a hybrid interpretation of text data.
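The idea of combining opinion words with negations and intensifiers can be illustrated with a minimal lexicon-based sketch. The tiny English lexicons, the weights, and the function name `opinion_score` are illustrative assumptions; they are not the resources or rules of Aye and Aung (2020), whose model targets Myanmar text.

```python
# Hypothetical toy lexicons, invented for illustration only.
OPINION_WORDS = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}


def opinion_score(text: str) -> float:
    """Sum opinion-word polarities, scaled by preceding intensifiers
    and flipped by a preceding negation."""
    score = 0.0
    negate = False
    weight = 1.0
    for raw in text.lower().split():
        token = raw.strip(".,!?")
        if token in NEGATIONS:
            negate = True
        elif token in INTENSIFIERS:
            weight *= INTENSIFIERS[token]
        elif token in OPINION_WORDS:
            value = OPINION_WORDS[token] * weight
            score += -value if negate else value
            # Modifiers apply only to the next opinion word.
            negate, weight = False, 1.0
    return score


print(opinion_score("The service was very good"))  # positive score
print(opinion_score("The flight was not good"))    # negative score
```

A positive score suggests a favourable opinion and a negative score an unfavourable one. A practical system would need a language-specific lexicon and proper tokenization.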

Naseem et al. (2019) proposed a model based on bidirectional long short-term memory (BiLSTM) and hybrid word representation, which increases the accuracy of the model. They considered out-of-vocabulary (OOV) terms, grammar, polysemy, syntax, and word sentiment in tweets about airlines. Soomro et al. (2020) analysed more than 18 million tweets, all associated with the novel coronavirus, to examine the relationship between public mood and the rise or fall in the number of coronavirus infections. Abdul-Mageed and Diab (2014) proposed a model in which sentiment analysis was used mainly to understand Arabic sentiment in YouTube comments and Twitter tweets. Garcia and Berton (2021) proposed a model that uses Twitter data to perform sentiment analysis in Portuguese. Their aim was to analyze the consequences of the pandemic in two geographic locations, the United States and Brazil.

Mishra et al. (2019) proposed a model that analyzes the sentiments of various reviews and used its results to create a hotel recommendation system. Jain and Dandannavar (2016) studied the steps needed to perform sentiment analysis on Twitter data using ML methods. They collected data from Twitter, used Natural Language Processing (NLP) for preprocessing, and then performed feature extraction to obtain sentiment-related features. They trained models with classifiers such as decision tree (DT), support vector machine (SVM), and naive Bayes (NB). Shuai et al. (2018) proposed a model built on Chinese hotel reviews. They also used Doc2Vec, with the data organized into balanced negative and positive emotions. After training the Doc2Vec model, the data were assessed using various classifiers such as SVM, logistic regression (LR), and NB. According to their findings, the SVM gave the best result among the classifiers, with an F-measure of 81.16%. From this review of work on extracting social opinion and emotions, we conclude that extracting opinions and emotions is important because they can cause damage to individuals, communities, and countries. In most cases the data set is collected randomly and no standard data set is available. Using OSINT tools and techniques together with AI/ML/DL could therefore enhance the results.

2.3 Cyber crime and organized crime

Digital forensics and open-source intelligence are the two broad types of cyber crime investigation. The Council of Europe Convention divides cyber crime into the following categories: crime against the CIA triad (confidentiality, integrity, and availability), content crime, computer crime, and other types of cyber crime. We treat terms such as computer crime, technological crime, high-tech crime, digital crime, and electronic crime as describing cyber crime. Human trafficking, pornography, child pornography, assassination, drug sales, terrorist activity, cybercrime markets, and cryptocurrency exchange are the eight primary cyber crimes highlighted by Nazah et al. (2020). The literature indicates that the choice of tools and techniques has a strong impact on cyber investigations. We found that no single tool or procedure can collect all the evidence that investigators require, so they use different combinations of tools and techniques to conduct a cyber-crime investigation. Quick and Choo (2018) proposed a framework based on OSINT that increases the accuracy of arresting the culprit, and applied OSINT in digital forensics to improve the analysis of criminal intelligence. Delavallade et al. (2017) proposed a model based entirely on social network data. The model extracts crime indicators from different social networks and predicts future crimes.

Shestak and Koscheeva (2021) described the crimes that belong to the category of cybercrime, how the coronavirus pandemic led to an increase in cybercrime rates, and possible mitigation procedures. A framework has been implemented for the analysis of incidents using related tweets, number of retweets, hashtags, and Uniform Resource Locator (URL) links. The geo-location, expert direction text position, and tweet time were extracted from the phrase using the n-gram method, which has an accuracy rate of 80% (Roddy and Holt 2022; Martin et al. 2020; Sinha et al. 2022; Kakkar 2020). Malhotra et al. (2021) proposed a model to assess whether a tweet is related to Crime Hub events. They used feature representation, tokenization, stop-word removal, stemming, and TF-IDF extraction for classification. A study was conducted to keep track of frequent crime hubs in the Indian states. Hub Watch can be used to collect information on potential risks, fraud, and the supply of public transportation. Three classifiers were used for decision planning, namely Bayes Network (BN), k-Nearest Neighbors (kNN), and DT.
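To make the tokenization and TF-IDF steps concrete, the following is a minimal pure-Python sketch. The stop-word list, the sample tweets, and the function names `tokenize` and `tf_idf` are assumptions made for illustration, and the weighting (relative term frequency times the natural log of inverse document frequency) is one common variant, not necessarily the one used in the cited work.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption); real pipelines use larger ones.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "at", "of", "near"}


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    tokens = [t.strip(".,!?#@").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


tweets = [
    "Robbery reported near the station",
    "Robbery suspect arrested",
    "Heavy rain in the city",
]
for vector in tf_idf(tweets):
    print(vector)
```

Terms that occur in fewer documents receive higher weights, so "reported" outweighs "robbery" in the first tweet. The resulting vectors could then be passed to any of the classifiers mentioned above.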

Fig. 4 Cyber investigation with OSINT tools (Kao et al. 2018)

The authors proposed a fusion architecture to help investigators identify and recreate crime scenes. Each stage combines logical procedures with abstract principles. To help LEAs enhance the effectiveness and efficiency of an investigation, some open-source tools for each phase of the proposed fusion architecture are shown in Fig. 4. Crime Hub data are classified into point division and link division. This approach categorised 467 tweets into link division and 3168 Crime Hub tweets into point division with accuracies of 94.66% and 76.85%, respectively (Valluripally et al. 2019). The research of Kadoguchi et al. (2020) led to the development of a device that can be used for real-time tracking of various crime hub locations across the Indian states. They used various ML classifiers to classify Crime Hub fraud; SVM had the highest precision among all classification methods, at 97.28%. Statistics on the approaches to the attacker’s target have been displayed as a strategy to assist hackers in avoiding cyber crime hub, incident, and hub maintenance (Roddy and Holt 2022).

The researchers called the system an Intelligent Transportation System. Ontology-based detection and tokenization methods were used to preprocess Twitter data as textual context information in order to identify a real-time crime hub. The values were then quantified with IDF extraction and classified. According to the results, the SVM method offers the highest precision: it has an accuracy rate of 91.1% for the current state of crime hub delays and 86.3% for Crime Hub jams. Crawling the dark web can be useful for investigators, as it can help them learn about new risks, new drug varieties, and new vendors (Celestini et al. 2017). Edwards et al. (2015) described five methods for data mining: natural language processing (NLP), information extraction, social network analysis (SNA), computer vision (CV), and ML. These methods use technologies to scrape data from internet sources about the links of a criminal or terrorist group. Liao et al. (2016) proposed iACE, a technology that can be used to automatically gather intelligence from numerous sources and analyse data relationships. AI may be trained to search for patterns that signal criminal activity in forensic data, such as network traffic. Phishing is a social engineering attack, as stated by Krombholz et al. (2015) and Ivaturi and Janczewski (2011). Both Pienta et al. (2018) and Gupta et al. (2017) suggested a phishing attack taxonomy but did not provide sufficient analysis of email phishing attacks.

The experiments of Albladi and Weir (2017) and Halevi et al. (2013) indicated that personality traits have a direct influence on a user’s ability to detect phishing email attacks, but owing to inconsistencies in their results they developed a hypothesized model stating that a user’s detection ability is influenced by personality traits, trust, competence, and motivation. Tandale and Pawar (2020) provided an overview of several forms of phishing attacks and detection methods. Additionally, they presented several phishing mitigation techniques. They concluded that among all existing techniques of anti-phishing using ML 100% accuracy can be achieved in detecting phishing. Alabdan (2020) provided a review and complete evaluation of recent phishing attack strategies to raise awareness of phishing techniques and of the different types of attacks. Rastenis et al. (2020) presented an e-mail-based phishing attack classification model that addresses the shortcomings of the previous phishing attack classification model. Kathrine et al. (2019) discussed many phishing attacks and the most recent preventive methods. Their study demonstrates how to detect and distinguish phishing attacks using ML algorithms.

Kunju et al. (2019) surveyed various methods for detecting phishing attacks. Their survey presented various ideas and techniques for detecting attacks, and they stated that several of the methodologies offered are ineffective in providing attack solutions. Aleroud and Zhou (2017) presented a taxonomy of phishing strategies, together with their vectors and countermeasures. The paper highlighted the vulnerabilities frequently exploited, and the taxonomy provides guidelines for developing successful and efficient anti-phishing techniques. Cui et al. (2017) proposed a method that counts the HTML tags used in the DOM of attack pages. They clustered attacks that fall within a specified distance of one another and stated that these clusters can be aggregated and used for detecting phishing attacks. Their findings revealed that their strategy can detect a substantial number of new phishing attacks.
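The tag-counting idea can be sketched with Python's standard `html.parser` module. The L1 distance between tag-count vectors and the greedy single-pass clustering below are simplifying assumptions chosen for illustration; they are not claimed to be the distance measure or clustering algorithm of Cui et al. (2017), and all function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_counts(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def tag_distance(a, b):
    """L1 distance between two tag-count vectors (an assumed metric)."""
    return sum(abs(a[t] - b[t]) for t in set(a) | set(b))


def cluster_pages(pages, threshold):
    """Greedy clustering: a page joins the first cluster whose
    representative lies within `threshold`, else starts a new cluster."""
    clusters = []  # list of (representative_counts, member_names)
    for name, html in pages.items():
        counts = tag_counts(html)
        for representative, members in clusters:
            if tag_distance(representative, counts) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((counts, [name]))
    return [members for _, members in clusters]


pages = {
    "login_a": "<html><body><form><input><input></form></body></html>",
    "login_b": "<html><body><form><input><input><a></a></form></body></html>",
    "article": "<html><body><p>x</p><p>y</p><p>z</p><div></div></body></html>",
}
print(cluster_pages(pages, threshold=2))
```

Pages generated from the same phishing kit tend to share a similar tag profile, so a new page that falls into an existing cluster of known attacks would be a candidate for closer inspection.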

Wang et al. (2020) proposed a phishing avoidance approach in which they installed an optical character recognition (OCR) system on an Android smartphone. To verify the effectiveness of the suggested preventive technique, they tested hijacking attacks. They stated that their OCR approach overcomes the drawbacks and constraints of existing solutions and is effective enough to detect phishing websites. Churi et al. (2017) proposed a prototype model for determining whether a website is a phishing site. They claimed that current phishing protection frameworks are ineffective at detecting phishing sites. They used a combination of visual cryptography and code-generating techniques. Their methodology creates an image, divides it into two shares using visual cryptography, and then combines these two shares to create an image captcha. To distinguish the genuine site from phishing sites, the user is asked to match the site with the image captcha. Stafford (2020) explored the causes and effects of phishing attacks and how users are affected by them. Humans are susceptible to phishing because of characteristics and behaviors such as narcissism, susceptibility, and frequent email use. The results show that spear phishing is the most targeted phishing technique.

2.4 Cyber security and cyber defense

Senekal and Kotzé (2019) used NLP to analyze WhatsApp chats in a notable investigation that estimated large-scale vandalism in South Africa. Some researchers observed the use of OSINT in criminal investigations that involve coordinated misbehavior and cybercrime (Kao et al. 2018). The growing availability of online services is driving the growth of a large amount of digital information (Herrera-Cubides et al. 2020). For additional security, AlKilani and Qusef (2021) proposed integrating some OSINT techniques with the corresponding clauses of the ISO 27001 standard. In this context, they presented a collection of OSINT tools/techniques that address hiring screening, background checks, and vendor risk assessment.

Raj and Meel (2022) proposed a model for fake news detection. They used two real-world data sets, Medival202 (Pogorelov et al. 2020) and CovidHeRA (Dharawat et al. 2020), and classified tweets into two classes, real and fake. To distinguish fake from real news, they used the following features: gender, media usage, sentiment polarity, follower count, friends count, status count, retweet count, and favorite count. Edwards et al. (2017) proposed a classification model that classifies a person as an employee or a non-employee of an organization, which helps to find vulnerabilities and to prevent social engineering attacks on the organization. The classifier used is DT, with sub-classifiers such as ‘Name subclassifier’, ‘Activity subclassifier’, ‘Writing fingerprint subclassifier’, ‘Link analysis subclassifier’, ‘Friend subclassifier’, and ‘Geographic subclassifier’. Yuan et al. (2021) proposed a model named Domain Adversarial and Graph-Attention Neural Network for the detection of fake news. Two real-world datasets, the 2015 MediaEval Twitter dataset (Boididou et al. 2015) and a Weibo dataset (Hu et al. 2020), were used for training and testing. The model integrates a domain discriminator, a multi-modal data feature extractor, and a graph-attention-based fake news classifier. Bi-LSTM was used to extract textual features and a pre-trained VGG-19 model to extract image features.

Cinelli et al. (2022) investigated coordinated and non-coordinated Twitter accounts in the context of the 2019 UK political election. For data collection they used the Twitter API, considering election-related hashtags, influential political accounts, official party accounts, and political leaders’ accounts. Islam et al. (2022) proposed a validation tool that uses cyber threat intelligence to automate the validation of security alerts and incidents for security operation centres. They collected ‘Indicators of compromise’ using OSINT from public websites and MISP (MISP F 2021). Ch et al. (2020) proposed a model to analyse the rate of cyber crimes state-wise. They used various ML techniques to classify cyber crime: NB for classification and K-means for clustering. The attributes used for classification include victim, incident, offender, age of the offender, harm, year, location, and cybercrime.

Ganesan and Mayilvahanan (2017) proposed a methodology that helps to identify unpredicted patterns. They collected data from various web pages and databases to help classify cyber crimes into the following classes: cyberbullying, identity theft, scams, stalking, robbery, harassment, and defamation. A multi-functional cybercrime intelligent system framework was proposed by Nouh et al. (2016). The main objective of this framework was to reduce the influence of cognitive biases throughout the investigation process. The technique comprises six main steps: identifying the issue, developing a hypothesis, gathering data, assessing the hypothesis, selecting a related hypothesis, and continuously monitoring incidents. Aslan et al. (2018) proposed a methodology to detect social network accounts (such as Twitter accounts) that are related to cyber security. They used various ML techniques, such as Random Forest, DT, and SVM, to automatically detect suspicious accounts. They also used behavioural features extracted from the collected tweets to identify the suspicious accounts.

Abbass et al. (2020) proposed a methodology that helps to predict various social network-based cyber crimes such as cyberbullying, cyber harassment, cyber stalking, cyber hacking, and cyber scams. They used data gathered from various social network websites and classified cyber crimes into different classes using Multinomial Naïve Bayes (MNB), KNN, and SVM. Kumar et al. (2020) proposed a methodology that uses the type of crime, time, and location to predict crime in certain regions of India. They used KNN for prediction; the crimes predicted by this method are robbery, accident, violence, gambling, murder, and kidnapping. Verma et al. (2019) used AI to predict terrorist incidents in India from the demographic and geographic information of past years’ incidents. Carloni (2014) proposed a methodology based on various ML techniques to detect and predict cyber attacks, using previous cyber crime data to train and test the model.
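As an illustration of how a KNN classifier can predict a crime type from time and location, here is a minimal pure-Python sketch. The feature layout (hour, latitude, longitude), the records, and the function name `knn_predict` are invented for illustration and do not reproduce the data or model of Kumar et al. (2020).

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (feature_tuple, label) pairs. Features are used
    unscaled here; real data would normally be normalized first.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: (hour of day, latitude, longitude) -> crime type.
records = [
    ((23, 28.61, 77.20), "robbery"),
    ((22, 28.62, 77.21), "robbery"),
    ((21, 28.60, 77.22), "robbery"),
    ((14, 19.07, 72.87), "gambling"),
    ((15, 19.08, 72.88), "gambling"),
    ((13, 19.06, 72.86), "gambling"),
]

print(knn_predict(records, (22, 28.60, 77.20)))
print(knn_predict(records, (14, 19.07, 72.88)))
```

Because KNN depends entirely on the distance measure, the choice and scaling of features largely determines how meaningful such predictions are.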

3 OSINT workflow

The OSINT workflow involves the following phases: collection, processing, analysis, knowledge extraction, dissemination, and planning, as shown in Fig. 5.

Fig. 5 OSINT workflow (Akhgar et al. 2017)

Collection is the phase in which we gather data from open sources according to the objective (Herrera-Cubides et al. 2020). In this phase, we collect data from social media, websites, forums, reports (NGO, government, court, and law enforcement), articles (academic research, journalism), media (news, interviews, video, and audio recordings), booklets, books, etc. In the processing phase, the collected data are processed according to the objective and synthesized so that they are easy to understand. The data are categorized as relevant or non-relevant, and the reliability of the sources from which they were collected is checked. In this phase we also translate the data if required, for example when they arrive in a different format or language from the one needed.
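The processing step described above (splitting collected items into relevant and non-relevant while checking source reliability) can be sketched as follows. The `Item` type, the reliability scores, the threshold, and the keyword matching are all hypothetical choices made for illustration; the workflow itself does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Item:
    source: str  # kind of open source the item came from
    text: str    # collected content


# Hypothetical reliability scores per source type (assumed values).
SOURCE_RELIABILITY = {"court_report": 0.9, "news": 0.7, "forum": 0.3}


def process(items, keywords, min_reliability=0.5):
    """Split items into (relevant, non_relevant).

    An item is relevant when its source is reliable enough and its text
    contains at least one keyword. Keywords are expected in lowercase.
    """
    relevant, non_relevant = [], []
    for item in items:
        reliable = SOURCE_RELIABILITY.get(item.source, 0.0) >= min_reliability
        on_topic = any(k in item.text.lower() for k in keywords)
        (relevant if reliable and on_topic else non_relevant).append(item)
    return relevant, non_relevant


collected = [
    Item("news", "Ransomware attack hits hospital"),
    Item("forum", "Unverified ransomware rumours"),
    Item("court_report", "Land dispute ruling"),
]
kept, dropped = process(collected, ["ransomware"])
print([i.source for i in kept], [i.source for i in dropped])
```

In practice the relevance test would be richer than keyword matching, and reliability would be assessed per source rather than per source type.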

In the analysis phase, we evaluate the processed, relevant data. We perform lexical analysis (Ghazi et al. 2018), SNA (Stieglitz et al. 2018), geo-spatial analysis (VoPham et al. 2018), etc. Lexical analysis involves the aggregation and analysis of text data collected from open sources. We also analyze the most frequently used terms, people’s social media accounts, and their demographic characteristics. In SNA, we examine social media accounts, connections, interest areas, etc. This analysis also covers the connected network of particular users/communities and their objectives. In geo-spatial analysis, we find the location of targets or groups of targets using different open-source tools and analyze the location coordinates. In the knowledge extraction phase, we extract relevant information from the collected data to fulfill the objective of the task.
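The three kinds of analysis can each be illustrated with a small standard-library sketch: term frequency for lexical analysis, degree centrality for SNA, and great-circle distance for geo-spatial analysis. These are common textbook measures chosen as examples, not the specific methods of the cited works; the function names, the length-based word filter, and the sample data are assumptions.

```python
import math
from collections import Counter, defaultdict


def top_terms(posts, n=3):
    """Lexical analysis: most frequent terms (words of 4+ letters,
    a crude stand-in for stop-word removal)."""
    words = [w.strip(".,!?").lower() for post in posts for w in post.split()]
    return Counter(w for w in words if len(w) > 3).most_common(n)


def degree_centrality(edges):
    """SNA: fraction of the other accounts each account is linked to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


def haversine_km(lat1, lon1, lat2, lon2):
    """Geo-spatial analysis: great-circle distance in kilometres,
    assuming a spherical Earth of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


posts = ["Phishing campaign targets banks", "New phishing kit found", "Banks warn users"]
print(top_terms(posts, 2))
print(degree_centrality([("a", "b"), ("a", "c"), ("a", "d")]))
print(round(haversine_km(28.61, 77.20, 19.07, 72.87)))
```

An account with high degree centrality is connected to many others in the collected network, which often makes it a natural starting point for further investigation.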

Dissemination is the process of delivering the results to policymakers, students, researchers, law enforcement, governments, organizations, etc. It serves both educational and research purposes. For education, people are taught about data privacy and security so that they can protect their data from cybercrime and other security issues. For research, dissemination helps to improve the methods, tactics, and techniques used to safeguard data for digital forensics. Additionally, it helps to utilize open-source data for investigation. In the planning phase, a plan is prepared for a particular task by analyzing and processing the outputs of the other stages of the OSINT workflow.

4 OSINT tools and techniques

OSINT has some generic categories, such as Geo-Spatial Intelligence (GEOINT), Human Intelligence (HUMINT), Signals Intelligence (SIGINT), Imagery Intelligence (IMINT), and Social Media Intelligence (SOCMINT). The advantages, disadvantages, and applications of these categories are listed in Table 1 (Omand et al. 2012; Williams and Blum 2018).

Table 1 Advantages, disadvantages and applications of generic categories of OSINT

Data can also be gathered manually, but this is time-consuming; collecting it with tools makes it easier to use at a later stage. Tools facilitate the collection phase by allowing you to gather information about multiple targets in minutes. Suppose the task is to find out whether a username is in use across all social media sites. One way is to log in to every social media site (perhaps you are not even aware of them all!) and test the username on each of them. Another option is to use an open-source tool that connects to many different websites and checks the username on each of them with minimal delay. Multiple tools are typically used to gather all the data related to the target, which is then combined and analyzed. The general attack surface for utilizing the different OSINT-based tools and techniques is shown in Fig. 6.

Fig. 6 General attack surface (Bazzell 2016)

Historical data is generally used to verify or investigate a particular incident. Several archives of historical data are available, such as collections of newspapers, maps, books, manuscripts, scientific information, and birth, marriage license, and death records. Some historical data websites are listed in Table 2.

Table 2 Historical data extraction websites

With the help of historical data, investigation and verification become easier. OSINT tools and techniques can automate historical data extraction and information verification.

Official data leak repositories provide leaked data such as spying and corruption data, company information, and restricted official materials. Some of these repositories are listed in Table 3.

Table 3 Official data leak repository

These data help in the verification and investigation of crime as well as in training DL-based models. To access various online services, people use a unique username. Using the username, we can easily collect data related to a particular user. Table 4 lists some username tools and techniques. These tools help to find the user’s accounts on social networks; for example, lullar is an OSINT tool that automatically generates the URLs of user profiles on different social networks.

Table 4 User name check tools
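
The idea of generating profile URLs from a username, as described above, can be sketched in a few lines of Python. The site names and URL templates below are hypothetical placeholders (real sites use different URL schemes that change over time), and the function name `candidate_profile_urls` is an assumption made for illustration. It is not the implementation of lullar or any other listed tool.

```python
from urllib.parse import quote

# Hypothetical URL templates; replace with the real URL schemes of the
# sites that are in scope for an investigation.
SITE_TEMPLATES = {
    "example-social": "https://social.example.com/{username}",
    "example-video": "https://video.example.org/user/{username}",
    "example-code": "https://code.example.net/{username}",
}


def candidate_profile_urls(username, templates=SITE_TEMPLATES):
    """Build one candidate profile URL per site for the given username."""
    # Percent-encode the username so it is safe to place in a URL path.
    safe_name = quote(username.strip(), safe="")
    return {site: tpl.format(username=safe_name) for site, tpl in templates.items()}


for site, url in candidate_profile_urls("jane.doe").items():
    print(site, url)
```

A generated URL is only a candidate: whether the profile exists, and whether it belongs to the target, still has to be verified, and any automated checking should respect each site’s terms of use.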

The real name is also used to collect data about the target. Some tools and techniques based on real names are listed in Table 5. Using these tools, based on the real name, we can collect other information such as email address, location, social media accounts, images, phone numbers, and age.

Table 5 Real name search tools and techniques

Email investigation examines an email’s header and body for information about the sender and recipient. In header analysis, we can obtain fields such as Received by, X-Received, Return-Path, Received-SPF, and authentication results. Received by contains the IP address of the server, the SMTP ID, and the date and time at which the email was received. X-Received is a field that is not defined in the official Internet protocol standards; it is created by the mail transfer agent and contains the same information as the Received by field. The Return-Path includes the IP address of the server, the receiver’s email address, encryption information, etc. Received-SPF includes the IP address of the sender along with the hostname.

The SPF check can return results such as pass, which means the email source is valid; soft-fail, which means the source may be fake; neutral, which means the validity of the source cannot be determined with certainty; and unknown, which means that no SPF record was found. In email body analysis, we check the language of the content, the signature, attachments, etc. We also search for the email ID in different search engines, in email look-up tools, in breached data related to that email, and on several social networks to collect information, as shown in Fig. 7. The open-source tools for email investigation are listed in Table 6.

Fig. 7 Email attack surface for investigation

Table 6 Email investigation tools
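
As a rough illustration of the header analysis described above, the sketch below uses Python’s standard `email` package to pull the Return-Path, the Received fields, and the SPF result out of a raw message. The raw message is fabricated for this example (documentation-only addresses and domains), and the helper name `summarize_headers` is an assumption.

```python
from email import message_from_string

# Fabricated message using reserved example domains and a documentation IP.
RAW_EMAIL = """\
Return-Path: <sender@example.org>
Received: from mail.example.org (mail.example.org [192.0.2.10])
        by mx.example.com with ESMTP id abc123;
        Mon, 03 Jan 2022 10:15:00 +0000
Received-SPF: softfail (example.com: domain of transitioning sender@example.org
        does not designate 192.0.2.10 as permitted sender) client-ip=192.0.2.10;
From: Sender <sender@example.org>
To: recipient@example.com
Subject: Test

Body text.
"""


def summarize_headers(raw):
    """Extract a few investigation-relevant header fields from a raw email."""
    msg = message_from_string(raw)
    spf = msg.get("Received-SPF", "")
    return {
        "return_path": msg.get("Return-Path"),
        # A message usually carries one Received field per mail server hop.
        "received": msg.get_all("Received", []),
        # The first token of Received-SPF is the result (pass, softfail, ...).
        "spf_result": spf.split()[0].lower() if spf else None,
    }


summary = summarize_headers(RAW_EMAIL)
print(summary["return_path"], summary["spf_result"], len(summary["received"]))
```

Header fields can be forged by the sender, so values extracted this way are leads to be corroborated rather than proof of origin.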

The phone number is also a key entity for collecting information about the target. Some phone number-based tools and techniques are listed in Table 7. Using the phone number, we can quickly obtain the owner’s details and device information.

Table 7 Phone number search

An Internet Protocol (IP) address helps to collect information such as the geographic location of the device, time zone, area code, and ISP. Some IP address-based geo-location gathering tools are listed in Table 8.

Table 8 IP geolocation information tools

We can easily collect lists of blacklisted IPs using tools. These blacklists are continuously updated, which may help in training DL models to detect cyber attacks. Some directories of blacklisted IPs are listed in Table 9.

Table 9 Blacklisted IP address
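
Once a blacklist has been downloaded, checking an address against it is straightforward. The sketch below uses Python’s standard `ipaddress` module; the blocklist entries are hypothetical and use documentation-only address ranges, and the function name `is_blacklisted` is an assumption. How a given directory publishes its list (plain text, CSV, API) is not covered here.

```python
import ipaddress

# Hypothetical blocklist entries (documentation-only ranges); a real list
# would be loaded from one of the blacklist directories.
BLOCKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole blocked range
    ipaddress.ip_network("198.51.100.17/32"),  # a single blocked host
]


def is_blacklisted(ip, blocklist=BLOCKLIST):
    """Return True if the address falls inside any blocklisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocklist)


for candidate in ("192.0.2.55", "198.51.100.18"):
    print(candidate, is_blacklisted(candidate))
```

Labels produced this way could serve as one feature for a detection model, although a blacklist hit alone does not establish that traffic is malicious.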

Images are also a key entity from which information can be collected. Image search tools help to collect images related to various incidents such as crime, education, breaking news, historical images, politics, and elections. Tools such as Google image search (2022), Bing image search (2022), Yahoo Images (2022), Yandex Image search (2022), and Baidu (2022) help to investigate cyber incidents. Some other image search tools and techniques are listed in Table 10.

Table 10 Image search tools

Nowadays, reverse image search is a trending technique that helps gather information about a particular image. Reverse image search tools help to collect information such as the device used to capture the image and the image’s location and source. Some reverse image search tools and techniques are listed in Table 11.

Table 11 Reverse image search

Image manipulation check tools help to analyze images. These tools also help to collect information such as the metadata of the image, the location where it was captured, and hidden pixels. Some tools and techniques for image manipulation check are listed in Table 12.

Table 12 Image manipulation check
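
When an image carries location metadata, EXIF stores the GPS position as degrees, minutes, and seconds together with a hemisphere letter (N, S, E, or W). The sketch below shows only the conversion to decimal degrees, which most mapping tools expect. Reading the EXIF tags themselves needs an image library and is not shown; the function name and the sample coordinates are assumptions for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert an EXIF-style degrees/minutes/seconds value to decimal degrees.

    ref is the hemisphere letter: "N" or "E" give positive values,
    "S" or "W" give negative values.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


# Made-up sample values standing in for tags read from an image file.
latitude = dms_to_decimal(48, 51, 29.6, "N")
longitude = dms_to_decimal(2, 17, 40.2, "E")
print(round(latitude, 6), round(longitude, 6))
```

Metadata can be stripped or edited, so a recovered location should be cross-checked against the visual content of the image.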

Video search tools help to find information such as the source of a video, the type of content in it, and its metadata. Tools such as Google video search (Google 2022c), Yahoo video search (Yahoo 2022), and Bing Video (Microsoft 2022b) also help to collect intelligence on crimes and information about victims, and to investigate crimes. Some other video search tools are listed in Table 13.

Table 13 Video search

Geospatial search tools help to collect location information and to analyze incidents at a particular location, its street view, and other information related to target locations. Google Maps (Google 2022b), Bing Maps (Microsoft 2022a), and Yandex map (2022) are common geospatial tools. Some geospatial tools and techniques are listed in Table 14.

Table 14 Geospatial search tools
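
A common building block when analyzing location coordinates is the great-circle distance between two points, for example to judge whether two sightings of a target are plausibly related. The following is a minimal sketch of the haversine formula, assuming a spherical Earth with a mean radius of 6371 km; the function name is an assumption and the approach is not tied to any of the listed tools.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 3))
```

The spherical approximation is usually adequate for this kind of plausibility check; precise surveying would use an ellipsoidal model instead.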

Air movement tracking OSINT tools and techniques help to gather a flight’s path and current location and to track the target’s live location on the move. They also help to monitor air traffic at any location at any given time. Some air movement tracking OSINT tools are listed in Table 15.

Table 15 Air movement tracking

The main step in conducting maritime research is to detect a vessel’s AIS (Automatic Identification System) signal. With the help of AIS, you can see the vessel’s name, position, destination, vessel type, and other information. OSINT tools and technology can be used to track vessels that are breaking the law, smuggling drugs, smuggling humans, fishing illegally, or practicing naval maneuvers (Smithrae 2021). Some OSINT tools and techniques for ship tracking are listed in Table 16.

Table 16 Maritime movements tracking tools
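
AIS messages are commonly exchanged as NMEA 0183 sentences (for example, sentences beginning with `!AIVDM`), each ending in a checksum that is the XOR of all characters between the start delimiter and the `*`. Before decoding positions from raw AIS feeds, it is worth discarding corrupted sentences. The sketch below covers only that validation step; the function names are assumptions and the sample sentence body is made up rather than a real vessel report.

```python
from functools import reduce


def nmea_checksum(payload):
    """XOR all characters of the payload and return two uppercase hex digits."""
    return format(reduce(lambda acc, ch: acc ^ ord(ch), payload, 0), "02X")


def is_valid_sentence(sentence):
    """Check that an NMEA sentence starts with '!' or '$' and its checksum matches."""
    sentence = sentence.strip()
    if len(sentence) < 4 or sentence[0] not in "!$" or "*" not in sentence:
        return False
    payload, _, given = sentence[1:].rpartition("*")
    return nmea_checksum(payload) == given.upper()


# Made-up sentence body (not a real AIS message), with its checksum appended.
body = "AIVDM,1,1,,A,TESTPAYLOAD,0"
sentence = "!" + body + "*" + nmea_checksum(body)
print(sentence, is_valid_sentence(sentence))
```

Decoding the payload itself (vessel identifier, position, speed) requires unpacking the 6-bit encoded message field, which dedicated AIS libraries handle.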

Using a Siamese network with a Region Proposal Network, Shan et al. (2020) proposed a marine ship tracker. They modified the CNN in the Siamese subnetwork to increase the effectiveness of feature abstraction and presented an adaptive search area extraction technique to lessen the impact of shaking. However, the proposed tracker works only for a limited range of vessels and weather conditions. Yang et al. (2022) proposed a coastal ship tracking network that integrates segmentation with general visual object tracking, so that ships can be tracked and segmented simultaneously. They used ERM with a more advanced feature pyramid fusion approach, which enhances the network’s feature map during feature extraction and fusion and thereby helps to increase tracking accuracy. They used the Large Maritime Dataset, which contains speed boats, cargo ships, passenger ships, fishing ships, and unmanned ships. Wang et al. (2021) proposed a model for ship detection using various ML classifiers such as Random Forest, SVM, LR, KNN, and LDA. They trained the model on satellite images, used an SDT filter in preprocessing to remove noise from the images, and used Google Earth data for validation. Spadon et al. (2022) proposed a model that addresses the behavior of AIS messages under irregular and noisy data using various ML and DL algorithms. Zhang et al. (2021) proposed a model, RoDAN, that fuses three dimensions (scale, motion, and region) of tracking information. They used an ASPP module for the scale dimension, which helps to obtain more accurate information for tracking ships. They used public marine datasets suitable for ship tracking, namely SMD (Prasad et al. 2017), HSD, and MarDCT (Gundogdu et al. 2016).

Package tracking tools help to track a package and obtain updates about it. These tools help to track packages used for smuggling drugs and weapons, as well as other illegal packages. Some OSINT tools and techniques for package tracking are listed in Table 17.

Table 17 Package tracking tools

5 Discussion

The challenges of OSINT investigation are: automating the collection, analysis, and knowledge extraction processes; integrating several open data sources; filtering out misinformation and irrelevant data; globalizing OSINT tools and techniques; raising awareness of privacy, ethical, and legal considerations; and preventing misuse of OSINT tools and techniques.

With regard to ethics, OSINT tools and techniques must be used sincerely and legally, for legitimate work only. The fact that data is publicly available does not mean that it is not sensitive. If someone’s political views, medical data, religious beliefs, or information about their family and friends is publicly available, this data might be used for threats or blackmail. For example, a person’s religious beliefs may lead the government and the public to suspect them of terrorism. We must therefore follow ethical principles and the applicable data privacy rules and regulations, such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and other laws applicable in the respective locations.

GDPR (EU 2016) was introduced by the European Union to regulate the processing of the personal data of European citizens. Since OSINT also involves the processing of personal data, several aspects of GDPR are relevant to it: accountability (everything you use and how you collect data must be documented), legal basis (consent (Article 6.1.a), legal obligation, or legitimate interest), principles (Article 5), and data subjects’ rights (Article 14). The principles are: lawfulness (the collection method must be legally sound; do not break authentication protocols to collect any type of data); fairness (collect only data that is related to your investigation); transparency (the data processor used, the type of data processed, and the purpose of processing must be transparent); data minimisation (process as little personal data as possible); storage limitation (do not store data longer than it is needed); and integrity and confidentiality (the data must be accessible only to you and your client, and should be stored in encrypted form). The data subjects’ rights include the right of notification, right of access, right of rectification, right to erasure, and right to restriction of processing.

In the case of surveillance (Miller et al. 2022), a huge amount of information is gathered from multiple sources to track and monitor individuals, groups, and organizations. This can raise issues of data privacy, data protection, civil liberties, and freedom of expression, which is why we have to ensure that we do not violate any individual’s rights. According to GDPR, you may process and store data only for as long as it is needed; after that, you must permanently delete all information regarding the particular investigation. The collected data must be stored in encrypted form and be accessible only to you and your client. OSINT is helpful for governments, LEAs, and cyber security professionals, but it must be used ethically and for legitimate work. Data collection must comply with the rules and regulations and must not break authentication protocols. Furthermore, organizations and individuals should have a clear understanding of the laws and regulations that govern the collection, storage, and use of OSINT. This includes understanding the rights of individuals and organizations concerning their personal information and being aware of the penalties for non-compliance.
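
The storage limitation requirement above can be supported by a routine retention check. The sketch below separates collected records into those still within a retention period and those due for deletion. The 90-day period, the record layout, and the function name are assumptions made purely for illustration: GDPR does not prescribe a fixed number of days, and the appropriate period depends on the purpose of the investigation.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(records, retention_days, now=None):
    """Split records into (keep, expired) based on their collection time.

    Each record is a dict with a timezone-aware "collected_at" datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in records if r["collected_at"] >= cutoff]
    expired = [r for r in records if r["collected_at"] < cutoff]
    return keep, expired


# Made-up records and an assumed 90-day retention period.
records = [
    {"id": 1, "collected_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"id": 2, "collected_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]
keep, expired = split_by_retention(
    records, retention_days=90, now=datetime(2024, 1, 31, tzinfo=timezone.utc)
)
print([r["id"] for r in keep], [r["id"] for r in expired])
```

The expired records would then be permanently deleted from every store that holds them, including backups and exported reports.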

OSINT must be used in accordance with the law and with data privacy and protection policies (Rajamaki and Simola 2019). OSINT is legal by definition, but it is necessary to ensure that data collected by researchers, investigators, government agencies, and LEAs is not published, even if it is publicly available. Authentication protocols should not be broken to collect data. In other words, OSINT must be used in a restricted manner, for legitimate and non-harmful activities. We need to strike a balance between what we collect and what we store or mention in the report after an investigation. In any inquiry we collect all the data we can, but we need to ensure that the information in the final report complies with GDPR and that the data is kept in encrypted form and accessed only by the investigator and the client.

With regard to misinformation, during the early stages of the COVID-19 pandemic, states were desperate for any means to combat the economic, health, and human consequences of the disease. The severity of the crisis prompted states to utilize or consider using any available tools, even those that had been previously limited to national security purposes. The utilization of intelligence and surveillance tools traditionally used for security, intelligence, and law enforcement for pandemic surveillance highlights the drastic measures that many states have taken to curb the spread and minimize the impact of COVID-19.

The technologies used for national security surveillance (such as location monitoring, facial recognition, and social media intelligence) were deployed to combat the COVID-19 pandemic. For example, the United States developed technology that uses social media intelligence (from blogs, news, social networks, and government sources) to counter the spread of COVID-19 (Berman et al. 2020). Some other countries (such as Israel, Pakistan, South Korea, Hong Kong, and China) also used their national security intelligence capabilities (such as location monitoring, CCTV surveillance, face recognition, tracking of individuals and groups, and GPS and cell phone tracking) to control the spread of COVID-19. The pandemic killed millions of people and affected economic conditions worldwide, so using national security intelligence systems and OSINT tools and techniques to counter it is justified. However, OSINT tools and techniques must be used with due regard for cyber laws and regulations, data privacy, data protection, freedom of expression, civil liberties, etc.

For the collection process, web crawlers, web scrapers, and API-based tools are used. In analysis and knowledge extraction, we generally perform semantic analysis, analysis of threat patterns, correlation with other events, and analysis of occurrences in the data to find relations between separate pieces of information and obtain the relevant information. Data mining techniques, NLP techniques, and SNA are also used to extract relevant information from the collected data. In the integration of multiple open-source data, we integrate the aggregated data from various social networks, the internet, and the dark web. For detecting fake information and filtering out irrelevant data and misinformation, ML and DL models are generally proposed. When using OSINT tools and techniques, we need to respect the privacy of users and of their family, friends, and co-workers, and to follow GDPR rules and regulations. OSINT is legal because the data sources are publicly available. We also need to remember that publicly available data are prone to abuse (Castelle 2018), cyber aggression (Kumar et al. 2018), cyberbullying (Hosseinmardi et al. 2015; Wachs et al. 2019), cyber gossip (García-Fernández et al. 2022), and cyber harassment (Abarna et al. 2022).
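
To make the collection step more concrete, the sketch below shows the core of a very simple scraper: extracting the links from a page’s HTML so that a crawler could follow them. It uses only Python’s standard library and works on an HTML string; fetching pages, rate limiting, and respecting robots.txt and site terms are deliberately left out. The class and function names and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content standing in for a downloaded web page.
page = '<p><a href="/about">About</a> <a href="https://other.example.org/x">X</a></p>'
print(extract_links(page, "https://www.example.com/index.html"))
```

A crawler would repeat this step for each newly discovered link, keeping a set of visited URLs to avoid loops.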

Misuse of OSINT tools and techniques can cause loneliness, depression, and distress, and in the worst case victims may commit suicide. OSINT can be used positively for HR recruitment, countering cyber criminals, preventing the spread of misinformation, detecting fake information, profiling cybercriminals, digital forensics, etc. Profiling cybercriminals in darknet markets is one of the use cases of OSINT tools and techniques. The challenge, however, is that profiling a person who in reality does not pose a threat leads to discriminatory and unfair attitudes that may harm that person. OSINT helps to profile cyber attacks and improves the detection of sophisticated cyber attacks (Akinrolabu et al. 2018). It is beneficial in the private sector and also serves as a resource of public interest for governments (Lande and Shnurko-Tabakova 2019). It also helps in classified investigations and secret operations (Larsen et al. 2017) and in investigation and strategic planning to counter crime (Akhgar 2016).

6 Conclusion and future work

This paper discussed the current state of open source intelligence (OSINT). It revealed that current techniques are inadequate due to their lack of effectiveness in real-world scenarios. The paper highlighted the importance of integrating OSINT into cyber defenses, social networks, digital forensics, and other possible domains to identify the profiles of culprits, cyber crimes, and cyber incidents and to support counter-terrorism. The paper also outlined basic OSINT search techniques and described advanced OSINT tools. Proper tool selection based on available data and goals is crucial, but using a combination of tools is the best way to achieve accurate results.

In this paper, we also described open source data, social networks, various types of cybercrime, and OSINT tools and techniques combined with NLP/AI/ML/DL, which can help investigators to enhance OSINT research and applications. Generally, IOCs are extracted from traditional blacklists such as cleanMX and Phishtank, which cover only URLs, domains, IPs, and MD5 signatures and do not include criminal groups or other contextual information. Using OSINT, we can add psychological and behavioral features. We conclude that OSINT can help to address cyber security issues in various fields, such as phishing detection, threat intelligence, hate speech detection, fake news detection, human trafficking, child trafficking, and criminal profiling. OSINT also helps to monitor APT groups, analyze fake images and videos, monitor malicious activities, and trace violent acts. In phishing detection, the various available datasets are used to train the model; if automatically updated features, attack patterns, and IOCs are used as well, the model might become more robust and efficient.

The challenges that need to be addressed are: extracting indicators of compromise and their relations from unstructured threat intelligence reports and using that extracted knowledge for threat hunting; building a consortium of criminals in order to track them; developing a framework covering every possible OSINT tool and technique; automating information gathering and intelligence extraction from open source data using AI; establishing a real-time model that uses OSINT and DL techniques to monitor Advanced Persistent Threat (APT) groups automatically; and creating a multi-platform, multi-lingual, social-network-based benchmark dataset for cybercrime detection with the help of OSINT tools and techniques. Automatic detection of human trafficking and child trafficking with OSINT and DL techniques also needs to be explored.
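
As a starting point for the first of these challenges, simple indicators such as URLs, IPv4 addresses, and MD5 hashes can be pulled out of free text with regular expressions. The sketch below is a deliberately naive baseline under stated assumptions: the patterns, the function name `extract_iocs`, and the sample report text are all illustrative, and the approach extracts neither relations between indicators nor contextual information such as the criminal group involved.

```python
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_iocs(text):
    """Extract candidate IPv4 addresses, MD5 hashes, and URLs from free text."""
    ips = []
    for candidate in IPV4_RE.findall(text):
        try:
            # Reject dotted numbers that are not valid addresses (e.g. 999.1.1.1).
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        ips.append(candidate)
    # Strip punctuation that commonly trails a URL at the end of a sentence.
    urls = [u.rstrip(".,;:)") for u in URL_RE.findall(text)]
    return {
        "ipv4": sorted(set(ips)),
        "md5": sorted({m.lower() for m in MD5_RE.findall(text)}),
        "urls": sorted(set(urls)),
    }


# Fabricated report text using documentation-only addresses and domains.
report = (
    "The dropper (MD5 d41d8cd98f00b204e9800998ecf8427e) contacted 203.0.113.7 "
    "and fetched http://malware.example.com/payload.bin. "
    "Version 999.1.1.1 is not an address."
)
print(extract_iocs(report))
```

Real threat reports often obfuscate indicators (for example, writing `hxxp` or `[.]`), so a practical extractor would normalize such forms first and then attach each indicator to the surrounding context.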