Introduction

Artificial intelligence has made great leaps of progress since the time of the Logic Theorist programme (Simon, Newell, Shaw 1956), in which a machine was created to reason like a human. Since then, AI has evolved continuously, both in its algorithms and in the computing power required to run them. These days, one of the core pillars of AI is data: without data, AI cannot be trained, and any inference rests on false information. Defence organizations use AI in areas such as detection, planning, and field operations. Managing this data requires a structured storage area, so designing and developing an AI repository for defence applications requires careful consideration of several factors. A well-designed AI repository needs to store and manage large volumes of data, taking into account parameters such as the type and volume of the data to be stored, the purpose of the AI algorithm to be developed, and the format in which the data is to be stored, while also providing access to this data.

In most cases, the data must be annotated with accurate labels, and the labelling process should comply with ethical and legal standards. Appropriate data management practices shall be defined regarding cataloguing, metadata, storage, and documentation. Access to the data should be restricted and secured to prevent unauthorized use and potential security risks, and authentication mechanisms should be chosen with the application domain in mind. Versioning is also important, to capture the evolution of the data, trace changes to datasets, and manage incompleteness. Additionally, the repository should follow community-endorsed interoperability best practices to facilitate data exchange and reuse within and across relevant disciplines, such as security applications, enabling researchers to advance their scientific work on a need-to-know basis. Finally, documentation of the data provenance and quality assurance processes should be meticulously maintained to ensure the transparency and reliability of the AI models developed from the data. All of these parameters should be taken into account when selecting the best possible design approach for any individual repository.

Methodological Approach

The methodology for designing and developing the FaRADAI Dataset Repository (FDR) involves several key steps to achieve its main objectives. The initial methodological step focuses on the identification and collection of datasets from various existing sources to meet the user requirements and use case needs. The datasets may include non-cooperative and cooperative tracking data, ground, marine, and airborne electro-optical data, radio signals, and other relevant data sources.

Once the datasets are gathered, the second step is a thorough assessment of the quality and relevance of the data. This assessment examines factors such as data accuracy, completeness, consistency, and data source reliability, and it ensures that the selected datasets meet the desired criteria and are suitable for the intended AI algorithms and applications.

Data preparation is a crucial third step that follows the assessment of the data and involves cleaning, preprocessing, and transforming the datasets to make them suitable for AI algorithms. This process includes tasks such as data cleaning, normalization, and data formatting, which ensure consistency and compatibility across different datasets.

Before storing the data, the fourth step is to clearly articulate the purpose of the AI algorithms that will be applied to the datasets. This involves identifying the specific objectives, tasks, or analyses that the AI algorithms will perform on the data. Defining the purpose of the algorithms contributes to the design of the repository and to an organization of the data that aligns with the intended use cases.

The fifth step involves the actual design and development of the FDR. The repository is designed according to the principles defined in the previous steps, as well as to the chosen repository characteristics, including scalability, security, metadata management, versioning, data transfer, and collaboration features. The development process includes implementing the necessary software components, user interfaces, and backend infrastructure to support the repository's functionalities.

In the sixth step, the collected and prepared datasets are stored in the developed FaRADAI Dataset Repository. The repository should provide a secure and scalable storage infrastructure capable of accommodating the volume and diversity of the collected datasets. Proper data organization, indexing, and storage practices are implemented to ensure efficient data retrieval and management.

The seventh step follows the development phase, in which thorough testing and validation of the FDR functionalities are performed. This involves conducting several tests to ensure that the repository functions as intended, covering dataset uploading and downloading, metadata management, searchability, access control, versioning, and collaboration features.

Once the FDR is tested and validated, access is granted to the intended users. User roles and access rights are defined, allowing authorized partners to securely access and collaborate on the datasets stored in the repository, and proper access controls and authentication mechanisms are implemented to ensure data security and privacy. Once access is granted, the final step is the monitoring and tracking of users' activities within the FDR. This involves implementing logging and auditing mechanisms to record user interactions, including dataset access, downloads, and modifications. By monitoring access, any unauthorized activities can be detected, while the security and integrity of the data are preserved.

By following the aforementioned methodological steps, as illustrated in Fig. 35.1, the design and development of the FDR can be carried out effectively, ensuring the availability and accessibility of the relevant datasets for analysis and collaboration.

Fig. 35.1
Methodological approach in steps for designing and developing the FDR

Implementation Principles and Decision Tree

This section describes the implementation principles that guided the design and development path, up to the final release of the FDR and its final step. For several of these steps, multiple approaches were available and were evaluated against the specific needs of the project.

Identification and Collection of Datasets

The outcome of successful data identification and collection sets the foundation for FDR design. It directly influences the choice of storage solutions. For example, if data sources are distributed across various locations or systems, alternatives such as distributed databases or cloud-based storage can be explored to ensure efficient data access and retrieval. Moreover, with a vast volume of data, deduplication strategies might be implemented to optimize storage space, removing redundancy in datasets. Another alternative to consider is federated databases, allowing data to remain at its source but still be accessible through the FDR. These choices align the design with the specific needs and characteristics of the collected datasets (Fig. 35.2).
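As a minimal sketch of one such deduplication strategy, the following Python snippet flags exact byte-level duplicates by content hash before ingestion; the chunked SHA-256 digest is an illustrative choice rather than the FDR's actual mechanism.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    """Keep the first file seen for each digest; report later duplicates."""
    seen, duplicates = {}, []
    for path in paths:
        digest = file_digest(path)
        if digest in seen:
            duplicates.append((path, seen[digest]))  # exact byte-level duplicate
        else:
            seen[digest] = path
    return seen, duplicates
```

Content hashing only catches exact copies; near-duplicate detection across sensor formats would require more specialized similarity measures.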

Fig. 35.2
Step 1: identification and collection of datasets

Data Assessment

Thorough data assessment informs the FDR’s design in several ways. High-quality, well-structured data may require less extensive preprocessing, affecting decisions regarding data cleaning and normalization. In cases where data quality is lower, the FDR’s design should incorporate advanced data cleaning algorithms to handle data imperfections and outliers effectively. The alternatives may involve developing custom data cleaning routines that align with the unique characteristics of the data or using specialized data transformation techniques to enhance the data’s readiness for AI algorithms.
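A lightweight illustration of such an assessment follows, assuming tabular data loaded with pandas; the indicators and the interquartile-range outlier rule are illustrative, not the project's actual criteria.

```python
import pandas as pd

def assess_quality(df: pd.DataFrame) -> dict:
    """Compute simple quality indicators for a tabular dataset."""
    report = {
        "rows": len(df),
        "completeness": 1.0 - df.isna().mean().mean(),  # share of non-missing cells
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Flag numeric outliers with the 1.5 * IQR rule, column by column.
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    report["outliers_per_column"] = outliers.to_dict()
    return report
```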

Preparation

The preparation of data is critical for seamless integration into the FDR. Cleaned, normalized, and AI-ready datasets expedite processing and analysis. The design must ensure that the data is compatible with the repository’s structure. For datasets arriving in various formats, the FDR might include format transformation modules to standardize data. Alternatives involve focusing on a specific data format to minimize format transformation, but this may require stricter data source requirements or additional data preparation at the source. The choice between these alternatives depends on the extent of data format variation and the trade-offs between standardization and source data flexibility (Fig. 35.3).
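As an illustration of such a format transformation module, the sketch below normalizes heterogeneous tabular inputs into a single canonical format. Parquet as the target and the supported input formats are assumptions made for the example, and the snippet requires pandas with a Parquet engine such as pyarrow installed.

```python
import pandas as pd
from pathlib import Path

# Supported input formats; Parquet is the assumed canonical output format.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}

def standardize(src: Path, dest_dir: Path) -> Path:
    """Read a dataset in any supported format and rewrite it as Parquet."""
    reader = READERS.get(src.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported format: {src.suffix}")
    df = reader(src)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.stem + ".parquet")
    df.to_parquet(dest, index=False)
    return dest
```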

Fig. 35.3
Steps 2 and 3: data assessment and preparation

Definition of the Purpose of AI Algorithms

Clarity in articulating the objectives of AI algorithms shapes the design of the FDR. Clear objectives facilitate data categorization and indexing, allowing for efficient data retrieval and usage. Alternatives might include advanced search functionalities if the use cases are complex or subject to frequent changes in algorithm objectives. In cases where algorithms serve multiple, distinct purposes, the FDR design may emphasize robust tagging and metadata management, enabling flexible search and retrieval based on different criteria (Fig. 35.4).
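A minimal sketch of such tagging and multi-criteria search is given below; the record fields and the example catalogue entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    modality: str                  # e.g. "electro-optical", "radio"
    purpose: str                   # intended AI task, e.g. "detection"
    tags: set = field(default_factory=set)

def search(records, **criteria):
    """Return records matching every criterion; 'tags' checks membership."""
    return [
        rec for rec in records
        if all(
            value in rec.tags if key == "tags" else getattr(rec, key) == value
            for key, value in criteria.items()
        )
    ]

catalogue = [
    DatasetRecord("uav_eo_01", "electro-optical", "detection", {"airborne"}),
    DatasetRecord("rf_scan_02", "radio", "classification", {"ground"}),
]
print(search(catalogue, modality="electro-optical", tags="airborne"))
```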

Fig. 35.4
Step 4: definition of the purpose of AI algorithms

Datasets Storage

The secure and scalable storage of data is crucial, especially for defence applications dealing with substantial data volumes. The FDR’s design shall account for storage that can accommodate growth. For enhanced data security, alternatives include encrypted storage solutions. Integrating cloud storage might also be considered, and a multi-cloud approach can provide redundancy and improved data availability. The choice between on-premises and cloud-based storage is a critical decision, and it influences both the repository’s architecture and its scalability (Fig. 35.5).
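As a sketch of client-side encryption before a cloud upload, assuming the Python cryptography package; the key handling and the upload call are hypothetical placeholders, not the FDR's actual pipeline.

```python
from pathlib import Path
from cryptography.fernet import Fernet  # pip install cryptography

def encrypt_for_upload(src: Path, key: bytes) -> Path:
    """Encrypt a dataset file client-side before it leaves the premises."""
    token = Fernet(key).encrypt(src.read_bytes())
    dest = src.with_name(src.name + ".enc")
    dest.write_bytes(token)
    return dest

key = Fernet.generate_key()       # store in a key-management service, not with the data
encrypted = encrypt_for_upload(Path("tracks.parquet"), key)
# upload_to_cloud(encrypted)      # hypothetical, provider-specific upload call
```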

Fig. 35.5
Step 6: datasets storage

Design and Development of FDR

The design options for the FDR components are directly shaped by the outcomes of the previous steps, which are summarized here. For data-intensive tasks, such as AI algorithms that require substantial computational power, the FDR might opt for an asynchronous processing unit (APU) with a queue implementation to efficiently manage processor- and storage-intensive tasks. Where ease of data access is a priority, REST API services might be preferred for communication between the various components of the FDR; access to these services is restricted to the components of the FDR, and no external access is allowed.

To support user-friendliness, a web interface can be considered for time-efficient dataset management and administrative tasks. The web interface should show visitors which datasets they have created (or have access to) and let them manage their access rights; it should also allow certain users to manage access to the FDR itself. Additionally, a client module with command-line utilities could be developed to cater to specific user needs for uploading, downloading, and managing datasets. The client module simplifies interaction with the FDR, making its adoption much quicker and easier. The push-and-pull paradigm used with Git repositories is adopted here, as technical users are already familiar with it, and versioning is likewise mandatory to ensure traceability and to document the changes made to a dataset over time. Retrieving datasets is possible only through this client application, thereby restricting access to them. The client application can also be distributed through secure means, and its usage could carry an expiry date, thereby deactivating old versions. Security is an important factor, so restricted access, token sharing, and expiry should also be implemented.

For the purposes of the FDR, a cloud-based solution, instead of an on-premises deployment, was selected for storing the datasets; the FDR could also support other storage options depending on the application domain. With these considerations in mind, the preliminary design review of the foreseen FDR is illustrated in Fig. 35.6, and a sketch of the client workflow follows below.
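The following sketch illustrates the Git-like push/pull workflow of such a client module with token-based authentication; the endpoint URL, API paths, and response fields are hypothetical, not the FDR's actual interface.

```python
import requests

FDR_API = "https://fdr.example.org/api/v1"   # hypothetical endpoint

class FDRClient:
    """Minimal push/pull client mirroring the Git-style workflow."""

    def __init__(self, token: str):
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def push(self, name: str, path: str) -> str:
        """Upload a new dataset version; the server assigns the version id."""
        with open(path, "rb") as f:
            resp = self.session.post(
                f"{FDR_API}/datasets/{name}/versions", files={"file": f}
            )
        resp.raise_for_status()
        return resp.json()["version"]

    def pull(self, name: str, version: str, dest: str) -> None:
        """Download a specific dataset version for traceable experiments."""
        resp = self.session.get(f"{FDR_API}/datasets/{name}/versions/{version}")
        resp.raise_for_status()
        with open(dest, "wb") as f:
            f.write(resp.content)
```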

Fig. 35.6
Step 5: design and development of FDR

Testing and Validation of the FDR Functionalities

The outcomes of testing and validation directly guide design refinements. If performance issues are identified during testing, alternatives can be explored: for example, optimizing database indexing can enhance data retrieval speed, while implementing caching mechanisms for frequently accessed data can improve response times. Thorough testing will also uncover specific requirements, driving the selection of design alternatives that align the FDR's capabilities with the performance and functionality demands of the users (Fig. 35.7).
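As a minimal illustration of caching frequently accessed data, the sketch below memoizes a simulated slow metadata lookup; the lookup function and cache size are illustrative assumptions.

```python
import time
from functools import lru_cache

def query_metadata_from_db(dataset_id: str) -> dict:
    """Hypothetical slow lookup standing in for an indexed database query."""
    time.sleep(0.5)
    return {"id": dataset_id, "fetched_at": time.time()}

@lru_cache(maxsize=1024)
def cached_metadata(dataset_id: str) -> dict:
    """Serve repeated metadata requests from memory instead of the database."""
    return query_metadata_from_db(dataset_id)

cached_metadata("uav_eo_01")   # first call is slow and populates the cache
cached_metadata("uav_eo_01")   # repeated calls return instantly from the cache
```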

Fig. 35.7
Step 7: testing and validation of the FDR functionalities

Granting Access

Access control requirements, driven by the specific needs of project partners, significantly influence the design decisions related to user roles, permissions, and access to the repository. Alternatives include the implementation of role-based access control (RBAC) for fine-grained permissions, enabling granular control over who can access and manipulate specific datasets. Additionally, employing single sign-on (SSO) solutions can streamline partner access by allowing users to access the FDR using their existing credentials. The choice of access control mechanisms directly impacts FDR design for user management, ensuring the protection and controlled access of sensitive data.
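A minimal RBAC sketch follows; the role names and permission set are illustrative, and a real deployment would load them from configuration or an identity provider rather than hard-coding them.

```python
from enum import Enum, auto

class Permission(Enum):
    READ = auto()
    UPLOAD = auto()
    MANAGE_ACCESS = auto()

# Hypothetical role set for the example.
ROLES = {
    "viewer": {Permission.READ},
    "contributor": {Permission.READ, Permission.UPLOAD},
    "administrator": {Permission.READ, Permission.UPLOAD, Permission.MANAGE_ACCESS},
}

def is_allowed(role: str, permission: Permission) -> bool:
    """Check a user's role against the requested operation."""
    return permission in ROLES.get(role, set())

assert is_allowed("contributor", Permission.UPLOAD)
assert not is_allowed("viewer", Permission.MANAGE_ACCESS)
```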

Access Monitoring

The implementation of real-time alerts and notifications for access monitoring directly enhances data security and accountability. Alternative approaches may include the utilization of machine learning-based anomaly detection algorithms to identify suspicious access patterns and potential security breaches. Another alternative is the implementation of blockchain-based access auditing, which provides an immutable ledger of data access events, offering enhanced data security and integrity. The choice of the most appropriate alternative should align with the level of access control and monitoring required for the project’s security and compliance needs.
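As a deliberately simple stand-in for a learned anomaly detector, the sketch below flags users whose daily download counts greatly exceed the median observed in the audit trail; the rule, the threshold, and the log format are illustrative assumptions.

```python
from collections import Counter
from statistics import median

def flag_anomalies(access_log, factor: float = 10.0):
    """Flag (user, day) pairs whose download count greatly exceeds the median.

    access_log is an iterable of (user, day) tuples taken from the audit trail.
    """
    counts = Counter(access_log)
    typical = median(counts.values())
    return [key for key, n in counts.items() if n > factor * typical]

log = [("alice", "d1")] * 4 + [("bob", "d1")] * 5 + [("mallory", "d1")] * 60
print(flag_anomalies(log))   # [('mallory', 'd1')]
```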

Conclusions

This chapter details a systematic process that encompasses several key methodological steps. It begins with the identification and collection of diverse datasets from varied sources, spanning non-cooperative and cooperative tracking data, electro-optical data, radio signals, and more. These datasets are meticulously assessed to ensure their quality and relevance, aligning them with specific user requirements and use cases. Subsequently, data preparation measures are implemented, including data cleaning, normalization, and format standardization, rendering the datasets compatible with AI algorithms. To achieve the highest level of efficacy, the repository is designed with several critical characteristics. Security remains paramount, with robust access controls, user authentication, and role-based permissions to protect sensitive defence-related data. Metadata management and searchability ensure efficient dataset organization and retrieval. The FDR also supports versioning, format transformation, and long-term preservation to accommodate evolving AI needs and guarantee data integrity. Moreover, the chapter underscores the significance of monitoring data access within the repository. This monitoring mechanism enables the timely detection of unauthorized or unusual access patterns, reinforcing data security and user accountability.

In summary, this chapter provides insights into the methodological approach underpinning the creation of an AI dataset repository tailored for defence applications. The FDR’s design principles prioritize security, accessibility, and scalability, making it an invaluable asset for frugal and robust AI in the defence sector. Furthermore, the aforementioned methodology outlined in this study offers a versatile blueprint that extends beyond its immediate application, holding significant promise for similar use cases within the broader AI domain. Its systematic approach to data handling, encompassing identification, assessment, preparation, storage, access control, and monitoring, can be readily adapted to a multitude of AI applications. As the field of AI continues to evolve and confront new challenges, the adaptability of this methodology positions it as a foundation for building AI repository solutions that can flexibly respond to the ever-changing landscape of data and technology. It stands as a testament to the potential for sustainable and effective data management in the AI domain, enabling not only the optimization of current applications but also the preparation for future AI endeavours and their evolving data needs.