Abstract
This chapter is a response to the increasing demand for a versatile AI dataset repository in a rapidly evolving landscape that necessitates frugal and robust AI solutions for defence applications. In an era characterized by an exponential growth in data generation, the need for a repository that can accommodate diverse and extensive datasets has become paramount. The evolving nature of defence and security challenges, marked by the reliance on AI for critical decision-making, further underscores the urgency of this repository. With a wealth of non-cooperative and cooperative tracking data, electro-optical data, radio signals, and more at play, the chapter addresses the imperative to provide a secure, accessible, and scalable repository that aligns datasets with specific user requirements and use cases. This comprehensive repository design acknowledges not only the massive volumes of data at hand but also the importance of long-term data preservation, access control, and data integrity. It offers a methodological blueprint for building a robust AI dataset repository capable of meeting the ever-increasing demands of a dynamic AI landscape. In this context, the chapter includes the methodological approach for the creation of a specialized AI dataset repository, which is being developed within the framework of the EU-funded FaRADAI project (GA no. 101103386) and aims at advancing AI technology for defence applications. The FaRADAI Dataset Repository (FDR) plays a pivotal role in facilitating collaboration among project partners by serving as a centralized hub for secure and efficient data storage and access.
Introduction
Artificial intelligence has made great leaps of progress since the time of the Logic Theorist programme (Simon, Newell, Shaw 1956), one of the first attempts to create a machine that reasons like a human. Since then, there has been a continuous evolution of AI in terms of algorithms and the associated computing power required to make it possible. These days, one of the core pillars of AI is data. Without data, AI models cannot be trained, and any inference they produce rests on false information. Defence organizations use AI in different areas such as detection, planning, and field operations. The management of this data requires a structured storage area, and thus designing and developing an AI repository for defence applications requires careful consideration of several factors. A well-designed AI repository needs to store and manage large volumes of data, taking into account parameters such as the type and volume of data to be stored, the purpose of the AI algorithm to be developed, and the format in which the data is to be stored, while also providing access to this data.
In most cases, the data must be annotated with accurate labels, and the labelling process should comply with ethical and legal standards. Appropriate data management practices should be defined regarding cataloguing, metadata, storage, and documentation. Access to the data should be restricted and secured to prevent unauthorized use and potential security risks, and authentication mechanisms should be chosen with the application domain in mind. Versioning is also important to capture the evolution of the data, track changes to datasets, and manage incompleteness. Additionally, the repository should follow community-endorsed interoperability best practices to facilitate data exchange and reuse within and across relevant disciplines, such as security applications, enabling researchers to advance their scientific work on a need-to-know basis. Finally, documentation of the data provenance and quality assurance processes should be meticulously maintained to ensure transparency and reliability of the AI models developed from the data. All the aforementioned parameters should be taken into account when selecting the best possible design approach for an individual repository.
Methodological Approach
The methodology for designing and developing the FDR involves several key steps to achieve its main objectives. The initial methodological step focuses on the identification and collection of datasets from various existing sources to meet the user requirements and use case needs. The datasets may include non-cooperative and cooperative tracking data, ground, marine, and airborne electro-optical data, radio signals, and other relevant data sources.
Once the datasets are gathered, a thorough assessment is conducted to evaluate the quality and relevance of the data as the second step. This assessment involves examining factors such as data accuracy, completeness, consistency, and data source reliability. It ensures that the selected datasets meet the desired criteria and are suitable for the intended AI algorithms and applications.
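Parts of this assessment can be automated. The following is a minimal sketch of such a check, assuming records arrive as Python dictionaries; the field names (`sensor_id`, `timestamp`, `value`) are illustrative placeholders, not the actual FDR schema.

```python
# Sketch of an automated quality check over a batch of records.
# REQUIRED_FIELDS is an illustrative assumption, not the FDR schema.

REQUIRED_FIELDS = {"sensor_id", "timestamp", "value"}

def assess_records(records):
    """Return simple completeness and consistency scores in [0, 1]."""
    if not records:
        return {"completeness": 0.0, "consistency": 0.0}
    # Completeness: all required fields present and non-null.
    complete = sum(1 for r in records
                   if REQUIRED_FIELDS <= r.keys()
                   and all(r[f] is not None for f in REQUIRED_FIELDS))
    # Consistency: timestamps must be non-decreasing per sensor.
    last_seen = {}
    consistent = 0
    for r in records:
        sid, ts = r.get("sensor_id"), r.get("timestamp")
        if sid is None or ts is None:
            continue
        if last_seen.get(sid, float("-inf")) <= ts:
            consistent += 1
        last_seen[sid] = ts
    return {"completeness": complete / len(records),
            "consistency": consistent / len(records)}
```

Scores like these can then feed a go/no-go decision on whether a candidate dataset meets the desired criteria.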
Data preparation is a crucial, third step that follows the assessment of data and involves cleaning, preprocessing, and transforming the datasets to make them suitable for AI algorithms. This process includes but is not limited to tasks such as data cleaning, normalization, and data formatting to ensure consistency and compatibility across different datasets.
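As a minimal illustration of this step, the sketch below drops incomplete records and min-max normalizes a numeric field to the range [0, 1]; the field name is an assumption for demonstration purposes, not a project convention.

```python
# Minimal data preparation sketch: cleaning (drop missing values) followed
# by min-max normalization of one numeric field. The field name "value"
# is an illustrative assumption.

def prepare(records, field="value"):
    cleaned = [r for r in records if r.get(field) is not None]
    if not cleaned:
        return []
    values = [r[field] for r in cleaned]
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # avoid division by zero on constant columns
    return [{**r, field: (r[field] - lo) / span} for r in cleaned]
```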
Before storing the data, it is of utmost importance to clearly articulate the purpose of the AI algorithms that will be applied to the datasets. This involves identifying the specific objectives, tasks, or analysis that the AI algorithms will perform on the data. Defining the purpose of the algorithms contributes to the design of the repository and organization of the data in a way that aligns with the intended use cases.
This fifth step involves the actual design and development of the FDR. The repository is designed according to the principles defined in the previous steps as well as the chosen repository characteristics, including scalability, security, metadata management, versioning, data transfer, and collaboration features. The development process includes implementing the necessary software components, user interfaces, and backend infrastructure to support the repository functionalities.
The collected and prepared datasets are stored in the developed FaRADAI Dataset Repository. The repository should provide secure and scalable storage infrastructure capable of accommodating the volume and diversity of the collected datasets. Proper data organization, indexing, and storage practices are implemented to ensure efficient data retrieval and management.
The seventh step follows the development phase, where thorough testing and validation of the FDR functionalities are performed. This involves conducting several tests to ensure that the repository functions as intended, including dataset uploading and downloading, metadata management, searchability, access control, versioning, and collaboration features.
Once the FDR is tested and validated, access is granted to the intended users. User roles and access rights are defined, allowing authorized partners to securely access and collaborate on the datasets stored in the repository. Proper access controls and authentication mechanisms are implemented to ensure data security and privacy. Once access is granted to the users, the final step involves monitoring and tracking user activity within the FDR. This entails implementing logging and auditing mechanisms to record user interactions, including dataset access, downloads, and modifications. By monitoring access, any unauthorized activities can be detected, thereby ensuring the security and integrity of the data.
By following the aforementioned methodological steps, as also illustrated in Fig. 35.1, the design and development of the FDR can be carried out effectively and ensure the availability and accessibility of the relevant datasets for analysis and collaboration.
Implementation Principles and Decision Tree
This section describes the implementation principles that guided the design and development path up to the final release of the FDR. For several of the following steps, multiple candidate approaches existed and were evaluated against the specific needs of the project.
Identification and Collection of Datasets
The outcome of successful data identification and collection sets the foundation for FDR design. It directly influences the choice of storage solutions. For example, if data sources are distributed across various locations or systems, alternatives such as distributed databases or cloud-based storage can be explored to ensure efficient data access and retrieval. Moreover, with a vast volume of data, deduplication strategies might be implemented to optimize storage space, removing redundancy in datasets. Another alternative to consider is federated databases, allowing data to remain at its source but still be accessible through the FDR. These choices align the design with the specific needs and characteristics of the collected datasets (Fig. 35.2).
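A content-addressed store is one simple way to realize the deduplication strategy mentioned above. The sketch below, whose class and method names are illustrative rather than part of the FDR, keys each payload by its SHA-256 digest so that identical datasets are stored only once.

```python
import hashlib

# Content-addressed deduplication sketch: identical payloads are stored
# once, keyed by their SHA-256 digest. A real repository would hash file
# streams in chunks rather than hold payloads in memory.

class DedupStore:
    def __init__(self):
        self._blobs = {}   # digest -> payload (stored once)
        self._index = {}   # dataset name -> digest

    def put(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(digest, payload)  # no-op if already stored
        self._index[name] = digest
        return digest

    def get(self, name):
        return self._blobs[self._index[name]]

    def unique_blobs(self):
        return len(self._blobs)
```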
Data Assessment
Thorough data assessment informs the FDR’s design in several ways. High-quality, well-structured data may require less extensive preprocessing, affecting decisions regarding data cleaning and normalization. In cases where data quality is lower, the FDR’s design should incorporate advanced data cleaning algorithms to handle data imperfections and outliers effectively. The alternatives may involve developing custom data cleaning routines that align with the unique characteristics of the data or using specialized data transformation techniques to enhance the data’s readiness for AI algorithms.
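A custom cleaning routine of the kind described might, for instance, filter extreme values statistically. The sketch below drops values more than `k` standard deviations from the mean; the default threshold is an arbitrary example, not a project setting.

```python
import statistics

# Illustrative outlier-removal routine: drop values more than k standard
# deviations from the mean. The default k=3.0 is an arbitrary assumption.

def remove_outliers(values, k=3.0):
    if len(values) < 2:
        return list(values)
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return list(values)  # constant data has no outliers
    return [v for v in values if abs(v - mean) <= k * stdev]
```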
Preparation
The preparation of data is critical for seamless integration into the FDR. Cleaned, normalized, and AI-ready datasets expedite processing and analysis. The design must ensure that the data is compatible with the repository’s structure. For datasets arriving in various formats, the FDR might include format transformation modules to standardize data. Alternatives involve focusing on a specific data format to minimize format transformation, but this may require stricter data source requirements or additional data preparation at the source. The choice between these alternatives depends on the extent of data format variation and the trade-offs between standardization and source data flexibility (Fig. 35.3).
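A format transformation module could, for example, normalize incoming CSV payloads into a single JSON representation used inside the repository. The following is a minimal sketch using only standard-library facilities; the choice of JSON as the internal target format is an assumption for illustration.

```python
import csv
import io
import json

# Hypothetical format-transformation module: convert a CSV payload into a
# JSON array of records so downstream components see one uniform format.

def csv_to_json(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, sort_keys=True)
```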
Definition of the Purpose of AI Algorithms
Clarity in articulating the objectives of AI algorithms shapes the design of the FDR. Clear objectives facilitate data categorization and indexing, allowing for efficient data retrieval and usage. Alternatives might include advanced search functionalities if the use cases are complex or subject to frequent changes in algorithm objectives. In cases where algorithms serve multiple, distinct purposes, the FDR design may emphasize robust tagging and metadata management, enabling flexible search and retrieval based on different criteria (Fig. 35.4).
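Tag-based retrieval of the kind described can be sketched as follows; the catalogue entries and tag vocabulary are invented for illustration and do not reflect actual FDR holdings.

```python
# Sketch of tag-driven retrieval: each dataset carries a metadata record,
# and a query returns datasets whose tags include all requested terms.
# The catalogue below is illustrative, not real FDR content.

CATALOG = [
    {"name": "eo_maritime_2023", "tags": {"electro-optical", "marine"}},
    {"name": "radio_signals_v2", "tags": {"radio", "signals"}},
    {"name": "eo_airborne_2022", "tags": {"electro-optical", "airborne"}},
]

def find_datasets(required_tags, catalog=CATALOG):
    required = set(required_tags)
    return [d["name"] for d in catalog if required <= d["tags"]]
```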
Datasets Storage
The secure and scalable storage of data is crucial, especially for defence applications dealing with substantial data volumes. The FDR’s design shall account for storage that can accommodate growth. For enhanced data security, alternatives include encrypted storage solutions. Integrating cloud storage might also be considered, and a multi-cloud approach can provide redundancy and improved data availability. The choice between on-premises and cloud-based storage is a critical decision, and it influences both the repository’s architecture and its scalability (Fig. 35.5).
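Independent of whether storage is on-premises or cloud-based, integrity verification is one ingredient of secure storage. A minimal checksum-based sketch might look as follows (encryption itself would typically rely on a dedicated library and is omitted here); the function names and in-memory store are illustrative.

```python
import hashlib

# Integrity sketch: record a SHA-256 digest when a dataset is stored and
# verify it on retrieval, so silent corruption is detected regardless of
# the underlying storage backend. The dict-based store is a stand-in.

def store_with_checksum(store, name, payload):
    store[name] = (payload, hashlib.sha256(payload).hexdigest())

def retrieve_verified(store, name):
    payload, digest = store[name]
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("integrity check failed for " + name)
    return payload
```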
Design and Development of FDR
The design options for the FDR components are directly shaped by the outcomes of the previous steps, which are summarized here. For data-intensive tasks, such as AI algorithms that require substantial computational power, the FDR might opt for an asynchronous processing unit (APU) with a queue implementation to efficiently manage processor- and storage-intensive tasks. When ease of data access is a priority, REST API services might be preferred for communication between the various components of the FDR; access to these services is reserved for FDR components, and no external access is allowed. To support user-friendliness, a web interface can be considered for time-efficient dataset management and administrative tasks. The web interface should show each user which datasets they have created (or have access to) and allow them to manage the corresponding access rights; it should also allow designated users to manage access to the FDR itself.

Additionally, a client module with command-line utilities could be developed to cater to specific user needs for uploading, downloading, and managing datasets. The client module can simplify interaction with the FDR, making its adoption quicker and easier. The push/pull paradigm familiar from Git repositories is also adopted here, since technical users are already accustomed to it. In this regard, versioning is mandatory to ensure traceability and to understand the changes that have been made to a dataset. Retrieving datasets is only possible through this client application, thereby restricting access to the datasets. The client application can also be distributed through secure means, and its usage could have an expiry date, thereby deactivating old versions. Security is an important factor; therefore, restricted access, token sharing, and expiry should also be implemented. For the purposes of the FDR, a cloud-based solution, rather than an on-premises deployment, was selected for storing the datasets.
The FDR could also support other storage options depending on the application domain. Keeping these in mind, the preliminary design review of the foreseen FDR is illustrated in Fig. 35.6.
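The push/pull client behaviour and mandatory versioning described above can be sketched as a small content-addressed version store; the class and method names below are illustrative and do not reflect the actual FDR client API.

```python
import hashlib

# Minimal sketch of a Git-like push/pull client. Versions are
# content-addressed: each push that changes the payload appends a new
# version entry; pushing an unchanged payload is a no-op.

class FdrClient:
    def __init__(self):
        self._versions = {}  # dataset name -> list of (digest, payload)

    def push(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        history = self._versions.setdefault(name, [])
        if not history or history[-1][0] != digest:
            history.append((digest, payload))
        return digest

    def pull(self, name, version=-1):
        return self._versions[name][version][1]

    def log(self, name):
        return [digest for digest, _ in self._versions[name]]
```

A user would thus push a dataset, collaborate, and later pull either the latest version or any earlier one from the recorded history.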
Testing and Validation of the FDR Functionalities
The outcomes of testing and validation directly guide design refinements. If performance issues are identified during testing, alternatives can be explored. For example, optimizing database indexing can enhance data retrieval speed, or implementing caching mechanisms for frequently accessed data can improve response times. Thorough testing often uncovers additional requirements, driving the selection of design alternatives that align the FDR's capabilities with the performance and functionality demands of the users (Fig. 35.7).
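As an example of the caching alternative, a memoized metadata lookup can serve repeated requests from memory. In the sketch below, `fetch_metadata` is a stand-in for a real database query, and the call counter exists only to demonstrate that the backend is hit once per distinct dataset.

```python
from functools import lru_cache

# Caching sketch: memoize an expensive metadata lookup so repeated
# requests for the same dataset are served from memory. fetch_metadata
# is a hypothetical stand-in for a real database query.

CALLS = {"count": 0}

@lru_cache(maxsize=128)
def fetch_metadata(dataset_name):
    CALLS["count"] += 1  # count backend hits for demonstration
    return "metadata for " + dataset_name

fetch_metadata("eo_maritime_2023")
fetch_metadata("eo_maritime_2023")  # cache hit: backend queried only once
```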
Granting Access
Access control requirements, driven by the specific needs of project partners, significantly influence the design decisions related to user roles, permissions, and access to the repository. Alternatives include the implementation of role-based access control (RBAC) for fine-grained permissions, enabling granular control over who can access and manipulate specific datasets. Additionally, employing single sign-on (SSO) solutions can streamline partner access by allowing users to access the FDR using their existing credentials. The choice of access control mechanisms directly impacts FDR design for user management, ensuring the protection and controlled access of sensitive data.
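A minimal RBAC check of the kind mentioned can be sketched as follows; the role names, permission sets, and users are illustrative assumptions, not the project's actual access policy.

```python
# RBAC sketch: roles map to permission sets, and an access check consults
# the user's role. All names below are illustrative assumptions.

ROLE_PERMISSIONS = {
    "admin":   {"read", "write", "delete", "grant"},
    "partner": {"read", "write"},
    "viewer":  {"read"},
}

USER_ROLES = {"alice": "admin", "bob": "viewer"}

def is_allowed(user, permission):
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Unknown users fall through to an empty permission set, so access is denied by default.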
Access Monitoring
The implementation of real-time alerts and notifications for access monitoring directly enhances data security and accountability. Alternative approaches may include the utilization of machine learning-based anomaly detection algorithms to identify suspicious access patterns and potential security breaches. Another alternative is the implementation of blockchain-based access auditing, which provides an immutable ledger of data access events, offering enhanced data security and integrity. The choice of the most appropriate alternative should align with the level of access control and monitoring required for the project’s security and compliance needs.
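A simple rate-based variant of such monitoring flags users whose download volume within a time window exceeds a threshold. The threshold and log format below are assumptions for illustration; a production system would use learned per-user baselines rather than a fixed cut-off.

```python
from collections import Counter

# Rate-based access-monitoring sketch: flag any user whose download count
# within one time window exceeds a threshold. The threshold of 100 is an
# arbitrary example, not a project setting.

def flag_suspicious(access_log, threshold=100):
    """access_log: iterable of (user, action) tuples within one window."""
    downloads = Counter(user for user, action in access_log
                        if action == "download")
    return sorted(u for u, n in downloads.items() if n > threshold)
```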
Conclusions
This chapter details a systematic process that encompasses several key methodological steps. It begins with the identification and collection of diverse datasets from varied sources, spanning non-cooperative and cooperative tracking data, electro-optical data, radio signals, and more. These datasets are meticulously assessed to ensure their quality and relevance, aligning them with specific user requirements and use cases. Subsequently, data preparation measures are implemented, including data cleaning, normalization, and format standardization, rendering the datasets compatible with AI algorithms. To achieve the highest level of efficacy, the repository is designed with several critical characteristics. Security remains paramount, with robust access controls, user authentication, and role-based permissions to protect sensitive defence-related data. Metadata management and searchability ensure efficient dataset organization and retrieval. The FDR also supports versioning, format transformation, and long-term preservation to accommodate evolving AI needs and guarantee data integrity. Moreover, the chapter underscores the significance of monitoring data access within the repository. This monitoring mechanism enables the timely detection of unauthorized or unusual access patterns, reinforcing data security and user accountability.
In summary, this chapter provides insights into the methodological approach underpinning the creation of an AI dataset repository tailored for defence applications. The FDR’s design principles prioritize security, accessibility, and scalability, making it an invaluable asset for frugal and robust AI in the defence sector. Furthermore, the aforementioned methodology outlined in this study offers a versatile blueprint that extends beyond its immediate application, holding significant promise for similar use cases within the broader AI domain. Its systematic approach to data handling, encompassing identification, assessment, preparation, storage, access control, and monitoring, can be readily adapted to a multitude of AI applications. As the field of AI continues to evolve and confront new challenges, the adaptability of this methodology positions it as a foundation for building AI repository solutions that can flexibly respond to the ever-changing landscape of data and technology. It stands as a testament to the potential for sustainable and effective data management in the AI domain, enabling not only the optimization of current applications but also the preparation for future AI endeavours and their evolving data needs.
Acknowledgements
This project received funding from the European Defence Fund programme under grant agreement no. 101103386. The views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
Cite this chapter
Kampas, G. et al. (2025). Methodological Approach for Designing an Artificial Intelligence Repository for Defence Applications. In: Gkotsis, I., Kavallieros, D., Stoianov, N., Vrochidis, S., Diagourtas, D., Akhgar, B. (eds) Paradigms on Technology Development for Security Practitioners. Security Informatics and Law Enforcement. Springer, Cham. https://doi.org/10.1007/978-3-031-62083-6_35
DOI: https://doi.org/10.1007/978-3-031-62083-6_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62082-9
Online ISBN: 978-3-031-62083-6
eBook Packages: Physics and Astronomy (R0)