Expressing and Applying C++ Code Transformations for the HDF5 API Through a DSL
Hierarchical Data Format (HDF5) is a popular binary storage solution in high performance computing (HPC) and other scientific fields. It has bindings for many popular programming languages, including C++, which is widely used in the HPC field. Its C++ API requires mapping of the native C++ data types to types native to the HDF5 API. This task can be error prone, especially when working with complex data structures, which are usually stored using HDF5 compound data types. Due to the lack of a comprehensive reflection mechanism in C++, the mapping code for data manipulation has to be hand-written for each compound type separately. This approach is vulnerable to bugs and mistakes, which can be eliminated by using an automated code generation phase. In this paper we present an approach implemented in the LARA language and supported by the tool Clava, which allows us to automate the generation of the HDF5 data access code for complex data structures in C++.
Keywords: HDF5, Domain-specific language, LARA, Source-to-source, Aspect-oriented, Clava, Code generation
1 Introduction
Source-to-source transformation is a process in which program source code is automatically created or updated according to a given set of inputs. It can be used for various tasks, such as low-level optimization for a given target platform, templating, integration and more. In this paper, we demonstrate how to overcome the lack of compile-time reflection in C++ by applying user-defined transformations written in an aspect-oriented domain-specific language.
Reflection is the ability of a computer program to examine and/or modify its own structure. It usually provides information about the type of a given object, its inheritance hierarchy, its attributes and more, and can even be used to manipulate the code itself at run time. This ability is available in many interpreted or managed languages, such as Python, Ruby, Lua, Java or C#, usually thanks to the underlying virtual machine or interpreter.
At the time of writing, the current version of the C++ language is C++17. Its features have recently been added to commonly used compilers such as GNU GCC and the Intel C++ Compiler. However, reflection is not yet among these features, though several proposals have recently been published. Given the pace at which new standards are implemented in mainline compilers and adopted by programmers, C++ cannot be said to have comprehensive support for compile-time reflection yet. We briefly mention several alternative tools and approaches to this problem in Sect. 2.
One of the common use cases for reflection is mapping an object to a persistent data structure, where individual attributes of the object are examined and stored in a proper way. In the use case presented in this paper, we store a complex data structure representing a traffic navigation routing index in an HDF5-based binary file. Without proper reflection, we have to manually create the code that maps the C++ structures to the objects in the HDF5 file. This code implements several time-consuming data processing tasks that are executed on an HPC cluster, which places severe constraints on the robustness of the code and the entire process.
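As a rough illustration of why hand-written mapping is fragile, the following sketch describes a small record member by member. The record EdgeRecord, the MemberDesc type and the type-name strings are hypothetical stand-ins chosen so that the example compiles without the HDF5 library. With the real C++ API, the same name, offset and type triples would be passed to H5::CompType::insertMember, typically with the HOFFSET macro.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record; the fields are illustrative, not the authors' schema.
struct EdgeRecord {
    int   source;  // index of the start vertex
    int   target;  // index of the end vertex
    float length;  // segment length
};

// Library-free stand-in for one member of a compound type description.
struct MemberDesc {
    std::string name;      // member name as stored in the file
    std::size_t offset;    // byte offset of the member inside the record
    std::string hdf5Type;  // name of the matching HDF5 predefined type
};

// Hand-written mapping: every field is repeated by name, offset and type.
// Adding, renaming or retyping a field of EdgeRecord invalidates this
// function unless somebody remembers to update it as well.
inline std::vector<MemberDesc> describeEdgeRecord() {
    return {
        {"source", offsetof(EdgeRecord, source), "NATIVE_INT"},
        {"target", offsetof(EdgeRecord, target), "NATIVE_INT"},
        {"length", offsetof(EdgeRecord, length), "NATIVE_FLOAT"},
    };
}
```

Nothing in the language ties describeEdgeRecord to the definition of EdgeRecord, which is the gap that generated code is meant to close.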
In this paper we present a method, based on the LARA language, for automatic generation of the mapping code. Section 3 explains the routing index and its HDF5-based storage. Section 4 presents the LARA language and its toolset, which can be used to define the desired code transformations in a robust and flexible way. Section 5 shows a concrete application of the approach on our data processing code and its integration in our build process.
2 Related Work
There are several approaches to reflection in the C++ language. One of them is through extensive use of macros to annotate individual classes and attributes, a solution that is popular for example among game engines. Its pitfalls are the inability to use reflection on non-modifiable code (e.g., third-party libraries) and its reliance on uncommon language constructs. A similar approach can be implemented using templates, at the cost of increased code complexity, longer compilation times and higher maintenance effort.
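A minimal sketch of the macro-based style is given below, using the so-called X-macro idiom. The record Waypoint and all macro names are hypothetical and do not come from any particular engine. The field list is written once and expanded twice: first into member declarations, then into a list of field descriptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The field list of a hypothetical record, written once as an X-macro.
#define WAYPOINT_FIELDS(X) \
    X(int, id)             \
    X(double, lat)         \
    X(double, lon)

// First expansion: declare the members of the record.
struct Waypoint {
#define DECLARE_FIELD(type, name) type name;
    WAYPOINT_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Second expansion: collect a "type name" string for every field.
inline std::vector<std::string> waypointFields() {
    std::vector<std::string> out;
#define PUSH_FIELD(type, name) out.push_back(#type " " #name);
    WAYPOINT_FIELDS(PUSH_FIELD)
#undef PUSH_FIELD
    return out;
}
```

The sketch also shows both pitfalls: the record has to be declared through the macro, so code that cannot be modified is out of reach, and the definition no longer reads like ordinary C++.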
Another approach is based on external tools which parse source code and have a certain knowledge of the code structure, such as the Meta-Object Compiler (moc), which is part of the Qt GUI framework. For annotated C++ classes, this tool produces source code that adds support for accessing run-time information and a dynamic property system [5, 13]. However, it provides only a fixed feature set intended for development in the Qt framework.
Domain-specific languages (DSLs) such as LARA can provide the desired level of flexibility and robustness for our purposes. The LARA language has been inspired by AOP approaches, including AspectJ and AspectC++. AspectJ extends Java in order to provide better modularity for Java programs, and has very mature tool support. AspectJ join points are limited to object-oriented concepts, such as classes, method calls and fields, and several works try to complement AspectJ. AspectC++ is an AOP extension to the C++ programming language inspired by AspectJ and uses similar concepts, adapted to C++.
In traditional AOP approaches, aspects usually define behavior which is executed during runtime, at the specified join points. LARA differs from traditional AOP in that it uses aspects to describe source code analysis and transformations, which currently are executed statically, at compile time. Due to this difference in approach, tools like AspectJ and AspectC++ usually do not consider join points which are common in LARA, such as local variables, statements, loops, and conditional constructs.
There are several term rewriting-inspired approaches for code analysis and transformation, such as Stratego/XT and Rascal, which require the user to provide a complete grammar for the target language. On the other hand, LARA promotes the usage of existing compiler frameworks (e.g. Clang in the case of this work) for parsing, analysis and transformations. Another distinct feature of LARA is that weavers can be built in an incremental fashion, adding join points, attributes and actions as needed (see Sect. 4).
3 Hierarchical Data Format for Routing Index
Binary formats offer efficient and fast data storage. However, custom implementations can be cumbersome and fragile, especially in multi-platform environments. The Hierarchical Data Format (HDF) provides a binary storage format implementation for storing large volumes of complex data. It was developed mainly for storing scientific data, but it has since been adopted in many other industries. One of its main advantages is that it allows easy and consistent sharing of binary data across various platforms and environments. There are two main versions of the HDF format; in this paper, we refer exclusively to HDF5. HDF5 implements a storage model which resembles a standard file system hierarchy, with a tree of folders and files. The basic HDF5 file objects are Groups, Datasets and Attributes. Groups can hold datasets and other groups; both groups and datasets can have attributes associated with them. Each HDF5 file has one root group. Datasets are used for the actual storage of multi-dimensional data of a given type.
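The storage model can be pictured with a small in-memory sketch. The types Group and Dataset and the function findDataset below are illustrative only and are not part of the HDF5 library; they merely mirror the folder-and-file analogy, with datasets addressed by an absolute path from the root group.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative model only; none of these types belong to the HDF5 library.
using Attributes = std::map<std::string, std::string>;

struct Dataset {
    std::string name;
    std::vector<std::size_t> dims;  // extent of each dimension
    std::string elementType;        // e.g. the name of a compound type
    Attributes attributes;
};

struct Group {
    std::string name;
    std::vector<Group> groups;      // child groups, like folders
    std::vector<Dataset> datasets;  // leaf objects, like files
    Attributes attributes;
};

// Resolves an absolute path such as "/part_0/Edges".
// Returns nullptr if any path component is missing.
inline const Dataset* findDataset(const Group& root, const std::string& path) {
    if (path.empty() || path[0] != '/') return nullptr;
    const Group* current = &root;
    std::size_t pos = 1;  // skip the leading '/'
    while (true) {
        const std::size_t next = path.find('/', pos);
        if (next == std::string::npos) {
            // Last component: it must name a dataset of the current group.
            const std::string name = path.substr(pos);
            for (const Dataset& d : current->datasets) {
                if (d.name == name) return &d;
            }
            return nullptr;
        }
        // Intermediate component: descend into the matching child group.
        const std::string name = path.substr(pos, next - pos);
        const Group* child = nullptr;
        for (const Group& g : current->groups) {
            if (g.name == name) child = &g;
        }
        if (child == nullptr) return nullptr;
        current = child;
        pos = next + 1;
    }
}
```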
3.1 Routing Use Case
HDF5 provides APIs for a large number of major programming languages such as Python, C/C++, Java or even CLI .NET. Our codebase is written mainly in C++, hence we refer to the native HDF5 C++ API in this paper. In our graph representation of the traffic navigation routing index, individual road segments, junctions and other elements of a road network are represented by a set of vertices and oriented edges. Each edge has a number of associated parameters, such as length, maximum allowed speed or category. The graph representation of the road network of a single country such as the Czech Republic can have millions of vertices and edges. The vertices and edges in the HDF5 file are divided into subsets (graph parts) which reside in their corresponding groups. The mapping of the graph parts to the individual vertices is stored in the NodeMap dataset located in the root group of the file. The parts can be determined either by geography or by other topological properties of the graph. Each graph part group then contains the Edges, EdgeData and Nodes datasets. All datasets in our case are two-dimensional, where rows hold individual records and columns hold their attributes. References to records in other datasets are represented by storing an index of the referenced record rather than by using the native HDF5 reference mechanism.
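The index-based references can be sketched in memory as follows. The dataset names Edges, EdgeData and Nodes come from the text; the column layout of every row type, as well as the helper functions, are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All row layouts are hypothetical; the text names the datasets but not
// their columns.
struct NodeRow     { double lat; double lon; };
struct EdgeDataRow { float length; std::uint8_t maxSpeed; std::uint8_t category; };
struct EdgeRow {
    std::uint32_t source;     // row index into the Nodes dataset
    std::uint32_t target;     // row index into the Nodes dataset
    std::uint32_t dataIndex;  // row index into the EdgeData dataset
};

// One graph part, i.e. the contents of one group in the file.
struct GraphPart {
    std::vector<NodeRow> nodes;
    std::vector<EdgeRow> edges;
    std::vector<EdgeDataRow> edgeData;
};

// Follows an index-based reference from an edge to its parameters.
inline const EdgeDataRow& dataOf(const GraphPart& part, const EdgeRow& edge) {
    return part.edgeData.at(edge.dataIndex);  // at() throws on a dangling index
}

// Sums the length of a route given as a sequence of edge row indices.
inline float routeLength(const GraphPart& part,
                         const std::vector<std::size_t>& route) {
    float total = 0.0f;
    for (std::size_t edgeIndex : route) {
        total += dataOf(part, part.edges.at(edgeIndex)).length;
    }
    return total;
}
```

Plain indices keep the records fixed-size and trivially copyable, at the price of bounds checks that the application has to perform itself.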
4 C++ Code Manipulation
Figure 2 presents LARA code which adds include directives to a file, using a join point action. Line 1 declares an aspect, the top-level unit in LARA (which is similar to a function). Line 2 declares the inputs of the aspect, which in this case is a file join point. By convention, names of variables that represent join points are prefixed with a dollar sign ($) in LARA. Line 4 uses a select to capture all the class definitions that appear in the current program. Lines 5–8 represent an apply block that performs some work over the join points captured in the previous select. In this case, it executes a file action ($targetFile.exec) that adds an include directive to the file, corresponding to the file that belongs to the given join point (addIncludeJp($class)). This example shows a common pattern in LARA, which is to select some points in the source code and then act over them, possibly modifying the source code.
5 Use Case
In this section we present and explain the LARA code developed to automatically generate type mapping functions from classes and structs (henceforth referred to as records) present in the source code. The presented version generates a new class for each record found in the code, and this class has a single static method that returns a CompType object, which can then be passed to the HDF5 API calls when that particular record is accessed. In the example in Fig. 4, for demonstration purposes, we include the code in the same file as the original record. However, in the code presented in this section we create new files for the generated code, to avoid adding a dependency on HDF5 in every source file that wants to use the record (note that both cases can be expressed in LARA). Currently, the type-mapping code is generated for all classes and structures in the given source files, but the code can be easily adapted to filter unwanted records (e.g., by providing a list of class/struct names, or files).
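To give an idea of the kind of text such a generation step has to produce, the sketch below emits the source of one mapping function from a list of fields. It is written in C++ rather than LARA and is not the authors' generator. The emitted class suffix Type, the method name GetCompType and the helper emitCompTypeFunction are hypothetical; the emitted calls (the H5::CompType size constructor, insertMember and HOFFSET) are part of the HDF5 C++ API.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One field of a record, as a code generator would see it.
struct FieldInfo {
    std::string name;      // member name in the record
    std::string hdf5Type;  // e.g. "H5::PredType::NATIVE_INT"
};

// Emits the source text of a mapping function for one record.
// The emitted names ("Type" suffix, GetCompType) are hypothetical.
inline std::string emitCompTypeFunction(const std::string& record,
                                        const std::vector<FieldInfo>& fields) {
    std::ostringstream out;
    out << "H5::CompType " << record << "Type::GetCompType() {\n";
    out << "    H5::CompType t(sizeof(" << record << "));\n";
    for (const FieldInfo& f : fields) {
        // One insertMember call per field: name, byte offset, HDF5 type.
        out << "    t.insertMember(\"" << f.name << "\", HOFFSET(" << record
            << ", " << f.name << "), " << f.hdf5Type << ");\n";
    }
    out << "    return t;\n}\n";
    return out.str();
}
```

Because the field list is derived from the parsed record rather than typed by hand, the emitted function cannot drift away from the record definition.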
5.1 LARA for HDF5
Figure 6 shows the code for a working version of the Hdf5Types aspect. As input, it receives a path to the base destination folder of the generated code, and a namespace for the generated functions, with optional default values for the inputs (line 2).
Lines 5–6 use a factory provided by Clava (i.e., AstFactory) that allows the creation of new AST nodes, which can then be inserted in the code tree. The AstFactory always returns join points, which can be handled in the same way as the join points created by select statements. In this case, two join points of type file are created, one for the header file (CompType.h) and another for the implementation file (CompType.cpp).
Lines 8–11 select the program join point and add the newly created files with the action addFile. Line 15 selects all the records in the source code that are either of kind class or of kind struct, which are then iterated over in the apply block in lines 16–27. This block creates the declarations for the header and the implementation file using the code definitions in Fig. 7 (lines 20 and 25, respectively). It also adds to the implementation file an include directive for the current record (line 23), creates the code for the body of the implementation function by calling the aspect RecordToHdf5 (line 24) and inserts the code of the function in the implementation file (line 26).
The Clang compiler has a very rich AST with detailed information, not only about the source code itself, but also about the types used in the code, which are also represented as an AST. Clava takes advantage of this information and gives access to this AST for types by providing a join point type, which can be accessed from any join point using the attribute type.
Next, there are several special cases which need to be handled. For instance, C++ enumerations can customize the underlying integer type. If the type is an enumeration, the function is called recursively for the integer type of the enumeration (lines 10–12). Another example is the case of vector types, which appear in the AST as a TemplateSpecializationType (i.e., any type template that has been specialized, such as vector<int>). In this case, the function is also called recursively, this time for the specialization type.
After handling the special cases, the function uses the attribute code to obtain the code representation of the type and consults the table HDF5Types, which maps C/C++ types to the corresponding HDF5 types.
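The resolution logic described above can be approximated by the following C++ sketch. It is not the LARA code of the aspect: TypeNode is a hypothetical stand-in for the type join point, the contents of the lookup table are assumed, and the "specialization type" of a vector is interpreted here as its template argument. The HDF5 names used as table values are predefined types of the HDF5 C++ API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

enum class TypeKind { Builtin, Enum, TemplateSpecialization };

// Hypothetical stand-in for a type join point.
struct TypeNode {
    TypeKind kind;
    std::string code;                 // textual form of the type, e.g. "int"
    std::shared_ptr<TypeNode> inner;  // underlying or argument type, if any
};

// Returns the HDF5 type expression for a C++ type, or an empty string
// if the type is not covered by the table.
inline std::string toHdf5(const TypeNode& type) {
    // Special cases: recurse into the underlying integer type of an
    // enumeration or into the argument of a template specialization.
    if (type.kind == TypeKind::Enum ||
        type.kind == TypeKind::TemplateSpecialization) {
        assert(type.inner != nullptr);
        return toHdf5(*type.inner);
    }
    // Assumed table contents; the authors' HDF5Types table may differ.
    static const std::map<std::string, std::string> table = {
        {"char", "H5::PredType::NATIVE_CHAR"},
        {"int", "H5::PredType::NATIVE_INT"},
        {"unsigned int", "H5::PredType::NATIVE_UINT"},
        {"long", "H5::PredType::NATIVE_LONG"},
        {"float", "H5::PredType::NATIVE_FLOAT"},
        {"double", "H5::PredType::NATIVE_DOUBLE"},
    };
    const auto it = table.find(type.code);
    return it == table.end() ? std::string() : it->second;
}
```

How the element type of a vector is subsequently wrapped for storage is outside the scope of this sketch.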
5.2 CMake Integration
Code metrics for the use case.
6 Conclusion
In this paper, we presented a possible solution to the missing support for compile-time reflection in C++. Our solution is based on the domain-specific language LARA, which is used to write source-to-source transformations, and the tool Clava, which executes the LARA code over C/C++ programs. We have demonstrated its usage by generating a native C++ API for the HDF5 library, without modifications to the original source code. The generated code is used to store a traffic navigation routing index for processing on HPC infrastructure. Our use case is complex both in terms of structural complexity and data volume, and we needed to implement a robust and flexible approach to generate the data access code and integrate it into our build process. In Sect. 5.2 we introduced a basic approach for integration of the code generation process in CMake, by using custom build commands. The Clava tool is called during the build configuration to produce the type mapping code between C++ and the HDF5 API.
Ongoing work includes adding support for custom compound types (e.g., fields that are user-defined classes/structs), as well as LARA and Clava support for custom #pragma constructs in the code, which can be used to mark arbitrary blocks of code to be processed by the LARA aspects. This approach can be used to apply a large number of custom optimizations (e.g., in the context of HPC systems) or to generate a concrete implementation of the data access layer on top of an existing abstract data storage library.
Acknowledgments. This work has been partially funded by ANTAREX, a project supported by the EU H2020 FET-HPC program under grant 671623, by The Ministry of Education, Youth and Sports of the Czech Republic from the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science - LQ1602” and by SGS grant No. SP2017/177 “Optimization of machine learning algorithms for the HPC platform”, VŠB-Technical University of Ostrava, Czech Republic.
References
- 1. Clang. clang.llvm.org. Accessed 28 Feb 2017
- 2. Unreal engine documentation. https://docs.unrealengine.com/latest/INT/Programming/UnrealArchitecture/Reference/index.html
- 3. Bispo, J.: Clava: C++ language + lara weaver and code transformer - antarex technical report v0.1. (2017)
- 5. Blanchette, J., Summerfield, M.: C++ GUI Programming with Qt 4. Prentice Hall Professional, Upper Saddle River (2006)
- 7. Cardoso, J.M.P., Carvalho, T., Coutinho, J.G.F., Luk, W., Nobre, R., Diniz, P., Petrov, Z.: Lara: an aspect-oriented programming language for embedded systems. In: Proceedings of the 11th Annual International Conference on AOP Software Development, pp. 179–190. ACM (2012)
- 8. Chochlík, M.: Implementing the factory pattern with the help of reflection. Comput. Inf. 35(3), 653–686 (2016)
- 9. Folk, M., Heber, G., Koziol, Q., Pourmal, E., Robinson, D.: An overview of the hdf5 technology suite and its applications. In: Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, pp. 36–47. ACM (2011)
- 10. Gradecki, J.D., Lesiecki, N.: Mastering AspectJ: Aspect-Oriented Programming in Java. Wiley, New York (2003)
- 11. Klint, P., Van Der Storm, T., Vinju, J.: Rascal: a domain specific language for source code analysis and manipulation. In: Source Code Analysis and Manipulation 2009, SCAM 2009, pp. 168–177. IEEE (2009)
- 12. Pinto, P., Carvalho, T., Bispo, J., Cardoso, J.M.: Lara as a language-independent aspect-oriented programming approach. In: Proceedings of the 32th Annual ACM Symposium on Applied Computing. ACM (2017, to appear)
- 13. Qt: Qt documentation. http://doc.qt.io/qt-5/why-moc.html. Accessed Feb 2017
- 14. Spinczyk, O., Gal, A., Schröder-Preikschat, W.: Aspectc++: an aspect-oriented extension to the c++ programming language. In: Proceedings of the 14th International Conference on Tools Pacific, CRPIT 2002, Darlinghurst, Australia, pp. 53–60 (2002)