1 Introduction

Nowadays, the online availability of an increasingly large amount of source code is dramatically changing the way programmers approach the development of large software systems. The possibility of reusing existing code allows developers to focus on the added value of their products, speed up the development process and easily explore new possibilities and solutions, while keeping high-quality components at the foundations of the system. Several studies have been conducted to understand the general attitude of developers towards the comprehension of large code bases. For instance, in [1] the questions asked by programmers during software evolution tasks are analyzed and classified into 44 different categories; the study also highlights the lack of specific methods to answer these questions. Hence, in this paper we introduce CodeOntology, a resource aimed at supporting the adoption of Semantic Web technologies in the domain of software development and software engineering. The project has been conceived as an approach that leverages the Semantic Web technology stack and the impressive amount of code available online to extract structured information from source code, publish it on the Web in the form of Linked Open Data, and enable the execution of highly expressive queries over source code by means of a powerful language such as SPARQL. Thus, CodeOntology is of particular interest not only to the Semantic Web community, but also to software developers and engineers.

CodeOntology consists of two contributions: an ontology that represents the domain of programming languages and a parser that processes Java source code or bytecode and serializes it into RDF triples. The ontology is mainly focused on the Java programming language, but it has been designed with flexibility in mind, so that it can be reused to represent other languages. The parser, in turn, extracts structural information common to all object-oriented programming languages, such as class hierarchies, methods and constructors. Optionally, it can also serialize all statements and expressions into RDF triples, thereby providing a complete RDF-ization of source code. We then apply semantic techniques such as Named Entity Disambiguation to analyze the comments available within the code and link entities extracted from source code to specific DBpedia [2] resources. This way, it is possible to take advantage of SPARQL to run semantic queries over source code for different purposes, including computer-aided programming, static code analysis, component search and reuse, and question answering over source code.

2 Related Work

Querying source code is a critical task in software engineering. Indeed, much of the research in software engineering is focused on developing tools that enhance the maintenance and understanding of extremely large and long-lived software systems. This need has underpinned the development of several source code querying systems, such as OMEGA [3] and CIA (The C Information Abstraction System) [4], which are based on the relational model. More powerful systems, such as Software Refinery [5], are based on graphs and abstract syntax trees. These systems are more sophisticated than tools based on the relational model, but they lack a well-defined query language.

More recently, the online availability of large amounts of open source code has motivated the development of tools that let programmers take advantage of this otherwise unstructured information. As an example, Sourcerer [6] collects source code from open repositories and automatically leverages structural information extracted from arbitrary Java projects. The data used by this system, however, are not published on the Web as Linked Open Data, which clearly limits their accessibility and reuse. Such limitations are partially addressed in [7], where a system that automatically generates an ontology from source code is introduced. This system does not use a single ontology to describe the entities belonging to the programming languages domain; instead, it generates a different ontology for each input project.

An approach that is more similar to CodeOntology is SCRO (Source Code Representation Ontology) [8]. However, SCRO cannot represent some features of modern object-oriented programming languages, such as parameterized types and exception handling. Furthermore, the project lacks a system to serialize source code into RDF triples. CodeOntology, unlike other state-of-the-art systems, makes full use of the resources made available by the Semantic Web technology stack. Data are extracted from source code according to an appropriate ontology and published using the RDF data model. The collected information can then be queried by means of a highly expressive language like SPARQL. Furthermore, CodeOntology can analyze documentation comments and link entities extracted from source code to semantically related DBpedia resources.

3 The Ontology

The ontology is written in OWL 2 and has been designed using the Protégé tool [9], according to the design principles of clarity, coherence, extensibility, minimum encoding bias and minimum ontological commitment introduced in [10]. It consists of 65 classes, 86 object properties and 11 data properties, and it has been checked for satisfiability, incoherence and inconsistencies using the HermiT reasoner [11]. The modelling process underlying the creation of the ontology has been guided by common competency questions that usually arise during the software development process, and has been inspired by a re-engineering of the Java abstract syntax, as specified in [12]. However, the ontology is general enough to be extended to meet future requirements; for instance, it can be reused to better represent other programming languages besides Java. The IRI associated with the ontology is http://rdf.webofcode.org/woc/, abbreviated as woc. The ontology represents structural entities common to all object-oriented programming languages, such as classes, methods, variables, statements and expressions, in a hierarchy of disjoint classes. The root of this hierarchy is the CodeElement class, which is the common superclass of all the elements extracted from source code. Since the parser can also serialize the structure of Java projects into RDF triples and analyze libraries such as JAR files, two further classes, namely Project and Library, have been defined to represent these entities.
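As a rough illustration of this hierarchy, the following Apache Jena sketch declares a tiny fragment of it. The classes CodeElement, Project and Library are taken from the text above, while woc:Class and woc:Method are local names assumed purely for the example:

import org.apache.jena.ontology.OntClass;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.rdf.model.ModelFactory;

public class WocHierarchySketch {
    static final String WOC = "http://rdf.webofcode.org/woc/";

    public static void main(String[] args) {
        OntModel m = ModelFactory.createOntologyModel();

        // CodeElement is the common superclass of everything extracted from code.
        OntClass codeElement = m.createClass(WOC + "CodeElement");

        // The local names below are assumptions made for this example.
        OntClass clazz  = m.createClass(WOC + "Class");
        OntClass method = m.createClass(WOC + "Method");
        codeElement.addSubClass(clazz);
        codeElement.addSubClass(method);
        clazz.addDisjointWith(method); // siblings in the hierarchy are disjoint

        // Projects and libraries are modelled by dedicated classes.
        m.createClass(WOC + "Project");
        m.createClass(WOC + "Library");

        m.write(System.out, "TURTLE");
    }
}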

The design of the ontology follows well-known Ontology Design Patterns, best practices and naming conventions. As an example, the domain of object-oriented programming languages involves large part-whole relations: a statement may be part of a method, which in turn is part of a class, which is contained in a specific package. In order to represent these partitive relations, the ontology employs a common Content Ontology Design Pattern and reuses the XKOS vocabulary [13], more precisely the terms xkos:hasPart and xkos:isPartOf. According to the Transitive Reduction pattern, only the most general property is transitive: transitivity is delegated to the XKOS vocabulary, which in turn inherits it from SKOS and DCMI Metadata Terms. We use XKOS because it can represent both partitive (part-whole) and generic (generic-specific) relations. The domain of programming languages, indeed, also includes generic relations between entities; for instance, inheritance in object-oriented programming maps to generic-specific relations between classes. CodeOntology also makes use of other common Ontology Design Patterns and best practices, such as the N-ary relation pattern and the SV (Specified Values) pattern originally introduced by the W3C SWBPD (Semantic Web Best Practices and Deployment) Working Group; these are used in the ontology to model access modifiers and primitive data types.
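The part-whole chain above can be made concrete with a short Jena sketch. Here xkos:hasPart is the property actually reused by the ontology, while the entity IRIs are invented for the example:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class PartWholeSketch {
    static final String WOC  = "http://rdf.webofcode.org/woc/";
    static final String XKOS = "http://rdf-vocabulary.ddialliance.org/xkos#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        Property hasPart = m.createProperty(XKOS, "hasPart");

        // Invented IRIs; the parser's actual naming scheme may differ.
        Resource pkg    = m.createResource(WOC + "example.app");
        Resource clazz  = m.createResource(WOC + "example.app.Main");
        Resource method = m.createResource(WOC + "example.app.Main.main");

        // A package contains a class, which in turn contains a method.
        pkg.addProperty(hasPart, clazz);
        clazz.addProperty(hasPart, method);

        m.write(System.out, "TURTLE");
    }
}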

The ontology is available at http://doi.org/10.5281/zenodo.577939, under the CC BY 4.0 license. Each entity in the ontology has been annotated by means of the rdfs:comment and rdfs:label properties. Documentation for the ontology, generated using Parrot [14], is available at http://codeontology.org.

4 The Parser

The RDF triple extraction process is managed by the parser, the module of CodeOntology that analyzes Java source code and serializes it into RDF triples. As shown in Fig. 1, the RDF serialization of a Java project proceeds in three steps: first, the project is analyzed to download all of its dependencies and load them into the class path; then, an abstract syntax tree of the source code and its dependencies is built; finally, this tree is processed to extract a set of RDF triples.

Fig. 1. The RDF serialization process.

CodeOntology currently supports both Maven and Gradle projects. When analyzing a project, the parser first inspects its structure to recognize whether it is built with Maven or Gradle, and then downloads the dependencies of the project, as sketched below. The JAR files downloaded in this step can optionally be processed and serialized into RDF triples as well.
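A minimal sketch of this check, under the assumption that the presence of pom.xml or build.gradle in the project root is what discriminates the two build systems (the real parser's logic may be more involved):

import java.nio.file.Files;
import java.nio.file.Path;

public class BuildSystemSketch {
    enum BuildSystem { MAVEN, GRADLE, UNKNOWN }

    // Detect the build system from marker files in the project root.
    static BuildSystem detect(Path projectRoot) {
        if (Files.exists(projectRoot.resolve("pom.xml"))) {
            return BuildSystem.MAVEN;
        }
        if (Files.exists(projectRoot.resolve("build.gradle"))) {
            return BuildSystem.GRADLE;
        }
        return BuildSystem.UNKNOWN;
    }
}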

Next, the parser builds the abstract syntax tree of the whole input project. This step is handled by Spoon [15], an AST-based source code analysis and transformation library that provides a Java metamodel designed to be easy to understand, query and manipulate. CodeOntology uses this library to build a model containing information about packages, classes, interfaces and methods, as well as statements, expressions, comments and so on. Spoon lets developers define processors that are launched over the abstract syntax tree. The RDF triple extraction is managed by a Spoon processor invoked for every package in the input project. From a given package, the control flow moves to the types contained in that package, such as classes and interfaces, down to the fields, constructors and methods declared within each class. CodeOntology then looks inside the body of each method, recording all the referenced types, fields, constructors, methods and variables. The RDF serialization itself is handled using Apache Jena and can optionally cover all statements and expressions as well. The parser can also keep track of unstructured information such as comments; we use TagMe [16] to analyze these comments and automatically link entities extracted from source code to pertinent DBpedia resources.
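As an illustration, a condensed processor in the spirit of the one described above might look as follows; the input path is a placeholder, and the println calls stand in for the actual triple-emission logic, which is omitted:

import spoon.Launcher;
import spoon.processing.AbstractProcessor;
import spoon.reflect.declaration.CtMethod;
import spoon.reflect.declaration.CtPackage;
import spoon.reflect.declaration.CtType;

// A stripped-down processor: it is invoked once per package
// and walks down to the types and methods it contains.
public class PackageProcessorSketch extends AbstractProcessor<CtPackage> {

    @Override
    public void process(CtPackage pkg) {
        for (CtType<?> type : pkg.getTypes()) {
            // Here the real parser would emit triples for the type
            // (name, modifiers, superclass, implemented interfaces, ...).
            System.out.println("type: " + type.getQualifiedName());
            for (CtMethod<?> method : type.getMethods()) {
                // ... and triples for each declared method.
                System.out.println("  method: " + method.getSignature());
            }
        }
    }

    public static void main(String[] args) {
        Launcher launcher = new Launcher();
        launcher.addInputResource("src/main/java"); // path is an assumption
        launcher.addProcessor(new PackageProcessorSketch());
        launcher.run();
    }
}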

Besides the processor that walks the abstract syntax tree created by Spoon, CodeOntology has three more processors. The first is used to analyze the structure of the input project and serialize it into RDF triples. The second parses comments and detects Javadoc tags, to extract useful information about parameters and method return values. The last processor analyzes JAR files, thereby enabling CodeOntology to run not only on Java source code, but also on bytecode. Given a JAR file, this processor uses Java reflection to create an abstract syntax tree that is compliant with the Java metamodel defined by Spoon; the resulting tree is then processed as described above, by means of the main Spoon processor. The parser, along with a tutorial on how to use it to extract a knowledge base from any Java project, is available on GitHub under the GPLv3 license: https://github.com/codeontology/parser.
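A simplified sketch of this bytecode path is given below: it enumerates the classes in a JAR and loads them reflectively. Mapping each loaded class onto Spoon's metamodel, as the real processor does, is omitted here:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarReflectionSketch {
    public static void main(String[] args) throws Exception {
        String path = args[0]; // path to the JAR file under analysis

        try (JarFile jar = new JarFile(path);
             URLClassLoader loader = new URLClassLoader(
                     new URL[] { new File(path).toURI().toURL() })) {

            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (!name.endsWith(".class")) continue;

                String className = name
                        .substring(0, name.length() - ".class".length())
                        .replace('/', '.');
                // (Error handling for unloadable entries is omitted.)
                Class<?> c = loader.loadClass(className);

                // Reflection exposes the structure needed to rebuild AST nodes.
                System.out.println(c.getName() + " declares "
                        + c.getDeclaredMethods().length + " methods");
            }
        }
    }
}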

5 Experiments

The parser has been successfully applied to extract a knowledge base from the OpenJDK 8 source code. These data can be queried through a remote SPARQL endpoint at http://codeontology.org/sparql. Moreover, the dataset is available at https://doi.org/10.5281/zenodo.818116 and on figshare [17]. The analysis has been conducted on about 1.5 million lines of code, retrieving a total of roughly 2.4 million RDF triples falling into four categories: structural information on source code (1.9M triples), DBpedia links (309K triples), actual source code as literals (134K triples) and literal comments (105K triples). Quality assessments have been conducted on a sample of methods and classes. Figure 2 shows a small subset of the triples produced by the parser from a simple “hello world” class. This representation makes it possible to run expressive queries over source code, some of which are shown in [18]. As an example, since the output of the parser is a graph, we can apply to software not only metrics specifically designed for software engineering tasks, but also metrics borrowed from other fields, such as Social Network Analysis. Suppose we want to rank the classes in OpenJDK to select only the most important ones, according to a specified metric. For instance, we can select the three classes that are most referenced by the methods of the other classes. When a method m references a specific class c, the parser serializes this information into a triple of the form m woc:references c. Thus, we can rank the classes in OpenJDK according to this metric and retrieve the most referenced ones by means of a simple SPARQL query, sketched below. Unsurprisingly, the most referenced class in OpenJDK is java.lang.String, followed by java.lang.Object and java.io.IOException. In order to compute such a metric efficiently, a graph-based representation of software systems is needed.
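A query in this spirit, executed here through Jena against the public endpoint, might look as follows. The woc:references property and the endpoint URL come from the paper, while woc:Class is an assumed class IRI, so the exact query used by the authors may differ:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class TopReferencedClasses {
    public static void main(String[] args) {
        // Rank classes by how many distinct methods reference them, keep the top 3.
        String query =
            "PREFIX woc: <http://rdf.webofcode.org/woc/> "
          + "SELECT ?class (COUNT(DISTINCT ?method) AS ?count) "
          + "WHERE { "
          + "  ?method woc:references ?class . "
          + "  ?class a woc:Class . "               // assumed class IRI
          + "} "
          + "GROUP BY ?class ORDER BY DESC(?count) LIMIT 3";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://codeontology.org/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("class") + "\t"
                        + row.getLiteral("count").getInt());
            }
        }
    }
}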

Fig. 2. An excerpt of the RDF serialization produced for a simple “hello world” program.
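The original figure is not reproduced here; as a rough, assumption-laden substitute, the following Jena sketch produces the flavor of triples described in the text. Only woc:references appears in the paper; all entity IRIs and the remaining class names are invented for the example:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class HelloWorldTriplesSketch {
    static final String WOC = "http://rdf.webofcode.org/woc/";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("woc", WOC);

        Resource clazz  = m.createResource(WOC + "HelloWorld");
        Resource method = m.createResource(WOC + "HelloWorld.main");

        clazz.addProperty(RDF.type, m.createResource(WOC + "Class"));   // assumed class IRI
        method.addProperty(RDF.type, m.createResource(WOC + "Method")); // assumed class IRI

        // woc:references is named in the paper: main refers to java.lang.String
        // (its parameter type) and to java.lang.System (for standard output).
        method.addProperty(m.createProperty(WOC, "references"),
                           m.createResource(WOC + "java.lang.String"));
        method.addProperty(m.createProperty(WOC, "references"),
                           m.createResource(WOC + "java.lang.System"));

        m.write(System.out, "TURTLE");
    }
}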

Moreover, the extracted DBpedia links can be used to run highly expressive semantic queries over source code. For instance, we can retrieve all the methods that compute the cube root of a real number by selecting the resources associated with the entity dbpedia:Cube_root, as in the sketch below.
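A hedged sketch of such a query follows. The paper does not name the property that links code entities to DBpedia resources, so skos:related is used here purely as a placeholder, and woc:Method is likewise an assumed class IRI:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

public class CubeRootMethods {
    public static void main(String[] args) {
        String query =
            "PREFIX woc: <http://rdf.webofcode.org/woc/> "
          + "PREFIX dbpedia: <http://dbpedia.org/resource/> "
          + "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
          + "SELECT ?method WHERE { "
          + "  ?method a woc:Method ; "                 // assumed class IRI
          + "          skos:related dbpedia:Cube_root . " // placeholder linking property
          + "}";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://codeontology.org/sparql", query)) {
            qe.execSelect().forEachRemaining(
                    row -> System.out.println(row.getResource("method")));
        }
    }
}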

Besides OpenJDK, the system has also been tested on a sample of 20 Java repositories randomly collected from GitHub. Table 1 shows the execution times required to download the dependencies (in the form of JAR files), analyze the source code and process the previously downloaded JAR files; all times are expressed in seconds. Table 1 also reports the total number of RDF triples extracted from each project.

Table 1. Execution times for processing a sample of 20 Java projects and number of RDF triples extracted.

In some cases, it was not possible to analyze the source code, because Spoon failed to build the abstract syntax tree for various reasons, such as missing dependencies that could not be downloaded automatically. Nevertheless, the parser was able to extract a knowledge base consisting of more than 30.5 million RDF triples from only 20 repositories.

6 Conclusions and Future Work

CodeOntology is a project that consists of two contributions: an ontology describing structural entities common to all object-oriented programming languages, and a parser capable of serializing Java source code and bytecode into RDF triples. In this paper, we have described the core ideas underlying the design of the ontology and analyzed the architecture of the parser. Furthermore, CodeOntology can analyze Java comments in order to link entities extracted from source code to DBpedia resources. This way, it is possible to precisely search for specific software components using expressive semantic queries. In the future, we plan to develop a Question Answering system to hide the complexity of SPARQL queries and allow software components to be retrieved by means of questions in natural language. Moreover, it will be possible to dereference and execute the source code of the methods in the dataset, using the Web of Functions technology [19].