Objective

Natural language interfaces (also known as bots, chatbots, dialog systems, and virtual assistants) have recently attracted enormous attention because of their potential in hands-free environments (e.g., self-driving cars) and accessibility products [1]. Building such interfaces often requires training utterances (e.g., book a flight from Sydney to Paris) annotated with user intents (e.g., book-flight) and associated parameters (e.g., from = “Sydney”, to = “Paris”) [1, 2]. These utterances are used to train supervised models that detect a user's intent from what the user says. Because of the richness of human language, building effective natural language interfaces requires collecting a large and diverse set of annotated utterances [2]. This is typically done in two steps: (i) generating an initial utterance (known as the canonical utterance), and (ii) paraphrasing it to obtain more diverse training utterances [3,4,5]. In this paper, we focus on the first step, particularly for REST (Representational State Transfer) APIs, since they are one of the most common forms of intents [6].

With the growing number of REST APIs and their ever-changing interfaces (e.g., renaming parameters, adding/removing operations), virtual assistants need canonical utterances to be generated automatically in order to scale [4, 5]. Research has shown the feasibility of leveraging supervised machine translation techniques to generate canonical utterances for REST APIs [6, 7]. Ironically, such machine translation approaches themselves require training datasets to learn how to map REST API operations to canonical utterances. In this paper, we introduce the API2CAN dataset, which contains a large set of REST API operations (e.g., GET /series/{id}/actors) paired with their corresponding canonical templates. A canonical template is a canonical utterance in which parameter values have been replaced with placeholders (e.g., “get the list of actors of the TV series with id being <id>”).

Data description

To generate the API2CAN dataset, we obtained the OpenAPI specifications indexed in OpenAPI Directory [8]. The OpenAPI specification is a standard format for documenting the interface of a REST API, including its operations and their parameters. OpenAPI Directory is a Wikipedia for REST APIs and maintains OpenAPI specifications for a large number of REST APIs. We obtained the latest version of each API indexed in OpenAPI Directory, collecting 983 APIs that contain 18,277 operations in total. We then generated canonical utterances for each of the extracted operations as explained in [6]. In short, we converted the summary or description (e.g., “…gets the [actor](\#/definitions/Actor) by id. …”) of each operation (e.g., GET /actors/{id}) in three steps: (i) extracting a sentence starting with a verb (e.g., “gets the actor by id.”), (ii) converting the extracted sentence to its imperative form (e.g., “get the actor by id.”), and (iii) injecting the parameters (e.g., “get the actor by id being <id>.”). Finally, we manually cleaned the automatically generated utterances to ensure their quality. The resulting API2CAN dataset includes 14,370 pairs of operations and their corresponding canonical utterances; operations for which no canonical utterance could be generated were left out. Next, we randomly divided the generated dataset into three parts, as summarized in Table 1.
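The three conversion steps can be illustrated with a minimal sketch. This is not the implementation described in [6]: the function names, the small verb lookup table, and the rule for where a placeholder is attached are all assumptions made for illustration, and a real pipeline would more likely rely on a part-of-speech tagger and a lemmatizer.

```python
import re
from typing import Optional

# Assumption: a tiny hand-written verb table stands in for proper lemmatization.
THIRD_PERSON_TO_BASE = {
    "gets": "get", "returns": "return", "lists": "list",
    "creates": "create", "updates": "update", "deletes": "delete",
}
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")  # "[actor](#/x)" -> "actor"
PATH_PARAM = re.compile(r"\{(\w+)\}")                  # "{id}" -> "id"


def extract_sentence(description: str) -> Optional[str]:
    """Step (i): return the first sentence that starts with a known verb."""
    text = MARKDOWN_LINK.sub(r"\1", description)
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        sentence = sentence.strip(" .…")
        words = sentence.split()
        if words and words[0].lower() in THIRD_PERSON_TO_BASE:
            return sentence
    return None


def to_imperative(sentence: str) -> str:
    """Step (ii): replace the leading third-person verb with its base form."""
    first, _, rest = sentence.partition(" ")
    base = THIRD_PERSON_TO_BASE.get(first.lower(), first.lower())
    return f"{base} {rest}".strip()


def inject_parameters(sentence: str, path: str) -> str:
    """Step (iii): attach a "<name>" placeholder for every path parameter."""
    for name in PATH_PARAM.findall(path):
        placeholder = f"{name} being <{name}>"
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        if pattern.search(sentence):
            # The sentence already mentions the parameter: extend that mention.
            sentence = pattern.sub(lambda m: placeholder, sentence, count=1)
        else:
            # Otherwise append the parameter at the end of the sentence.
            sentence = f"{sentence} with {placeholder}"
    return sentence


def canonical_template(description: str, path: str) -> Optional[str]:
    """Run steps (i) to (iii); return None when no usable sentence is found."""
    sentence = extract_sentence(description)
    if sentence is None:
        return None
    return inject_parameters(to_imperative(sentence), path)


example = canonical_template(
    r"…gets the [actor](\#/definitions/Actor) by id. …", "/actors/{id}"
)
print(example)  # get the actor by id being <id>
```

Returning None for descriptions without a verb-initial sentence mirrors the fact that operations without a generated canonical utterance are not part of the dataset.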

Table 1 Overview of data files/data sets

The API2CAN dataset is now public and accessible from [14]. The dataset is stored as a JSON (JavaScript Object Notation) array in which each element represents a single operation, including its API and API version, endpoint (URL), HTTP verb (e.g., GET, POST), parameters, and canonical utterances, as shown in Fig. 1.

Fig. 1 Dataset schema: sample operation [https://doi.org/10.6084/m9.figshare.13332347]
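As a rough illustration of how such a JSON array might be consumed, the sketch below loads one record and turns it into (operation, canonical template) pairs of the kind a translation model would be trained on. The record and every key name in it (api, api_version, url, verb, parameters, canonical_utterances) are hypothetical; the published files should be checked for the actual field names.

```python
import json

# Hypothetical record: key names and values are assumptions, not the published schema.
SAMPLE = """
[
  {
    "api": "example-tv-api",
    "api_version": "1.0.0",
    "url": "/series/{id}/actors",
    "verb": "get",
    "parameters": [{"name": "id", "in": "path", "required": true}],
    "canonical_utterances": [
      "get the list of actors of the TV series with id being <id>"
    ]
  }
]
"""


def iter_pairs(operations):
    """Yield (operation signature, canonical template) training pairs."""
    for op in operations:
        signature = f'{op["verb"].upper()} {op["url"]}'
        for template in op["canonical_utterances"]:
            yield signature, template


operations = json.loads(SAMPLE)
pairs = list(iter_pairs(operations))
```

Each pair couples a source string such as "GET /series/{id}/actors" with its target template, which is the input/output shape a sequence-to-sequence model expects.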

The generated canonical utterances can be used for training bots for the existing REST APIs in the dataset. The API2CAN dataset can also be used to train machine translation systems that generate canonical utterances, which is required for new APIs. The API2CAN dataset is also accompanied by a set of microservices called API2CAN Service. The service is implemented as a standalone open-source REST service in Python and is accessible from [13]. This REST service provides several functionalities, as follows:

  • Parsing OpenAPI Specification: This microservice parses the given API specification (in YAML format) and extracts API elements, such as operations and their parameters, in JSON format.

  • Generating Canonical Utterances: This microservice generates canonical utterances using the approaches that were used to generate the API2CAN dataset. In a nutshell, two approaches are used, as introduced in [6]: (i) converting the operation summary, as briefly described in the previous section, and (ii) the resource-based translator, which relies on the notion of resources in REST APIs and is proposed in [6, 7].

  • Sampling Parameter Values: This microservice generates values (e.g., “Sydney”, “Paris”) for the parameters (e.g., “to”) of the given operation, based on the approaches introduced in [6]. The generated values can be used to populate the placeholders inside generated canonical utterances (e.g., “book a flight to Sydney”).
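To make the first and third functionalities more concrete, here is a minimal sketch of extracting operations from an already-parsed OpenAPI document and populating a template with sampled values. It does not reproduce the API2CAN Service: the function names, the candidate-value table, and the sample specification are assumptions. The spec is given as a Python dict to keep the example dependency-free; a YAML file would first be loaded with a YAML parser.

```python
import random
import re

# HTTP methods that may appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def extract_operations(spec: dict) -> list:
    """Collect verb, URL, summary, and parameter names for every operation."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        shared = item.get("parameters", [])  # parameters declared at path level
        for method, op in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            params = shared + op.get("parameters", [])
            operations.append({
                "verb": method.upper(),
                "url": path,
                "summary": op.get("summary", ""),
                "parameters": [p["name"] for p in params if "name" in p],
            })
    return operations


def sample_values(parameters, candidates, rng):
    """Pick one value per parameter from a (hypothetical) candidate table."""
    return {n: rng.choice(candidates[n]) for n in parameters if n in candidates}


def fill_template(template: str, values: dict) -> str:
    """Replace "<name>" placeholders; unknown placeholders are left untouched."""
    return re.sub(
        r"<\s*(\w+)\s*>",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )


# Assumed sample spec, written as the dict a YAML parser would produce.
spec = {
    "paths": {
        "/flights": {
            "post": {
                "summary": "Books a flight.",
                "parameters": [
                    {"name": "from", "in": "query"},
                    {"name": "to", "in": "query"},
                ],
            }
        }
    }
}

ops = extract_operations(spec)
candidates = {"from": ["Sydney"], "to": ["Paris"]}
values = sample_values(ops[0]["parameters"], candidates, random.Random(0))
utterance = fill_template("book a flight from <from> to <to>", values)
```

Leaving unmatched placeholders in place, rather than raising an error, makes it easy to spot parameters for which no value could be sampled.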

Limitations

Given that fulfilling complex intents usually requires a combination of operations [4, 15], canonical utterances also need to be generated for compositions of operations. This requires detecting the relations between operations and generating canonical templates for complex tasks (e.g., tasks requiring conditional operations or compositions of multiple operations). Supporting such cases requires further research.