
1 Introduction

Research Software [5] is increasingly recognized as a means to support the results described in scientific publications. Researchers typically document their software projects in code repositories, using README files (e.g., README.md) with instructions on how to install, set up and run their software tools. However, software documentation is usually written in natural language, which makes it challenging to automatically verify whether the installation steps required to make a software project work are accurate. In addition, it can be challenging for researchers to follow instructions written against different documentation conventions and make sure they work consistently together.

In this work we aim to address these issues by exploring and assessing the ability of state-of-the-art Large Language Models (LLMs) to extract installation methods (Plans) and their corresponding instructions (Steps) from README files. LLMs such as GPT-4 [21] and MIXTRAL [12] have become established as state-of-the-art approaches in various natural scientific language processing (NSLP) tasks related to knowledge extraction from human-authored scientific sources, such as software documentation hosted on public code-sharing services. LLMs have also shown promise in following instructions [26] and learning to use tools [25]. However, research in this area is still at an early stage.

Our goal in this work is twofold: given a README file, we aim to 1) detect all the available Plans (e.g., installation methods for different platforms or operating systems) and, 2) for each Plan, detect which steps are required to install the software project, as annotated by the authors. Our contributionsFootnote 1 include:

  1. PlanStep, a methodology to extract structured installation instructions from README files;

  2. An evaluation framework to assess the ability of LLMs to capture installation instructions, both in terms of Plans and Steps;

  3. An annotated corpus of 33 research software projects with their respective installation plans and steps.

We implement our approach by following our methodology to evaluate two state-of-the-art LLMs (LLaMA-2 [31] and MIXTRAL [12]) on both installation instruction tasks with our corpus of annotated projects.

The remainder of the paper is structured as follows. Section 2 discusses efforts relevant to ours, while Sect. 3 describes our approach. Section 4 describes our experimental setup and early results, Sect. 5 addresses our limitations and Sect. 6 concludes the paper.

2 Related Work

Extracting relevant information from scientific software documentation forms the foundation of complex knowledge extraction with Natural Language Processing (NLP) models, which use machine-learning (ML) approaches as basic building blocks [10].

Extracting action sequences from public platforms (e.g., GitHub, StackOverflow) or README files is an instance of the class of complex planning problems. Remarkably, the field of automated software engineering has rapidly developed novel LLM-based approaches to important problems, for instance, integrating tool documentation [23], detecting action sequences from manuals [18], testing software [39], traceability, and generating software specifications [42].

LLMs such as GPT-4 are based on the Transformer architecture [37], and have been shown to perform well on simple plan extraction and procedure mining [24], as well as on mining to support scientific material discovery [2, 38]. The fundamental constraint of their multi-step reasoning abilities, however, remains [19, 33, 34].

In the Knowledge Extraction (KE) field, foundational work builds on general-purpose and domain-specific metadata extractors, which have been successfully applied in a variety of tasks including the detection of scientific mentions [6], software metadata extraction [17, 32], and scientific knowledge graph creation [15]. The automated planning community has also continued to push the boundaries of approaches that learn how to extract plans [20] and action sequences from text, in both domain-specific [4, 8, 13] and general domains [18, 35]. Recently, [22, 43] and [26] have made impressive advances in the feasibility of connecting LLMs with massive API documentation. However, in most cases installation instructions, specifically plans and steps, are absent from the corresponding studies.

Recent work [9, 11] has achieved significant improvements in multi-step extraction tasks by using different prompt strategies [34, 36]. With these prompt strategies, the number of operations required to extract entities or events from text grows, which makes it more difficult to learn the semantics of the inputs because the instructions are not self-actionable, especially when several steps are involved. Early approaches also discussed missing descriptions in generated plans [30], and proposed learning a mapping from natural language instructions to sequences of actions for planning purposes [27]. In our experiments, this is reduced to a number of formal definitions, albeit at the cost of reduced effective resolution due to the ambiguity of natural language, an effect we plan to counteract with improved prompt variations using formal representations [7, 18], as described in Sect. 3.6.

To the best of our knowledge, this is the first approach relying entirely on LLMs to extract installation instructions from research software documentation. We focus on eliciting multi-step reasoning by LLMs in a zero-shot configuration. We ask LLMs to extract both the installation plans and step instructions, effectively decomposing the complex task of installing research software into two kinds of sub-problems: 1) extraction of installation methods to capture various ways of installing research software as Plan(s) from unstructured documentation, and 2) extraction of installation instructions to identify sequential actions for each method as Step(s) (i.e., Step(s) per Plan).

3 PlanStep: Extracting Installation Instructions from README Files

In this section we present PlanStep, our proposed approach designed to address limitations briefly outlined in Sect. 2. First, we describe the core goal and problem we attempt to solve. Next, we describe the PlanStep architecture and building blocks. Finally, we describe the data generation and corpus.

3.1 Classical Planning: Software Installation Instructions

The central objective of planning tasks for an intelligent assistant is to autonomously detect the sequence of steps to execute in order to accomplish a task. In the classical planning domain, this procedure relies on a formal representation of the planning domain and the problem instance, encompassing actions and their desired goals [9].

In our case, a problem instance within the installation instruction activity is illustrated in Fig. 1. This instance features research software with two alternate installation plans available in the README: "Install from pip" and "Install from source". Each plan is described fairly briefly, but detailed under the corresponding headers of the markdown file. Subsequently, installation steps outlining the requirements for setup and execution are displayed. For instance, Plan 1 (categorised as "Package Manager") contains three steps (or actions), while Plan 2, classified as "Source", involves one step. If we ask an intelligent assistant to autonomously decide on the sequence of steps needed to install this software, we might use an LLM to mine its documentation and break down the installation objective into smaller sub-tasks: first detecting the requirements, then identifying the available plans, and finally executing the necessary commands. Many installation procedures may not need planning. For example, the "Package Manager" plan usually entails a single step with a code block, showing exactly what users need to type at a command line to install the software. However, for complex installation plans such as "from Container" (e.g., running docker compose up, creating virtual environments, configuring public keys, etc.), planning allows the assistant to decide dynamically which steps to take. If we want an assistant to consider a software component and install it following its instructions, the task may be decomposed into different steps: 1) detect the alternate plans available as installation methods and, 2) for each installation method, detect its corresponding sequence of steps.

Fig. 1. An example of our experimental approach for PlanStep. A research software project includes two installation methods: a simple installation plan (i.e., Package Manager) and a complex one (i.e., Source). Each installation method has a different number of steps and configuration details.
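To make the setting concrete, the following sketch shows a hypothetical README excerpt of the kind depicted in Fig. 1, together with the structured representation PlanStep aims to produce. The wording of the excerpt, the step boundaries, and the field names are illustrative assumptions, not annotations taken from our corpus.

```python
# Hypothetical README excerpt (illustrative only, not from our corpus).
README_EXCERPT = """
## Installation

### Install from pip
1. Install Python >= 3.8
2. Run `pip install mytool`
3. Verify with `mytool --version`

### Install from source
1. Clone the repository and run `pip install -e .`
"""

# Target structured representation: one entry per installation method (Plan),
# each with its ordered list of Steps. The schema is an assumption.
EXPECTED_OUTPUT = {
    "plans": [
        {
            "type": "Package Manager",  # Plan 1 in Fig. 1
            "steps": [
                "Install Python >= 3.8",
                "Run `pip install mytool`",
                "Verify with `mytool --version`",
            ],
        },
        {
            "type": "Source",           # Plan 2 in Fig. 1
            "steps": ["Clone the repository and run `pip install -e .`"],
        },
    ]
}
```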

3.2 PlanStep Methodology

Consistency in the extracted installation methods across different software versions is key for researchers to accurately reproduce experiments, regardless of when and how a README file is accessed. Therefore, our method aims to consistently map human-readable instructions to installation Plans and Steps.

PlanStep receives as input the entire README file and aims to extract action sequences from natural language text, representing tasks at two distinct levels of granularity: 1) alternate installation methods (e.g., installation instructions for a specific operating system, installation instructions from source code, etc.), and 2) for each installation method, the sequence of steps associated with it.

Figure 2 depicts the methodology we followed for developing PlanStep, which comprises five stages: the first stage is to collect a set of research software for our study. Second, for each software component in the corpus, we retrieve the link for the code repository, if present, and extract the installation instruction text from its README file. Then, we inspect the original README and represent the alternate installation plans for each README in a structured format. Afterwards, for each entry, we prompt Large Language Models in order to detect plans and their corresponding steps. Finally, we design an evaluation framework to assess the quality of our results.
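A minimal sketch of this five-stage pipeline is given below. The function parameters are hypothetical placeholders for the stages described above; in practice the annotation and evaluation stages involve manual work rather than a single function call.

```python
from typing import Callable, Iterable

def run_planstep_study(
    records: Iterable[dict],                     # 1) collected research software corpus
    fetch_readme: Callable[[str], str],          # 2) retrieve README / installation text
    annotate: Callable[[str], dict],             # 3) manual structured annotation (Plans + Steps)
    prompt_llm: Callable[[str], dict],           # 4) zero-shot prompting of the LLM
    evaluate: Callable[[dict, dict], dict],      # 5) evaluation framework (F1, ROUGE)
) -> list:
    """Hypothetical outline of the PlanStep methodology shown in Fig. 2."""
    results = []
    for record in records:
        readme = fetch_readme(record["repo_url"])
        gold = annotate(readme)
        pred = prompt_llm(readme)
        results.append(evaluate(pred, gold))
    return results
```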

We limit ourselves to tasks that can be characterized as research software installation activities and involve a reasonable or necessary order of steps to be executed, such as manually setting up a software project component, installing additional libraries using package managers, running from isolated containers, or building from source.

Fig. 2. Overview of the methodology followed to collect research software and design an evaluation framework to assess PlanStep.

3.3 PlanStep Corpus Creation

To systematically evaluate LLM performance on extracting installation plans and steps across varying setups, complexity, and domains, we started by selecting a corpus of research papers with their codeFootnote 2 implementations from diverse Machine Learning (ML) areas and across different task categories. For this evaluation, however, we excluded papers without a link to a public repository on GitHub or GitLab.

All annotations were made separately by the authors of this paper and subsequently compared until consensus was achieved. We discussed each entry to determine the final set of steps and plans for each research software project. Very rarely, agreement on specific properties remained elusive even after this evaluation; these cases were resolved through additional discussion. In summary, our corpus comprises 33 actively maintained open-source research software projects.

3.4 Ground Truth Extraction for PlanStep

The 33 research software projects in our corpus were selected as study subjects. In a manual co-annotation process, we tasked annotators with identifying both the installation plans and the steps associated with each project's README. The installation plans varied in complexity and description style: some, like 'from pip', typically comprised a single-sentence step (excluding requirements), while others, such as 'from source', included multiple steps that account for various user environments and requirements. Additionally, we defined specific properties for each plan type, taking into account technology-specific support, such as package repositories like npm or PyPI. Further elaboration on these definitions is provided below:

A. Plan: represents the concept of an installation method available in a README, which is composed of steps that must be executed in a given order. For instance, "Source" is an instance of the Plan concept. A README can include one or multiple Plans in its installation instructions section. A brief explanation of the plan types, with examples, is provided in Table 1.

Table 1. Definition of plan types and examples found in our corpus

B. Step: represents the concept of a planned action that is part of a 'Plan' and must be executed sequentially. It may consist of either a single action or a group of actions. We define a 'Step' based on the original README text, where consecutive actions mentioned together are annotated as one step. For instance, Listing 1.1 illustrates this concept with a simple JSON example. In the example, the authors' original text describes the first step (Step1) as 'Clone this repository and install requirements', which encompasses two distinct actions: 'Clone this repository' and 'install requirements'. The second step (Step2) simply involves one action: 'Run the container with docker-compose'.

Listing 1.1. Example JSON annotation of a Plan with its Steps.
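Since the original listing is not reproduced here, the following is a minimal sketch of what such a JSON annotation could look like for the container example above; the exact schema and key names used in our corpus may differ.

```python
import json

# Hedged reconstruction of a Listing 1.1-style annotation (schema is an assumption).
annotation = {
    "plan": {
        "type": "Container",
        "steps": {
            "Step1": "Clone this repository and install requirements",  # two actions, one step
            "Step2": "Run the container with docker-compose",           # single action
        },
    }
}

print(json.dumps(annotation, indent=2))
```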

We manually examined cases where annotators disagreed. For example, significant confusion arose from overly complicated instructions in README files, particularly in cases where installation instructions were included in markdown subheadings, such as #Step1: Download the files, followed by a paragraph like Step1: Download the files with the following commands. We resolved these conflicts by removing the content of these subheadings and providing detailed annotations for the subsequent paragraph.

Next, we faced challenges when describing plan types and steps across supported technologies. For instance, while instructions for the package manager plan typically involve running pip install, the TorchCP library offered alternative installation methods like the TestPyPI server. To resolve this, we created a distinct plan named “package manager” and specified TestPyPI as the associated technology property.

Lastly, conflicts emerged concerning the inclusion of installation requirements. Some cases listed requirements within the installation instructions, while others placed them in a separate section, typically before the installation instructions begin. We decided to include these software requirement specifications only when they were part of the installation instruction section.

3.5 Distribution of the Installation Instructions of README Files

Table 2 shows descriptive statistics of the selected research software projects based on our annotations. We report four distinct installation Plans: binary, source, package manager and container. Notably, over half of our corpus exclusively relied on the "source" method for installing research software via README files. While "from source" was the most prevalent standalone method (66%), container and package manager plans were observed as standalone methods in only two and one cases, respectively. As anticipated, the "binary" method was not reported at all, indicating its rarity in open-source repositories such as GitHub. Unsurprisingly, the most popular research software tools (e.g., tensorflow or langchain) include instructions to install via a package manager, typically consisting of up to two steps. The Plans vary widely in their number of steps. For example, "simple" Plans (e.g., Package Manager and Container) consist of 2–3 steps, while "complex" ones featured up to 10 (see Table 2, column "Total Steps"). This diversity in the number of steps affects installation Plan length in two ways: 1) more steps introduce more complexity, and 2) additional instructions can serve as obstacles, requiring further actions for installation.

Approximately 44% of our samples offered multiple plans or combinations for software installation, suggesting a diverse landscape of installation approaches. Further analysis of these combinations revealed redundant information across many instruction sections, highlighting potential challenges for LLMs in accurately identifying plans and steps. For instance, the maximum length of installation instructions for the source plan reached approximately 1,765 tokens, underscoring the complexity and variability of these instructions. This diversity not only reflects the varied nature of installation plans but also poses challenges for LLMs in accurately parsing and selecting relevant instructions, potentially leading to errors in plan and step detection. The total average length of installation instructions across all subjects was 130.79 tokens.Footnote 3

Table 2. Statistics of plans and steps in the corpus. We report the number and average of entries ("ids") per plan type, multiple plans, the maximum total steps in a plan, and the length of installation instructions with parameters (TokenInstall).
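A short script along the following lines illustrates how such token counts can be computed; we assume a Hugging Face tokenizer here, which may differ from the tokenizer actually used for the reported figures (see Footnote 3).

```python
from statistics import mean
from transformers import AutoTokenizer

# Assumption: a Hugging Face tokenizer; the tokenizer behind the reported
# counts may differ (see Footnote 3).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def token_length(text: str) -> int:
    """Number of tokens in an installation-instruction section."""
    return len(tokenizer.encode(text, add_special_tokens=False))

# Toy examples standing in for the installation sections of our corpus.
installation_sections = ["pip install mytool", "git clone ... && make install"]
lengths = [token_length(t) for t in installation_sections]
print(max(lengths), round(mean(lengths), 2))
```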

3.6 PlanStep Prompting

This section introduces our PlanStep prompt templates for both tasks and explains them, as depicted in PROMPT101 and PROMPT201.

We directly instruct the LLMs, through prompt design, to describe the installation methods (Plan) and their corresponding installation instructions (Step) for each README. That is, a standard zero-shot prompt asks the LLM to perform two tasks, Plan and Step extraction, respectively. Since the LLM has no information about these terms, we describe the terms and their respective meanings alongside the task in the prompt. Consequently, the prompts used in our experiments can be categorised as follows:

PROMPT101. Zero-shot prompt template for the Plan extraction task.

Plan Prompting: This task is about extracting the installation methods described in a README as Plans. We name it PROMPT101; it contains the four unique Plan types and their definitions.

Step Prompting: This task asks the model to detect the installation instructions found in a README as Steps. We name it PROMPT201; it requests a list of Steps for a given installation plan.

PROMPT201. Zero-shot prompt template for the Step extraction task.
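As the original prompt listings are not reproduced here, the sketch below illustrates, with paraphrased and therefore hypothetical wording, how zero-shot prompts of this kind can be assembled: the Plan/Step definitions are prepended to the task description and the README content.

```python
# Hypothetical paraphrase of the PROMPT101/PROMPT201 structure; the wording
# below is illustrative and not the exact prompt text used in our experiments.

PLAN_DEFINITIONS = (
    "A Plan is an installation method described in a README. "
    "Possible Plan types: Binary, Source, Package Manager, Container."
)

def build_plan_prompt(readme_text: str) -> str:
    """PROMPT101-style prompt: extract the installation Plans from a README."""
    return (
        f"{PLAN_DEFINITIONS}\n\n"
        "Task: list all installation Plans found in the README below.\n\n"
        f"README:\n{readme_text}"
    )

def build_step_prompt(readme_text: str, plan_type: str) -> str:
    """PROMPT201-style prompt: extract the ordered Steps for a given Plan."""
    return (
        "A Step is a single action, or group of actions, executed in order as part of a Plan.\n\n"
        f"Task: list the Steps of the '{plan_type}' installation Plan in the README below.\n\n"
        f"README:\n{readme_text}"
    )
```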

4 Experiments

In order to evaluate the effectiveness of our approach, we conducted experiments to test the ability of LLMs to capture plans and the sequence of tasks required to install different software.

4.1 Experimental Setup

We employed Mixtral-8x7b-Instruct-v0.1 [12] and LLaMA2-70b-chat [31], which are two of the most widely used open-source LLMs with public access.Footnote 4 Both models demonstrate moderately good instruction-following capabilities [43]. Throughout our experiments, we maintained a temperature of 0 (argmax sampling) to ensure reproducible results. The ground-truth annotations and study subjects used to compare the LLMs' predicted responses in the experiments were those presented in Sect. 3.3 and Sect. 3.4.
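A minimal sketch of how such a model can be queried with greedy (temperature-0) decoding is shown below, assuming local inference through the Hugging Face transformers interface; the exact serving setup used in our experiments may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: local inference via Hugging Face transformers; any backend that
# performs greedy (argmax) decoding is equivalent for reproducibility purposes.
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # or "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy decoding, i.e. 'temperature 0' / argmax sampling."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, without the prompt tokens.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```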

4.2 Evaluation Metrics

To assess our proposed PlanStep method, we employed the following metrics, commonly used to measure the performance of LLMs on NLP-oriented tasks (as proposed by [3]):

  • F1-score: computed to compare the plans extracted by the LLMs with the ground-truth annotations.

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [16]: we report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) to evaluate the quality of the results by comparing the steps extracted by the LLMs with the ground-truth dataset (a minimal computation sketch is shown below).
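The sketch below illustrates how these metrics can be computed with standard libraries (scikit-learn for F1, the rouge_score package for ROUGE). The labels and step strings are toy examples, and the averaging and aggregation choices of our evaluation framework are assumptions here.

```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# --- Plan task: F1 over predicted vs. annotated plan types (toy example) ---
gold_plans = ["Source", "Package Manager", "Source", "Container"]
pred_plans = ["Source", "Package Manager", "Package Manager", "Container"]
plan_f1 = f1_score(gold_plans, pred_plans, average="macro")  # averaging choice is an assumption

# --- Step task: ROUGE between generated and annotated step descriptions ---
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
gold_step = "Clone this repository and install requirements"
pred_step = "Clone the repository and install the requirements"
rouge = scorer.score(gold_step, pred_step)

print(round(plan_f1, 3), {k: round(v.fmeasure, 3) for k, v in rouge.items()})
```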

4.3 Evaluation Results

The results of our evaluation are shown in Table 3 and Table 4 for the Plan and Step tasks, respectively. We present the performance of the open-source LLMs on both tasks with the standard zero-shot prompt templates (PROMPT101 and PROMPT201).

Plan-Task: To evaluate the effectiveness of the LLMs, we tested the models in different ways, measuring the change in performance on the plan task by comparing their generated plans with the ground-truth annotations, using a zero-shot approach. Table 3 summarises our results on the plan task and compares the performance of both LLMs.

Table 3. Results obtained on the Plan detection task.

Both LLMs achieved an F1-score of more than 50% with zero-shot prompting. LLaMA-2 exhibits superior performance over MIXTRAL on the plan task, outperforming it by 9% in our best experiment.

Step-Task: We evaluated performance with three metrics to measure the quality of the step-task analysis. Table 4 shows the performance of the models (i.e., MIXTRAL and LLaMA-2).

Table 4. Evaluation results for detecting task steps for each plan. The scores (%) for Rouge-1 (R1), Rouge-2 (R2), and Rouge-L (RL) for the generated step descriptions compare our results against the ground truth steps.

We further observe that while MIXTRAL consistently outperforms LLaMA-2 across all ROUGE scores (R1, R2, and RL) on the Step task, achieving approximately 15% higher scores, both models demonstrate similarly poor performance in adhering to optimal step orderings, with scores ranging from 0.29 to 0.46. These findings suggest that both models struggle with sequentially ordering the steps of an installation Plan.

4.4 Analysis

Results of Plan-Step task. Experimental results indicate that both LLMs scored an average of around 55% F1-score on the plan task, and around 37% ROUGE scores on the step task. This suggests that LLMs intrinsically vary in their abilities to solve complex tasks and reason efficiently, which is crucial for extracting plans and detecting steps accurately.

Error Analysis. We performed a detailed analysis of specific cases where the detection performance of the LLMs differs significantly from the annotations, to understand why certain steps were falsely detected. We manually studied all errors made by the LLMs and classified them into four categories. Table 5Footnote 5 shows the count of each error type on the Plan-Step tasks. E1: the model labels Plan and Step installation instructions incorrectly by reusing the prompt input, e.g., "Binary": ["Step 1: Definition Prompt"]; E2: cases where models include notes and code commands in their responses, resulting in falsely imputed new steps assigned to a wrong Plan; E3: situations where models extract steps correctly but assign them to the wrong Plan type, due to a mixture of verbs or words associated with different methods and a lack of context, e.g., if the word "pip" appears, the LLM directly assigns the corresponding step to "Package Manager"Footnote 6; O: errors in an unclassified category (e.g., summarizing steps, incorporating steps from a previous README, splitting steps, or inventing sentences as hallucinations). Further tables, plots and example error responses can be found in the Appendix.

Our results suggest that different Plans exhibit a wide diversity of error types: simple installation tasks with few actions ("Package" and "Binary") primarily encounter issues related to E3; notably, "Source" faces more issues with E1, indicating a significant impact of prompts on model performance. Across Plan types, we observe nearly identical results, suggesting a possible explanation: concise instructions in README files may significantly reduce these incorrect behaviors, leading to successful execution of installation steps. Additional experiments are needed to assess this hypothesis.

Table 5. Counts of PlanStep error types across different Plans. E1: wrong Plan category but correct Steps; E2: wrong order of steps but correct number; E3: wrong sequential order; O: others.

Effect of Prompts. Figure 3 shows an overview of the steps detected by each LLM. MIXTRAL and LLaMA-2 perform on par in detecting steps correctly (10 vs. 7) and incorrectly (8 vs. 9). However, the latter exhibits slightly worse performance than the former in over-detection cases (4 vs. 9), which is likely due to the model's tendency to insert prompt information into its responses. The definitions seem to inadvertently lead LLMs to incorrectly detect plans and steps by copying extra steps that were only present in the prompt definitions, i.e., E1: calling non-existing Plans by adding the prompt's input. This observation suggests the need for additional testing with zero-shot prompts for different installation plans (and reducing the definitions used in the prompt). More advanced zero-shot prompting methods [40], as well as chain-of-thought prompt strategies [41] to effectively guide LLMs in decomposing steps into smaller sub-tasks, will be investigated in our future work.

Fig. 3. Total count of steps detected for each Plan per LLM, compared with the ground truth. If an LLM detected fewer steps than the annotations, we consider it under-detection (under-d); if it detected more, over-detection (over-d). A correct detection (perfect) indicates that the number of steps agrees with the ground truth. An incorrect detection (incorrect) counts steps in a plan that are falsely detected, i.e., the LLM detected a plan with steps that are not part of our annotations.
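The categories in Fig. 3 can be illustrated with a small helper along these lines; this is a hedged sketch, as the actual comparison in our analysis was performed manually against the annotations.

```python
from typing import Optional

def classify_step_detection(n_predicted: Optional[int], n_gold: int) -> str:
    """Classify a plan's step detection result into the Fig. 3 categories.

    n_predicted is None when the LLM detected a plan whose steps are not part
    of the annotations at all (counted as 'incorrect'); this encoding is an
    assumption of the sketch.
    """
    if n_predicted is None:
        return "incorrect"
    if n_predicted == n_gold:
        return "perfect"
    return "under-d" if n_predicted < n_gold else "over-d"

# Toy usage:
print(classify_step_detection(2, 3))   # under-d
print(classify_step_detection(3, 3))   # perfect
```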

Few-shot prompt strategies such as LLM4PDDL [28], together with chain-of-thought prompts, may provide an expressive and extensible vocabulary for semantically writing and describing plans to machines. We plan to investigate this approach further in future work.

5 Discussion

This work aims to automatically extract all available installation information from research software documentation. Our experiments demonstrate that while LLMs show promising results, there is substantial room for improvement. During our analysis, we prioritized extracting concise plans and steps from software installation text using two LLMs. LLaMA-2 generally makes the fewest errors on the plan task, indicating higher accuracy in predicting installation methods. LLaMA-2, however, shows a progressively higher number of errors when dealing with steps, while MIXTRAL exhibits the opposite pattern. We observe that MIXTRAL's outputs are significantly more truthful than LLaMA-2's, with less randomness and creativity in their responses. Notably, the more steps involved, the more frequent the errors across both models, indicating the challenges faced in accurately predicting parameters for tools.

Moreover, the reliance on LLMs for the evaluation of plans and step instructions introduces new challenges. As LLMs' ability in planning tasks is under scientific scrutiny [14, 29], there is a crucial need for further validation and fine-tuning of their capabilities in this specific context.

We are in the initial phase of this experimental research project, and consequently, components of the PlanStep approach will certainly be updated and revised. First, we believe that designing combinations of few-shot prompt standards with the addition of a strict formal language will improve the ability of LLMs to detect plans, and their installation instructions, consistently. Second, additional evaluations are needed to validate the insights obtained in our experiments. For plan tasks, our approach may be compared with baseline models, measuring the change in performance. Third, increasing the size of our annotated corpus would be notably advantageous, enabling a broader exploration of alternative semantic approaches and formal representations. However, the manual nature of our instruction-writing process limits our capacity to scale this work significantly.

6 Conclusion and Future Work

In this work we presented an evaluation framework and initial experimentation for using LLMs as a means to extract alternate research software installation plans and their corresponding instructions. Our approach involves equipping the LLM with essential documentation tailored to installation instructions, enabling it to refine its accuracy when using the README and improve its performance in automatically detecting installation instructions. As part of our evaluation framework we have proposed an annotated corpus, which collects different research software projects with their installation instructions, to systematically evaluate LLMs on extraction tasks, including plans and the steps belonging to those plans.

Our experiments show promising results for both plan detection and step detection, although we are still a long way from our goal. We are currently extending our approach in different directions. First, we are augmenting the annotated corpus with additional README files of increasing complexity in order to create a comprehensive benchmark that distinguishes READMEs of different complexity. Second, we aim to improve the prompting strategies used in our approach, including few-shot examples, to better equip the model for the goal of each PlanStep task. Our central goal is to create an assistant that aids in installing research software while addressing issues that may currently exist in the installation instructions. Investigating further the addition of executable instructions in formalised, machine-readable languages from the classical planning research community, i.e., the Planning Domain Definition Language (PDDL) [1], and beyond, i.e., the P-Plan Ontology [7], is another of our research goals.