
1 Introduction

Research Software [5] is increasingly recognized as a means to support the results described in scientific publications. Researchers typically document their software projects in code repositories, using README files (e.g., README.md) with instructions on how to install, set up and run their software tools. However, software documentation is usually written in natural language, which makes it challenging to automatically verify whether the installation steps required to make a software project work are accurate. In addition, it can be challenging for researchers to follow instructions written against different documentation conventions and make sure they work consistently together.

In this work we aim to address these issues by exploring and assessing the ability of state-of-the-art Large Language Models (LLMs) to extract installation methods (Plans) and their corresponding instructions (Steps) from README files. LLMs such as GPT-4 [21] and MIXTRAL [12] have become established as state-of-the-art approaches in various natural scientific language processing (NSLP) tasks related to knowledge extraction from human-authored scientific sources, such as software documentation hosted on public code-sharing services. LLMs have also shown promise in following instructions [26] and learning to use tools [25]. However, research in this area is still at an early stage.

Our goal in this work is twofold: given a README file, we aim to 1) detect all the available Plans (e.g., installation methods for different platforms or operating systems) and, 2) for each Plan, detect which steps are required to install the software project, as annotated by the authors. Our contributionsFootnote 1 include:

  1. PlanStep, a methodology to extract structured installation instructions from README files;

  2. An evaluation framework to assess the ability of LLMs to capture installation instructions, both in terms of Plans and Steps;

  3. An annotated corpus of 33 research software projects with their respective installation plans and steps.

We implement our approach by following our methodology to evaluate two state-of-the-art LLMs (LLaMA-2 [31] and MIXTRAL [12]) on both installation instruction tasks with our corpus of annotated projects.

The remainder of the paper is structured as follows. Section 2 discusses efforts relevant to ours, while Sect. 3 describes our approach. Section 4 describes our experimental setup and early results, Sect. 5 addresses our limitations and Sect. 6 concludes the paper.

2 Related Work

Extracting relevant information from scientific software documentation forms the foundation of complex knowledge extraction with Natural Language Processing (NLP) models, which use machine-learning (ML) approaches as basic building blocks [10].

Extracting action sequences from public platforms (e.g., GitHub, StackOverflow) or README files is an instance of the class of complex planning problems. Remarkably, the field of automated software engineering has rapidly developed novel LLM-based approaches to important problems, for instance, integrating tool documentation [23], detecting action sequences from manuals [18], testing software [39], traceability, and generating software specifications [42].

LLMs such as GPT-4 are based on the Transformer architecture [37], and have been shown to perform well on simple plan extraction and procedure mining [24], as well as on mining to support scientific material discovery [2, 38]. The fundamental constraint of their multi-step reasoning abilities, however, remains [19, 33, 34].

In the Knowledge Extraction (KE) field, foundational work builds on general-purpose and domain-specific metadata extractors, which have been successfully applied in a variety of tasks including the detection of scientific mentions [6], software metadata extraction [17, 32], and scientific knowledge graph creation [15]. The automated planning community has also continued to push the boundaries of approaches that learn how to extract plans [20] and action sequences from text, in both domain-specific [4, 8, 13] and general domains [18, 35]. Recently, [22, 43] and [26] have made impressive advances in the feasibility of connecting LLMs with massive API documentation. However, in most cases installation instructions, specifically plans and steps, are absent from the corresponding studies.

Recent work [9, 11] has achieved significant improvements in multi-step extraction tasks by using different prompt strategies [34, 36]. With these prompt strategies, the number of operations required to extract entities or events from text grows, which makes it more difficult to learn the semantics of the inputs because the instructions are not self-actionable, especially when several steps are involved. Early approaches also discussed missing descriptions in generated plans [30], and proposed learning a mapping from natural language instructions to sequences of actions for planning purposes [27]. In our experiments, this is reduced to a number of formal definitions, albeit at the cost of reduced effective resolution due to the ambiguity of natural language, an effect we plan to counteract with improved prompt variations using formal representations [7, 18], as described in Sect. 3.6.

To the best of our knowledge, this is the first approach relying entirely on LLMs to extract installation instructions from research software documentation. We focus on eliciting multi-step reasoning by LLMs in a zero-shot configuration. We ask LLMs to extract both the installation plans and step instructions, effectively decomposing the complex task of installing research software into two kinds of sub-problems: 1) extraction of installation methods to capture various ways of installing research software as Plan(s) from unstructured documentation, and 2) extraction of installation instructions to identify sequential actions for each method as Step(s) (i.e., Step(s) per Plan).

3 PlanStep: Extracting Installation Instructions from README Files

In this section we present PlanStep, our proposed approach designed to address limitations briefly outlined in Sect. 2. First, we describe the core goal and problem we attempt to solve. Next, we describe the PlanStep architecture and building blocks. Finally, we describe the data generation and corpus.

3.1 Classical Planning: Software Installation Instructions

The central objective of planning tasks for an intelligent assistant is to autonomously detect the sequence of steps to execute in order to accomplish a task. In the classical planning domain, this procedure relies on a formal representation of the planning domain and the problem instance, encompassing actions and their desired goals [9].

In our case, a problem instance within the installation instruction activity is illustrated in Fig. 1. This instance features research software with two alternate installation plans available in the README: "Install from pip" and "Install from source". Each plan is described fairly briefly, but detailed under the corresponding headers of the markdown file. Subsequently, installation steps outlining the requirements for setup and execution are displayed. For instance, Plan 1 (categorised as "Package Manager") contains three steps (or actions), while Plan 2, classified as "Source", involves one step. If we ask an intelligent assistant to autonomously decide on the sequence of steps needed to install this software, we might use an LLM to mine its documentation and break down the installation objective into smaller sub-tasks: first detecting the requirements, then identifying the available plans, and finally executing the necessary commands. Many installation procedures may not need planning. For example, the "Package Manager" plan usually entails a single step with a code block, showing exactly what users need to type at a command line to install the software. However, for complex installation plans such as "from Container" (e.g., running docker compose up, creating virtual environments, configuring public keys, etc.), planning allows the assistant to decide dynamically which steps to take. If we want an assistant to consider a software component and install it following its instructions, the task may be decomposed into different steps: 1) detect the alternate plans available as installation methods and, 2) for each installation method, detect its corresponding sequence of steps.

Fig. 1. An example of our experimental approach for PlanStep. A research software project includes two installation methods: a simple installation plan (i.e., Package Manager) and a complex one (i.e., Source). Each installation method has a different number of steps and configuration details.
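To make the setting concrete, the following sketch shows a hypothetical README excerpt of the kind depicted in Fig. 1, together with the structured representation PlanStep aims to produce. The wording of the excerpt, the step boundaries, and the field names are illustrative assumptions, not annotations taken from our corpus.

```python
# Hypothetical README excerpt (illustrative only, not from our corpus).
README_EXCERPT = """
## Installation

### Install from pip
1. Install Python >= 3.8
2. Run `pip install mytool`
3. Verify with `mytool --version`

### Install from source
1. Clone the repository and run `pip install -e .`
"""

# Target structured representation: one entry per installation method (Plan),
# each with its ordered list of Steps. The schema is an assumption.
EXPECTED_OUTPUT = {
    "plans": [
        {
            "type": "Package Manager",  # Plan 1 in Fig. 1
            "steps": [
                "Install Python >= 3.8",
                "Run `pip install mytool`",
                "Verify with `mytool --version`",
            ],
        },
        {
            "type": "Source",           # Plan 2 in Fig. 1
            "steps": ["Clone the repository and run `pip install -e .`"],
        },
    ]
}
```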

3.2 PlanStep Methodology

Consistency in the extracted installation methods across different software versions is key for researchers to accurately reproduce experiments, regardless of when and how a README file is accessed. Therefore, our method aims to consistently map human-readable instructions to installation Plans and Steps.

PlanStep receives as input the entire README file and aims to extract action sequences from natural language text, representing tasks at two distinct levels of granularity: 1) alternate installation methods (e.g., installation instructions for a specific operating system, installation instructions from source code, etc.), and 2) for each installation method, the sequence of steps associated with it.

Figure 2 depicts the methodology we followed for developing PlanStep, which comprises five stages: the first stage is to collect a set of research software for our study. Second, for each software component in the corpus, we retrieve the link for the code repository, if present, and extract the installation instruction text from its README file. Then, we inspect the original README and represent the alternate installation plans for each README in a structured format. Afterwards, for each entry, we prompt Large Language Models in order to detect plans and their corresponding steps. Finally, we design an evaluation framework to assess the quality of our results.
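A minimal sketch of this five-stage pipeline is given below. The function parameters are hypothetical placeholders for the stages described above; in practice the annotation and evaluation stages involve manual work rather than a single function call.

```python
from typing import Callable, Iterable

def run_planstep_study(
    records: Iterable[dict],                     # 1) collected research software corpus
    fetch_readme: Callable[[str], str],          # 2) retrieve README / installation text
    annotate: Callable[[str], dict],             # 3) manual structured annotation (Plans + Steps)
    prompt_llm: Callable[[str], dict],           # 4) zero-shot prompting of the LLM
    evaluate: Callable[[dict, dict], dict],      # 5) evaluation framework (F1, ROUGE)
) -> list:
    """Hypothetical outline of the PlanStep methodology shown in Fig. 2."""
    results = []
    for record in records:
        readme = fetch_readme(record["repo_url"])
        gold = annotate(readme)
        pred = prompt_llm(readme)
        results.append(evaluate(pred, gold))
    return results
```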

We limit ourselves to tasks that can be characterized as research software installation activities and involve a reasonable or necessary order of steps to be executed, such as manually setting up a software project component, installing additional libraries using package managers, running from isolated containers, or building from source.

Fig. 2. Overview of the methodology followed to collect research software and design an evaluation framework to assess PlanStep.

3.3 PlanStep Corpus Creation

To systematically evaluate LLM performance on extracting installation plans and steps across varying setups, complexity, and domains, we started by selecting a corpus of research papers with their codeFootnote 2 implementations from diverse Machine Learning (ML) areas and across different task categories. For this evaluation, however, we excluded papers without a link to a public repository on GitHub or GitLab.

All annotations were made separately by the authors of this paper and subsequently compared until consensus was achieved. We discussed each entry to determine the final set of steps and plans for each research software project. Very rarely, agreement on specific properties remained elusive even after this evaluation; these cases were resolved through additional discussion. In summary, our corpus comprises 33 actively maintained open-source research software projects.

3.4 Ground Truth Extraction for PlanStep

The 33 research software projects in our corpus were selected as study subjects. In a manual co-annotation process, we tasked annotators with identifying both the installation plans and the steps associated with each project's README. The installation plans varied in complexity and description style: some, like 'from pip', typically comprised a single-sentence step (excluding requirements), while others, such as 'from source', included multiple steps that account for various user environments and requirements. Additionally, we defined specific properties for each plan type, taking into account technology-specific support, such as package repositories like npm or PyPI. Further elaboration on these definitions is provided below:

A. Plan: represents the concept of an installation method available in a README, which is composed of steps that must be executed in a given order. For instance, "Source" is an instance of the Plan concept. A README can include one or multiple Plans in its installation instructions section. A brief explanation of the plan types, with examples, is provided in Table 1.

Table 1. Definition of plan types and examples found in our corpus

B. Step: represents the concept of a planned action that is part of a 'Plan' and must be executed sequentially. It may consist of either a single action or a group of actions. We define a 'Step' based on the original README text, where consecutive actions mentioned together are annotated as one step. For instance, Listing 1.1 illustrates this concept with a simple JSON example. In the example, the authors' original text describes the first step (Step1) as 'Clone this repository and install requirements', which encompasses two distinct actions: 'Clone this repository' and 'install requirements'. The second step (Step2) simply involves one action: 'Run the container with docker-compose'.

Listing 1.1. Example JSON annotation of a Plan with its Steps.
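Since the original listing is not reproduced here, the following is a minimal sketch of what such a JSON annotation could look like for the container example above; the exact schema and key names used in our corpus may differ.

```python
import json

# Hedged reconstruction of a Listing 1.1-style annotation (schema is an assumption).
annotation = {
    "plan": {
        "type": "Container",
        "steps": {
            "Step1": "Clone this repository and install requirements",  # two actions, one step
            "Step2": "Run the container with docker-compose",           # single action
        },
    }
}

print(json.dumps(annotation, indent=2))
```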

We manually examined cases where annotators disagreed. For example, significant confusion arose from overly complicated instructions in README files, particularly in cases where installation instructions were included in markdown subheadings, such as #Step1: Download the files, followed by a paragraph like Step1: Download the files with the following commands. We resolved these conflicts by removing the content of these subheadings and providing detailed annotations for the subsequent paragraph.

Next, we faced challenges when describing plan types and steps across supported technologies. For instance, while instructions for the package manager plan typically involve running pip install, the TorchCP library offered alternative installation methods like the TestPyPI server. To resolve this, we created a distinct plan named “package manager” and specified TestPyPI as the associated technology property.

Lastly, conflicts emerged concerning the inclusion of installation requirements. Some cases listed requirements within the installation instructions, while others placed them in a separate section, typically before the installation instructions begin. We decided to include these software requirement specifications only when they were part of the installation instruction section.

3.5 Distribution of the Installation Instructions of README Files

Table 2 shows descriptive statistics of the selected research software projects based on our annotations. We report four distinct installation Plans: binary, source, package manager and container. Notably, over half of our corpus exclusively relied on the "source" method for installing research software via README files. While "from source" was the most prevalent standalone method (66%), container and package manager plans were observed as standalone methods in only two and one cases, respectively. As anticipated, the "binary" method was not reported at all, indicating its rarity in open-source repositories such as GitHub. Unsurprisingly, the most popular research software tools (e.g., tensorflow or langchain) include instructions to install via a package manager, typically consisting of up to two steps. The Plans vary widely in their number of steps. For example, "simple" Plans (e.g., Package Manager and Container) consist of 2–3 steps, while "complex" ones featured up to 10 (see Table 2, column "Total Steps"). This diversity in the number of steps affects installation Plan length in two ways: 1) more steps introduce more complexity, and 2) additional instructions can serve as obstacles, requiring further actions for installation.

Approximately 44% of our samples offered multiple plans or combinations for software installation, suggesting a diverse landscape of installation approaches. Further analysis of these combinations revealed redundant information across many instruction sections, highlighting potential challenges for LLMs in accurately identifying plans and steps. For instance, the maximum length of installation instructions for the source plan reached approximately 1,765 tokens, underscoring the complexity and variability of these instructions. This diversity not only reflects the varied nature of installation plans but also poses challenges for LLMs in accurately parsing and selecting relevant instructions, potentially leading to errors in plan and step detection. The total average length of installation instructions across all subjects was 130.79 tokens.Footnote 3

Table 2. Statistics of plans and steps in the corpus. We report the number and average of entries ("ids") per plan type, multiple plans, the maximum total steps in a plan, and the length of installation instructions with parameters (TokenInstall).
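A short script along the following lines illustrates how such token counts can be computed; we assume a Hugging Face tokenizer here, which may differ from the tokenizer actually used for the reported figures (see Footnote 3).

```python
from statistics import mean
from transformers import AutoTokenizer

# Assumption: a Hugging Face tokenizer; the tokenizer behind the reported
# counts may differ (see Footnote 3).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def token_length(text: str) -> int:
    """Number of tokens in an installation-instruction section."""
    return len(tokenizer.encode(text, add_special_tokens=False))

# Toy examples standing in for the installation sections of our corpus.
installation_sections = ["pip install mytool", "git clone ... && make install"]
lengths = [token_length(t) for t in installation_sections]
print(max(lengths), round(mean(lengths), 2))
```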

3.6 PlanStep Prompting

This section introduces our PlanStep prompt templates for both tasks and explains them, as depicted in PROMPT101 and PROMPT201.

We directly instruct the LLMs, through prompt design, to describe the installation methods (Plan) and their corresponding installation instructions (Step) for each README. That is, a standard zero-shot prompt asks the LLM to perform two tasks, Plan and Step extraction, respectively. Since the LLM has no information about these terms, we describe the terms and their respective meanings alongside the task in the prompt. Consequently, the prompts used in our experiments can be categorised as follows:

PROMPT101. Zero-shot prompt template for the Plan extraction task.

Plan Prompting: This task is about extracting the installation methods described in a README as Plans. We name it PROMPT101; it contains the four unique Plan types and their definitions.

Step Prompting: This task asks the model to detect the installation instructions found in a README as Steps. We name it PROMPT201; it requests a list of Steps for a given installation plan.

PROMPT201. Zero-shot prompt template for the Step extraction task.
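As the original prompt listings are not reproduced here, the sketch below illustrates, with paraphrased and therefore hypothetical wording, how zero-shot prompts of this kind can be assembled: the Plan/Step definitions are prepended to the task description and the README content.

```python
# Hypothetical paraphrase of the PROMPT101/PROMPT201 structure; the wording
# below is illustrative and not the exact prompt text used in our experiments.

PLAN_DEFINITIONS = (
    "A Plan is an installation method described in a README. "
    "Possible Plan types: Binary, Source, Package Manager, Container."
)

def build_plan_prompt(readme_text: str) -> str:
    """PROMPT101-style prompt: extract the installation Plans from a README."""
    return (
        f"{PLAN_DEFINITIONS}\n\n"
        "Task: list all installation Plans found in the README below.\n\n"
        f"README:\n{readme_text}"
    )

def build_step_prompt(readme_text: str, plan_type: str) -> str:
    """PROMPT201-style prompt: extract the ordered Steps for a given Plan."""
    return (
        "A Step is a single action, or group of actions, executed in order as part of a Plan.\n\n"
        f"Task: list the Steps of the '{plan_type}' installation Plan in the README below.\n\n"
        f"README:\n{readme_text}"
    )
```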

4 Experiments

In order to evaluate the effectiveness of our approach, we conducted experiments to test the ability of LLMs to capture plans and the sequence of tasks required to install different software.

4.1 Experimental Setup

We employed Mixtral-8x7b-Instruct-v0.1 [12] and LLaMA2-70b-chat [31], which are two of the most widely used open-source LLMs with public access.Footnote 4 Both models demonstrate moderately good instruction-following capabilities [43]. Throughout our experiments, we maintained a temperature of 0 (argmax sampling) to ensure reproducible results. The ground-truth annotations and study subjects used to compare the LLMs' predicted responses in the experiments were those presented in Sect. 3.3 and Sect. 3.4.
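A minimal sketch of how such a model can be queried with greedy (temperature-0) decoding is shown below, assuming local inference through the Hugging Face transformers interface; the exact serving setup used in our experiments may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: local inference via Hugging Face transformers; any backend that
# performs greedy (argmax) decoding is equivalent for reproducibility purposes.
MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # or "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy decoding, i.e. 'temperature 0' / argmax sampling."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation, without the prompt tokens.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```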

4.2 Evaluation Metrics

To assess our proposed PlanStep method, we employed the following metrics, commonly used to measure the performance of LLMs on NLP-oriented tasks (as proposed by [3]):

  • F1-score: computed to compare the plans extracted by the LLMs with the ground-truth annotations.

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [16]: we report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) to evaluate the quality of the results by comparing the steps extracted by the LLMs with the ground-truth dataset (a minimal computation sketch is shown below).
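The sketch below illustrates how these metrics can be computed with standard libraries (scikit-learn for F1, the rouge_score package for ROUGE). The labels and step strings are toy examples, and the averaging and aggregation choices of our evaluation framework are assumptions here.

```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# --- Plan task: F1 over predicted vs. annotated plan types (toy example) ---
gold_plans = ["Source", "Package Manager", "Source", "Container"]
pred_plans = ["Source", "Package Manager", "Package Manager", "Container"]
plan_f1 = f1_score(gold_plans, pred_plans, average="macro")  # averaging choice is an assumption

# --- Step task: ROUGE between generated and annotated step descriptions ---
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
gold_step = "Clone this repository and install requirements"
pred_step = "Clone the repository and install the requirements"
rouge = scorer.score(gold_step, pred_step)

print(round(plan_f1, 3), {k: round(v.fmeasure, 3) for k, v in rouge.items()})
```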

4.3 Evaluation Results

The results of our evaluation are shown in Table 3 and Table 4 for the Plan and Step tasks, respectively. We present the performance of the open-source LLMs on both tasks with the standard zero-shot prompt templates (PROMPT101 and PROMPT201).

Plan-Task: To evaluate the effectiveness of the LLMs, we tested the models in different ways, measuring the change in performance on the plan task by comparing their generated plans with the ground-truth annotations, using a zero-shot approach. Table 3 summarises our results on the plan task and compares the performance of both LLMs.

Table 3. Results obtained on the Plan detection task.

Both LLMs achieved an F1-score of more than 50% with zero-shot prompting. LLaMA-2 exhibits superior performance over MIXTRAL on the plan task, outperforming it by 9% in our best experiment.

Step-Task: We evaluated performance with three metrics to measure the quality of the step-task analysis. Table 4 shows the performance of the models (i.e., MIXTRAL and LLaMA-2).

Table 4. Evaluation results for detecting task steps for each plan. The scores (%) for Rouge-1 (R1), Rouge-2 (R2), and Rouge-L (RL) for the generated step descriptions compare our results against the ground truth steps.

We further observe that while MIXTRAL consistently outperforms LLaMA-2 across all ROUGE scores (R1, R2, and RL) on the Step task, achieving approximately 15% higher scores, both models demonstrate similarly poor performance in adhering to optimal step orderings, with scores ranging from 0.29 to 0.46. These findings suggest that both models struggle with sequentially ordering the steps of an installation Plan.

4.4 Analysis

Results of Plan-Step task. Experimental results indicate that both LLMs scored an average of around 55% F1-score on the plan task, and around 37% ROUGE scores on the step task. This suggests that LLMs intrinsically vary in their abilities to solve complex tasks and reason efficiently, which is crucial for extracting plans and detecting steps accurately.

Error Analysis. We performed a detailed analysis of specific cases where the detection performance of the LLMs differs significantly from the annotations, to understand why certain steps were falsely detected. We manually studied all errors made by the LLMs and classified them into four categories. Table 5Footnote 5 shows the count of each error type on the Plan-Step tasks. E1: the model labels Plan and Step installation instructions incorrectly by reusing the prompt input, e.g., "Binary": ["Step 1: Definition Prompt"]; E2: cases where models include notes and code commands in their responses, resulting in falsely imputed new steps assigned to a wrong Plan; E3: situations where models extract steps correctly but assign them to the wrong Plan type, due to a mixture of verbs or words associated with different methods and a lack of context, e.g., if the word "pip" appears, the LLM directly assigns the corresponding step to "Package Manager"Footnote 6; O: errors in an unclassified category (e.g., summarizing steps, incorporating steps from a previous README, splitting steps, or inventing sentences as hallucinations). Further tables, plots and example error responses can be found in the Appendix.

Our results suggest that different Plans exhibit a wide diversity of error types: simple installation tasks with few actions ("Package" and "Binary") primarily encounter issues related to E3; notably, "Source" faces more issues with E1, indicating a significant impact of prompts on model performance. Across Plan types, we observe nearly identical results, suggesting a possible explanation: concise instructions in README files may significantly reduce these incorrect behaviors, leading to successful execution of installation steps. Additional experiments are needed to assess this hypothesis.

Table 5. Counts of PlanStep error types across different Plans. E1: wrong Plan category but correct Steps; E2: wrong order of steps but correct number; E3: wrong sequential order; O: others.

Effect of Prompts. Figure 3 shows an overview of the steps detected by each LLM. MIXTRAL and LLaMA-2 perform on par in detecting steps correctly (10 vs. 7) and incorrectly (8 vs. 9). However, the latter exhibits slightly worse performance than the former in over-detection cases (4 vs. 9), which is likely due to the model's tendency to insert prompt information into its responses. The definitions seem to inadvertently lead LLMs to incorrectly detect plans and steps by copying extra steps that were only present in the prompt definitions, i.e., E1: calling non-existing Plans by adding the prompt's input. This observation suggests the need for additional testing with zero-shot prompts for different installation plans (and reducing the definitions used in the prompt). More advanced zero-shot prompting methods [40], as well as chain-of-thought prompt strategies [41] to effectively guide LLMs in decomposing steps into smaller sub-tasks, will be investigated in our future work.

Fig. 3. Total count of steps detected for each Plan per LLM, compared with the ground truth. If an LLM detected fewer steps than the annotations, we consider it under-detection (under-d); if it detected more, over-detection (over-d). A correct detection (perfect) indicates that the number of steps agrees with the ground truth. An incorrect detection (incorrect) counts steps in a plan that are falsely detected, i.e., the LLM detected a plan with steps that are not part of our annotations.
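The categories in Fig. 3 can be illustrated with a small helper along these lines; this is a hedged sketch, as the actual comparison in our analysis was performed manually against the annotations.

```python
from typing import Optional

def classify_step_detection(n_predicted: Optional[int], n_gold: int) -> str:
    """Classify a plan's step detection result into the Fig. 3 categories.

    n_predicted is None when the LLM detected a plan whose steps are not part
    of the annotations at all (counted as 'incorrect'); this encoding is an
    assumption of the sketch.
    """
    if n_predicted is None:
        return "incorrect"
    if n_predicted == n_gold:
        return "perfect"
    return "under-d" if n_predicted < n_gold else "over-d"

# Toy usage:
print(classify_step_detection(2, 3))   # under-d
print(classify_step_detection(3, 3))   # perfect
```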

Few-shot prompt strategies such as LLM4PDDL [28], together with chain-of-thought prompts, may provide an expressive and extensible vocabulary for semantically writing and describing plans to machines. We plan to investigate this approach further in future work.

5 Discussion

This work aims to automatically extract all available installation information from research software documentation. Our experiments demonstrate that while LLMs show promising results, there is substantial room for improvement. During our analysis, we prioritized extracting concise plans and steps from software installation text using two LLMs. LLaMA-2 generally makes the fewest errors on the plan task, indicating higher accuracy in predicting installation methods. LLaMA-2, however, shows a progressively higher number of errors when dealing with steps, while MIXTRAL exhibits the opposite pattern. We observe that MIXTRAL's outputs are significantly more truthful than LLaMA-2's, with less randomness and creativity in their responses. Notably, the more steps involved, the more frequent the errors across both models, indicating the challenges faced in accurately predicting parameters for tools.

Moreover, the reliance on LLMs for the evaluation of plans and step instructions introduces new challenges. As LLMs' ability in planning tasks is under scientific scrutiny [14, 29], there is a crucial need for further validation and fine-tuning of their capabilities in this specific context.

We are in the initial phase of this experimental research project, and consequently, components of the PlanStep approach will certainly be updated and revised. First, we believe that designing combinations of few-shot prompt standards with the addition of a strict formal language will improve the ability of LLMs to detect plans, and their installation instructions, consistently. Second, additional evaluations are needed to validate the insights obtained in our experiments. For plan tasks, our approach may be compared with baseline models, measuring the change in performance. Third, increasing the size of our annotated corpus would be notably advantageous, enabling a broader exploration of alternative semantic approaches and formal representations. However, the manual nature of our instruction-writing process limits our capacity to scale this work significantly.

6 Conclusion and Future Work

In this work we presented an evaluation framework and initial experimentation for using LLMs as a means to extract alternate research software installation plans and their corresponding instructions. Our approach involves equipping the LLM with essential documentation tailored to installation instructions, enabling it to refine its accuracy when using the README and improve its performance in automatically detecting installation instructions. As part of our evaluation framework we have proposed an annotated corpus, which collects different research software projects with their installation instructions, to systematically evaluate LLMs on extraction tasks, including plans and the steps belonging to those plans.

Our experiments show promising results for both plan detection and step detection, although we are still a long way from our goal. We are currently extending our approach in different directions. First, we are augmenting the annotated corpus with additional README files of increasing complexity in order to create a comprehensive benchmark that distinguishes READMEs of different complexity. Second, we aim to improve the prompting strategies used in our approach, including few-shot examples, to better equip the model for the goal of each PlanStep task. Our central goal is to create an assistant that aids in installing research software while addressing issues that may currently exist in the installation instructions. Investigating further the addition of executable instructions in formalised, machine-readable languages from the classical planning research community, i.e., the Planning Domain Definition Language (PDDL) [1], and beyond, i.e., the P-Plan Ontology [7], is another of our research goals.