Information correspondence between types of documentation for APIs


Documentation for programming languages and their APIs takes many forms, such as reference documentation, blog posts, and other textual and visual media. Prior research has suggested that developers switch between reference and tutorial-like documentation while learning a new API. Documentation creation and maintenance is also an effort-intensive process that requires its creators to carefully inspect and organize information while ensuring consistency across different sources. This article reports on the relationship between information in tutorials and in the API reference documentation of three libraries, on the topics of regular expressions, uniform resource locators, and input/output, in two programming languages, Java and Python. Our investigation reveals that about half of the sentences in the tutorials studied describe API information, i.e., the syntax, behaviour, usage, and performance of the API, that could be found in the reference documentation. The remaining sentences describe tutorial-specific use cases and examples. We also elicited and analyzed six types of correspondence between sentences in tutorials and reference documentation, ranging from identical to implied. Based on our findings, we propose a general information reuse pattern as a structured abstraction to represent the systematic integration of information from the reference documentation into a tutorial. We report on the distribution of 38 instances of this pattern, and on the impact of applying the pattern automatically to the existing tutorials. This work lays a foundation for understanding the nature of information correspondence across different documentation types to inform and assist documentation generation and maintenance.




  1.

    Links to documentation are referenced at the end of this article under “References to Documentation Sources”.

  2.

    We selected a sample size of 332, using as a guide the sample size required for a minimum confidence interval of 5% for the proportion of sentences in agreement based on simple random sampling. We used stratified sampling, where a number of sentences is randomly drawn from each of the six tutorials in proportion to the number of sentences in the tutorial. This prevents the calculation of an exact confidence interval.
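    As an illustration of the proportional allocation this stratified design implies (the tutorial names and sentence counts below are hypothetical, not the study's actual data), a minimal sketch:

```python
def proportional_allocation(tutorial_sizes, total_sample):
    """Allocate a total sample across strata (tutorials) in proportion
    to the number of sentences each stratum contains."""
    population = sum(tutorial_sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in tutorial_sizes.items()}

# Hypothetical sentence counts per tutorial, for illustration only.
alloc = proportional_allocation({"Java-REGEX": 100, "Python-REGEX": 300}, 40)
```

    Rounding each stratum independently can make the allocations sum to slightly more or less than the total; a production implementation would reconcile the remainder.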

  3.

    We experimented with an automated technique based on computing the Jaccard index between sentences. Not surprisingly, the precision and recall were so low that producing a reliable analysis required a manual review as onerous as a complete manual inspection.
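    A minimal sketch of such a Jaccard-based comparison, computed over the word sets of two sentences (an illustration, not the exact technique used in the experiment):

```python
def jaccard(sentence_a, sentence_b):
    """Jaccard index between the word sets of two sentences:
    |intersection| / |union| of their lower-cased tokens."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a and not b:
        return 1.0  # two empty sentences are trivially identical
    return len(a & b) / len(a | b)
```

    Because a tutorial sentence and its reference counterpart often paraphrase each other with little lexical overlap, such token-set overlap scores low on genuine matches, which is consistent with the low precision and recall reported above.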


  1. Al Omran FNA, Treude C (2017) Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR). IEEE, pp 187–197

  2. Angelini G (2018) Current practices in Web API documentation. In: European academic colloquium on technical communication, p 70

  3. Caponi A, Di Iorio A, Vitali F, Alberti P, Scatá M (2018) Exploiting patterns and templates for technical documentation. In: Proceedings of the ACM symposium on document engineering 2018. DocEng ’18. Association for Computing Machinery, New York

  4. Checkstyle - JavadocStyle (2020) (Online; Accessed 7 May 2020)

  5. Cleland-Huang J, Guo J (2014) Towards more intelligent trace retrieval algorithms. In: Proceedings of the 3rd international workshop on realizing artificial intelligence synergies in software engineering, pp 1–6

  6. Dagenais B, Robillard MP (2010) Creating and evolving developer documentation: understanding the decisions of open source contributors. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering, pp 127–136

  7. Dagenais B, Robillard MP (2014) Using traceability links to recommend adaptive changes for documentation evolution. IEEE Trans Softw Eng 40:1126–1146


  8. Dekel U, Herbsleb JD (2009) Improving API documentation usability with knowledge pushing. In: Proceedings of the 31st international conference on software engineering, pp 320–330

  9. Forward A, Lethbridge TC (2002) The relevance of software documentation, tools and technologies: a survey. In: Proceedings of the ACM symposium on document engineering, pp 26–33

  10. Fourney A, Terry M (2014) Mining online software tutorials: challenges and open problems. In: Proceedings of extended abstracts on human factors in computing systems, pp 653–664

  11. Garousi G, Garousi V, Moussavi M, Ruhe G, Smith B (2013) Evaluating usage and quality of technical software documentation: an empirical study. In: Proceedings of the 17th international conference on evaluation and assessment in software engineering, pp 24–35

  12. Git-commit (2020). (Online; Accessed 7 May 2020)

  13. IEEE Standard (2009) IEEE standard for information technology–systems design–software design descriptions. In: IEEE STD 1016-2009, pp 1–35

  14. Jiang H, Zhang J, Ren Z, Zhang T (2017) An unsupervised approach for discovering relevant tutorial fragments for APIs. In: Proceedings of the 39th international conference on software engineering, pp 38–48

  15. Josyula JRA, Panamgipalli SSSC (2016) Identifying the information needs and sources of software practitioners: a mixed method approach. Master’s thesis.

  16. Koznov D, Luciv D, Basit HA, Lieh OE, Smirnov M (2015) Clone detection in reuse of software technical documentation. In: Proceedings of international Andrei Ershov memorial conference on perspectives of system informatics, pp 170–185

  17. Koznov D, Luciv D, Chernishev G (2017) Duplicate management in software documentation maintenance. In: Proceedings of the 5th international conference on actual problems of system and software engineering. CEUR Workshops proceedings, vol 1989, pp 195–201

  18. Kramer D (1999) API documentation from source code comments: a case study of Javadoc. In: Proceedings of the 17th annual international conference on computer documentation, pp 147–153

  19. Krippendorff K (2018) Content analysis: an introduction to its methodology. Sage Publications, Thousand Oaks


  20. Luciv D, Koznov D, Basit H, Terekhov A (2016) On fuzzy repetitions detection in documentation reuse. Program Comput Softw 42:216–224


  21. Luciv D, Koznov D, Chernishev G, Terekhov A, Romanovsky KY, Grigoriev D (2018) Detecting near duplicates in software documentation. Program Comput Softw 44:335–343


  22. Maalej W, Robillard MP (2013) Patterns of knowledge in API reference documentation. IEEE Trans Softw Eng 39:1264–1282

  23. Meng M, Steinhardt S, Schubert A (2018) Application programming interface documentation: what do software developers want? J Tech Writ Commun 48:295–330


  24. Meng M, Steinhardt S, Schubert A (2019) How developers use API documentation: an observation study. Commun Des Q Rev 7:40–49


  25. Monperrus M, Eichberg M, Tekes E, Mezini M (2012) What should developers be aware of? An empirical study on the directives of API documentation. Empir Softw Eng 17:703–737


  26. Oumaziz MA, Charpentier A, Falleri JR, Blanc X (2017) Documentation reuse: hot or not? An empirical study. In: Proceedings of international conference on software reuse, pp 12–27

  27. Parnin C, Treude C (2011) Measuring API documentation on the web. In: Proceedings of the 2nd international workshop on Web 2.0 for software engineering, pp 25–30

  28. Phoha V (1997) A standard for software documentation. Computer 30:97–98


  29. Ries R (1990) IEEE standard for software user documentation. In: International conference on professional communication, communication across the sea: North American and European practices, pp 66–68

  30. Robillard MP (2009) What makes APIs hard to learn? Answers from developers. IEEE Softw 26:27–34

  31. Robillard MP, Deline R (2011) A field study of API learning obstacles. Empir Softw Eng 16:703–732


  32. Runeson P, Host M, Rainer A, Regnell B (2012) Case study research in software engineering: guidelines and examples. Wiley, New York


  33. Rupakheti CR (2012) A critic for API client code using symbolic execution. PhD thesis, Clarkson University

  34. Treude C, Robillard MP (2016) Augmenting API documentation with insights from stack overflow. In: Proceedings of 38th international conference on software engineering, pp 392–403

  35. Treude C, Robillard MP, Dagenais B (2014) Extracting development tasks to navigate software documentation. IEEE Trans Softw Eng 41:565–581


  36. Uddin G, Robillard MP (2015) How API documentation fails. IEEE Softw 32:68–75


  37. Watson RB (2012) Development and application of a heuristic to assess trends in API documentation. In: Proceedings of the 30th ACM international conference on design of communication, pp 295–302

  38. Watson R, Stamnes M, Jeannot-Schroeder J, Spyridakis JH (2013) API documentation and software community values: a survey of open-source API documentation. In: Proceedings of the 31st ACM international conference on design of communication, pp 165–174

  39. Wikipedia Contributors (2020) Wikipedia: manual of style/lead section. (Online; Accessed 7 May 2020)

  40. Wildermann S (2014) Messung der Informationstypen-Häufigkeiten in der Python-Dokumentation. Bachelor’s thesis

  41. Zhong H, Zhang L, Xie T, Mei H (2009) Inferring resource specifications from natural language API documentation. In: Proceedings of the international conference on automated software engineering, pp 307–318


Author information



Corresponding author

Correspondence to Deeksha M. Arya.


Communicated by: Martin Monperrus


Appendix A: Preprocessing Steps

In general, we adhered to the following rules and preprocessing techniques for sentence extraction:

  • Remove HTML tags script, style, table

  • Insert a comma after the tokens ‘e.g.’ and ‘i.e.’

  • Insert a comma after the token ‘etc.’ if the following word began with a lower-case letter.

  • Replace multiple adjacent commas (occurring as a result of previous preprocessing steps) with a single comma.

  • Replace newlines with spaces

  • Replace multiple adjacent spaces with a single space

  • Replace multiple adjacent periods (...) with a single period (.)

  • In general, blockquotes, code blocks, images and the equivalents across the files were replaced by a single token BLOCKQUOTE, CODE and IMAGE respectively. These blocks were identified as being of a specific HTML tag type or having a specific HTML class.

  • If a list item did not end in a period, the following item would be concatenated to the previous, separated by a semicolon.

  • Finally, split on a period followed by a space (‘. ’) or an exclamation mark followed by a space (‘! ’) to produce individual sentences

It is important to note here that inline HTML code tags in a sentence (which, being inline, did not involve line breaks) were maintained as is. Usually such pieces were names of the library or method being described. For example, “The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.” [15]
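The text-level rules above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the HTML-block replacement and list-item handling, which require parsed HTML, are omitted:

```python
import re

def extract_sentences(text):
    """Illustrative sketch of the sentence-splitting rules listed above."""
    text = re.sub(r"\b(e\.g\.|i\.e\.)(?!,)", r"\1,", text)  # comma after e.g./i.e.
    text = re.sub(r"etc\.(?= [a-z])", "etc.,", text)        # comma after etc. before lower case
    text = re.sub(r",\s*,+", ",", text)                     # collapse adjacent commas
    text = text.replace("\n", " ")                          # newlines -> spaces
    text = re.sub(r" {2,}", " ", text)                      # collapse adjacent spaces
    text = re.sub(r"\.{2,}", ".", text)                     # "..." -> "."
    # split on ". " or "! ", keeping the punctuation with its sentence
    return [s for s in re.split(r"(?<=[.!]) ", text) if s]
```

The comma inserted after ‘e.g.’ and ‘i.e.’ is what prevents the final split from breaking a sentence at those abbreviations, since the splitter requires the space to be directly preceded by a period or exclamation mark.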

Appendix B: Reasons for Similar Correspondences

The reason for two matched sentences to be considered similar but not equivalent could be one of the following:

  • The sentence is a rephrased version of two non-neighbouring reference documentation sentences. As a result, these sentences cannot necessarily be systematically identified and merged without advanced mechanisms to combine them coherently, efficiently, and favourably for the reader in a human-like writing style. For example, consider the highlighted sentence from the Java-I/O tutorial [26].

    The description in the reference documentation of the newByteChannel method to which it refers mentions this information in two separate non-consecutive sentences, as highlighted in the figure below [23].

    Combining these sentences to generate a coherent sentence as in the tutorial is beyond the scope of our work.

  • The sentence references, or is used in conjunction with, a specific example. Sentences that provide example-specific information are marked as Supporting Text. However, sentences that provide general information about the API in the context of an example are considered to have similar matches. The Python-I/O tutorial [14] contains one such instance:

    Our preprocessing steps result in the sentence extracted in the following format: “If you have an object x, you can view its JSON string representation with a simple line of code: CODE.” [14]

    The code snippet within this sentence, as seen in the screenshot, shows that the JSON string representation of an object x can be viewed using the dumps method. The description of the method in the reference documentation [27] states:

    Hence, replacing this tutorial sentence by its corresponding reference documentation sentence will result in a loss of the example which is integral for this sentence to provide useful information.
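    For context, the standard-library call the tutorial sentence alludes to is json.dumps, which returns the JSON string representation of an object (the object x below is our own illustrative stand-in):

```python
import json

# A hypothetical object x, standing in for the tutorial's example.
x = {"name": "example", "values": [1, 2, 3]}

# dumps returns the JSON string representation of the object.
s = json.dumps(x)
print(s)  # {"name": "example", "values": [1, 2, 3]}
```

    The sentence is only informative together with this code: the reference documentation describes what dumps returns, while the example shows the one-line usage the tutorial is teaching.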

  • It introduces a use-case for the reference API documentation. For example, the Java-URL tutorial contains the following sentence: “After you’ve successfully created a URL, you can call the URL’s openStream() method to get a stream from which you can read the contents of the URL.” [28] Here, the bold phrase describes when the openStream method can and should be used, as opposed to the corresponding reference documentation, which simply describes what the method does: “Opens a connection to this URL and returns an InputStream for reading from that connection.” [25]

  • The matched API sentence may provide excessive technical information. For example, the Java-REGEX tutorial states: “The regular expression syntax in the java.util.regex API is most similar to that found in Perl.” [15] On the other hand, the reference documentation goes into deeper detail: “The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.” [18] A tutorial author might decide to omit technical details that the reference documentation contains but from which a tutorial reader would not be expected to benefit.

Appendix C: Reasons for Unmatched Sentences

We describe the reasons for unmatched sentences in tutorials in detail below based on the ten categories listed in Table 4.

The majority of unmatched sentences in Java-REGEX provide information about the underlying topic, usually describing the general behaviour of the fundamental concept behind the API. The definition and description of a regular expression, its syntax, the behaviour of special characters, and definitions of related terminology are examples of this category that we found in the tutorial but not in the reference documentation. For example, the following sentence defines a set of methods having similar functionality: “Capturing groups are a way to treat multiple characters as a single unit.” [29]

We discovered that usage, i.e., general information on how an API is expected or intended to be used, is the most frequent category of unmatched sentences in Java-URL, Python-URL, and Java-I/O. The Java-URL tutorial recommends how to handle a MalformedURLException: “Typically, you want to catch and handle this exception by embedding your URL constructor statements in a try/catch pair, like this: CODE.” [17]

In Python-REGEX, the most common unmatched sentence category is internal behaviour, with 30% of the sentences describing such information. For example, the following sentence was found in the tutorial, but not in the reference documentation: “Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.” [12]

A surprising 52% of unmatched sentences in Python-I/O describe API behaviour. Since descriptions of the way in which an API component performs can be expected to appear in the reference documentation, this finding is of particular interest. For example, the following sentence describes a particular behaviour of the read method on a file object: “If the end of the file has been reached, will return an empty string (‘’).” [14]
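The behaviour that sentence describes is the standard semantics of read() on a Python file object, illustrated here with an in-memory stream standing in for an open file:

```python
import io

# An in-memory text stream stands in for an open file object.
f = io.StringIO("line one\n")

first = f.read()   # reads the entire remaining content
second = f.read()  # end of file reached: returns an empty string

print(repr(first), repr(second))  # 'line one\n' ''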

We found that sentences describing specific use-cases in which the API could be or is intended to be used, usually with the intention of motivating and justifying the usefulness of the API, were also not matched with reference documentation. This sentence from Java-REGEX is one such example: “The split method is a great tool for gathering the text that lies on either side of the pattern that’s been matched.” [16]

Sentences regarding the performance of the API, in terms of efficiency and scalability, can also be observed in the tutorials, but their matches could not be found in the reference documentation. The following sentence from Python-I/O is an example: “This is memory efficient, fast, and leads to simple code: CODE.” [14]
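The idiom that sentence praises is iterating over the file object itself, which yields one line at a time rather than loading the whole file into memory; a minimal illustration with an in-memory stream:

```python
import io

# Iterating over a file object yields one line at a time, so memory use
# stays constant regardless of file size -- the property the tutorial
# sentence describes as memory efficient and fast.
f = io.StringIO("first line\nsecond line\n")  # stands in for an open file
lines = []
for line in f:
    lines.append(line.rstrip("\n"))
```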

We observed that some sentences providing version information and backward compatibility were not matched in the reference documentation. One example of a sentence providing information regarding the content of a particular version is this sentence in Java-REGEX: “As of the JDK 7 release, Regular Expression pattern matching has expanded functionality to support Unicode 6.0.” [30] This is a surprising finding because deprecation and enhancement information is generally specified in the reference documentation, in order to caution developers about API components that are no longer supported, or to introduce them to new ones. We expect that this kind of information can be found in version release notes, and we leave the exploration of this documentation type to future work.

Some of our observations are unique to the Java-I/O documentation. We theorize that this is likely due to its length and diverse range of sub-topics, which provide greater scope for variation in writing style. Only in this documentation did we find unmatched sentences providing environment- and platform-specific information, API support information, and input configuration details. While describing the typical syntax of a file location, the documentation provides the following platform-specific information: “In the Solaris OS, a Path uses the Solaris syntax (/home/joe/foo) and in Microsoft Windows, a Path uses the Windows syntax (C:∖home∖joe∖foo).” [31] Another sentence describes whether a file system may be able to support the API components provided: “A specific file system implementation might support only the basic file attribute view, or it may support several of these file attribute views.” [22]

Sentences containing input configuration details are those that describe the structure of the input to an API. For example, in the Java-I/O tutorial, the width is an element of the format specifier in the format API. The sentence provides the following information about width: “By default the value is left-padded with blanks.” [32] The default behaviour of this element of the format specifier is not mentioned in the reference documentation.

We also identified one sentence describing a method for which the corresponding description in the API documentation was not descriptive enough to be considered a match. While the tutorial states “visitFile - Invoked on the file being visited.” [33], the description of the visitFile method in the reference documentation is simply “Invoked for a file in a directory.” [34] Although both sentences provide little explanation, the tutorial clarifies that this method is invoked when a file is visited, which the reference documentation does not. We chose to treat this as an anomaly and not to categorize it as a separate unmatched category because of its low occurrence. Further, we found two more instances of descriptions in the reference documentation that could have been matches for a tutorial sentence but were either incomplete or unclear in their explanation. We consider both cases as implied matches because their meanings can be deduced given familiarity with the API. One example is shown below:

It is important to note that these categories are not exclusive to unmatched sentences. There may be sentences which are matched to reference documentation that provide information on these categories. We leave the detailed comparison of documentation at the category level to future work.

Appendix D: References to Documentation Sources

Below is the list of web URLs referenced in this paper. In the case of snippets of documentation used as examples, the corresponding URL defines the particular file in which the example text can be found.

  [1]
  [2]
  [3]
  [4]
  [5]
  [6]
  [7]
  [8]
  [9]
  [10]
  [11]
  [12]
  [13]
  [14]
  [15]
  [16]
  [17]
  [18]
  [19]
  [20]
  [21]
  [22]
  [23]
  [24]
  [25]
  [26]
  [27]
  [28]
  [29]
  [30]
  [31]
  [32]
  [33]
  [34]
  [35]


Cite this article

Arya, D.M., Guo, J.L.C. & Robillard, M.P. Information correspondence between types of documentation for APIs. Empir Software Eng 25, 4069–4096 (2020).



  • Software documentation
  • Application programming interface
  • Qualitative analysis
  • Exploratory study