Translation of programs written in C/C++ for the Web has been a recent topic of interest in various fields of research. Compilation of audio tools has been examined by Letz et al. and Zbyszyński et al. [8, 9], whereas a distributed evolutionary algorithm has been investigated by Leclerc et al. [10]. Furthermore, the potential of using Rust instead of other systems programming languages is another emerging recent topic. Rust has been shown to produce efficient code for the implementation of garbage collectors reducing at the same time the programmer error surface [11, 12]. Combination of both is possible, and we were motivated to examine the application of these technologies in junction with modern development trends in OSS, in order to improve the state of C/C++ software discoverability. The rest of this section provides the background information in order to justify our thought process, as well as the technologies chosen.
2.1 JavaScript in Open Source Projects
JS is one of the most popular programming languages intended to facilitate interaction with the user in web applications [3]. More recently, it has been employed in server-side infrastructure for distributed services, in cross-platform applications targeting stationary and mobile users alike, but has also been used as a compilation target for a plethora of programming languages [13]. In our previous work, we have examined the aforementioned aspects of JS by implementing a full-featured high-performance cross-platform social application and distributed service, with only OSS components, and evaluated the benefits and implications of our choices [14,15,16]. This endeavor was only made possible due to the wide spectrum of focused OSS modules, easily accessible via the npm repository. The word “module” is used to describe building blocks, usually performing a single task, leading to composable, instead of monolithic, design patterns. According to Modulecounts [5], a service monitoring language for module repositories, npm averages 697 new modules/day, followed by Packagist (PHP) with 136 modules/day and Maven Central (Java) with 100 modules/day. It is noteworthy, that npm module submissions follow an exponential growth curve, clearly outpacing all the other repositories. Furthermore, TF Bissyandé et al. have studied 100,000 OSS projects hosted on GitHub [17]. Their findings exhibited that JS was the programming language that appeared the most frequently in multi-language projects. The Node.js platform is such a multi-language project, where its components are written in both JS and C/C++.
Node.js Architecture.
The “Applications/Modules” space is where all JS project files reside. They may make either direct, or indirect use of precompiled foreign code. Additionally, required external dependencies, which are reused in the application’s logic, belong in this space as well. Although JS has positive aspects, Node.js has to rely upon compiled code to perform I/O operations [18]. The runtime’s garbage collector (GC) abstracts the, potentially error prone, manual dynamic memory management, but there is no thread-safety mechanism present when performing I/O, making memory related faults, and race conditions possible [19]. Furthermore, failure points may be present in the underlying foreign compiled code, propagating to the higher levels, and leading to unexpected behaviors and faults. Node.js utilizes C/C++ libraries internally, in order to provide access to the operating system resources. Such libraries provide efficient solutions to all I/O related operations included as core functionalities. For instance, the libuv project provides event-loop and asynchronous I/O access [20]. Other examples of libraries used internally by Node.js are c-ares, zlib, and OpenSSL. The “C/C++ Bindings” space is where the functionality of such libraries is exposed via the core JS Application Programming Interface (API). Some examples in this space are the
os
,
fs
,
net
and
http
modules.
Addons in Node.js refer to libraries and their corresponding bindings, which are not included in the core modules [21]. They are usually written in C/C++, in order to extend Node.js’ functionality, or provide performance gains, when a JS implementation is found to be lacking. For instance, μWebSockets is a popular WebSocket protocol implementation for Node.js, which out-performs all known pure JS implementations [22]. Moreover, libxmljs provides bindings to the popular XML parsing C library, libxml, fulfilling Node.js lack of XML support [23].
Although addons have capabilities to extend Node.js, they require knowledge of how the V8 engine works and their implementations must be fine-tuned, in order to avoid locking the main thread JS executes, duplication of memory allocations, data races, memory faults, etc. Another method of interfacing shared objects with Node.js is via the
ffi
module. It involves no elaborate setup, at the cost of highly reduced performance on high Input/Output systems. We theorize that writing and publishing Node.js bindings, may pose an opportunity for undiscoverable C/C++ OSS to receive exposure and be collaboratively improved, due to the massive reuse potential in Node.js projects. Unfortunately, their implementation is connected to a performance-productivity trade-off.
2.2 Challenges in C/C++ OSS
OSS implemented in the popular systems programming languages C and C++ predates the proliferation of cloud computing, which enabled OSS to flourish. There is evidence of the inherent inflexible codebase componentization in the amount of build systems available. Some well-known examples are CMake [24], qmake [25], SCons [26], and GYP [27], with the latter being used by Node.js. Legacy codebases that are not using such systems gradually degrade in maintainability, as observed by Dayani-Fard et al. [28]. In contrast, most prevalent programming languages in OSS implement some form of enforced conventions, as well as a queryable directory or repository, containing the corresponding metadata for every published project, e.g. via mvn for Java [29], npm for JS [4], gem for Ruby [30], pip for Python [31], etc. Those enforced conventions serve as guidelines to interact with code repositories, document, license, test, build, distribute, etc., features which may exist for C/C++ in the form of various third-party tools, but are incapable of providing the cohesion needed across OSS. The lack of advocated methodology in C/C++ OSS is more apparent in legacy projects, with many of them accessible only through manual pursuit via web search-engines. Downloading arbitrary dependencies by-hand, extracting, copying-and-pasting, figuring out the right compiler flags, are not uncommon practices.
Finally, inconsistent dynamic memory management is common in even mainstream utilities. For example, the GNU tool
ls
, has been known to leak memory, and is considered as a non-issue by its maintainers [32]. Although in its intended use, the operating system would handle the leak, using the library unknowingly of that issue in a persistent system, e.g. a server, would pose a threat to robustness. Moreover, lack of C/C++ programming experience may cause integer overflow/underflow leading to “undefined behavior”. Type safety is not guaranteed by C/C++, and although programs may not exhibit type errors, undefined behaviors are incorporated in the standard specification, leading compilers to produce unspecified results and also to allow the program to do practically anything. A simple example of iterator invalidation, leading to undefined behavior is demonstrated in the following listing.
If the contents of a container that is being iterated over are destroyed, the program is led into undefined behavior. This is an example of a perfectly valid code from a C++ compiler’s perspective, capable of halting the project using it. Such cases of undefined behavior have already been investigated in depth [33]. Memory safety is set at risk by null-pointer dereferences (
NULL
in C and
nullptr
in C++) that cause programs to crash, dangling pointers allowing access to heap allocated resources that have not lived as long as they had to, and buffer overruns allowing the program to access elements before the start or beyond the end of an array [34]. Malicious software has been taking advantage of the way C and C++ programs handle memory and exploiting bugs in the code. Hence, OSS discovered in the wild may propagate unwanted effects to derivative projects.
2.3 Rust: A Young Contender in OSS
Rust was created in order to address the challenges presented in Sect. 2.2. It follows the C++ philosophy of zero-cost abstractions and takes a step further, by incorporating memory-safety and data-race free concurrency without the need for a GC [35]. This is accomplished by statically tracking ownership and lifetimes of all variables and their references. The ownership system enables Rust to automatically deallocate and run destructors on all values immediately, when they go out of scope and prevents values from being accessed after they are destroyed. Rust applies some established techniques from academia, e.g.
enums
as algebraic data types, common in the ML family of languages, and
traits
, which enable polymorphism similar to Haskell’s type classes. Similarly to JS, both procedural and functional paradigms are used, as examined by Poss [36].
Rust is available on GitHub, where all parts of the compiler and tooling are accessible for contributions [37]. The integrated command line application
cargo
serves as a complete project management tool. It is capable of instantiating new projects, building for various architectures, managing dependencies, testing, producing documentation, and more. Furthermore,
cargo
is responsible for enforcing the practices that enable OSS to be discoverable and maintainable. Interaction with C APIs is free of overheads, and the binaries produced can be called from C with no setup. As Rust utilizes LLVM for machine code emission, we were triggered to examine the possibility of utilizing this system for bridging the gap between C/C++ codebases and modern OSS development practices.
2.4 Compilation to JavaScript and WebAssembly
A strict subset of the JS programming language was designed as a compilation target, in order to allow for translation of programs written in other languages. It became known as asm.js, and the Emscripten compiler was created by Alon Zakai in 2011. This language is statically compiled and has shown near native performance [38]. One of the key benefits to its adoption was that even if a Web browser had not implemented optimizations for the subset, it could still run on every JS interpreter. That is one reason it is still in use as a fallback from the newer WebAssembly specification [39]. WebAssembly is a portable size and load-time efficient format, suitable for compilation to the Web, implemented in all major Web browsers, but still under the process of standardization via the W3C WebAssembly Working Group. It features language and platform independence, safe execution, and has shown promising performance gains of up to 5.89 times when replacing JS components with Rust code in real-use parsing scenarios [40]. Currently not all features have been finalized and garbage collection, threads, SIMD, etc. are in progress. It is relevant to both JS and C/C++/Rust OSS, because the combination of those technologies may solve the portability issue in the dissemination of multi-language OSS.