1 Introduction

A command-line interface, also called a shell, is a textual interface that allows users to interact with the underlying operating system by issuing commands. Expert users, such as system administrators, software developers, researchers, and data scientists, routinely use the shell as it affords them flexibility and the ability to compose multiple commands. They perform a variety of tasks on their systems including navigating and interacting with the filesystem (e.g., ls, mv, cd), using version control (e.g., git, hg), installing packages (e.g., apt-get, npm), or dealing with infrastructure (e.g., docker). Experts can adapt and play with a multitude of commands and arguments, chaining them together to create more complex workflows. This versatility comes at a cost that conflicts with a well-known user interface principle, recognition over recall (Nielsen 2005b): users have to recall the particularities of syntax and argument combinations, instead of being able to recognize a familiar symbol (as in graphical user interfaces).

A way for these experts to introduce recognizability and customize their command-line experience is to attach distinct names to potentially convoluted, but frequently used, command and argument structures, as well as workflows expressed as compositions of commands. This can be achieved by defining shell aliases. An alias substitutes a given name, the alias, with a string value that defines an arbitrarily complex command (or chain of commands). The set of aliases users define provides a window into their preferences expressed as part of their personal configuration. Many users publicly share these configurations on social coding platforms such as GitHub, contributing to a collective knowledge of command-line customizations, which can provide insight into the tasks that expert users repeatedly perform and how well the standard environment supports those tasks.

1.1 Contribution

We see our large-scale analysis of command-line user customizations manifested in alias definitions as a unique window of opportunity to study how the standard environment of the command line could be productively extended, modified, and improved. Our work goes hand in hand with existing efforts to innovate on the experience of command lines that employ techniques from research in systems (Raghavan et al. 2020; Handa et al. 2021), software engineering and programming languages (Vasilakis et al. 2021; Vasilakis et al. 2020; D’Antoni et al. 2017), human-computer interaction (Vaithilingam and Guo 2019; Gandhi and Gandhi 2020), and artificial intelligence (Agarwal et al. 2020; Lin et al. 2018; Hou et al. 2021). In particular, our extensive qualitative and quantitative analysis, in conjunction with our dataset, forms the basis for identifying opportunities for improving command-line experience in the following directions: by characterizing customization practices, we gain a categorical understanding of the needs and wants of command-line users; based on our analysis, we identify opportunities for innovation and formulate them as implications, accompanied by concrete scenarios and examples; further, our comprehensive dataset provides a foundation for learning approaches, as part of learning-based program synthesis (Bruch et al. 2009; Raychev et al. 2014), automated repair (Monperrus 2018), and recommendation systems (Mens and Lozano 2014); finally, we also see our results and datasets as a basis for usability research that can impact the design of tools and the future of the shell in general.

We summarize the work in this paper as follows:

  • We identified nine Customization Practices, grouped into three high-level themes: Shortcuts introduce new names. They can be used for nicknaming commands (and correcting misspellings in the process), abbreviating subcommands like git push, and bookmarking locations for quick navigation. Modifications change the semantics of commands. We can use these types of aliases for substituting commands, such as replacing more with less, for overriding defaults to customize commands to personal contexts, which often involves colorizing output, and also running certain commands as root by elevating privilege. Aliases that combine multiple commands are Scripts. They enable many ways of transforming data using Unix pipes, and allow for automating repetitive workflows by chaining subcommands.

  • A Curated Dataset of Command-Line Customizations, consisting of over 2.2 million shell aliases collected from GitHub. We view our dataset as a playground for fine-grained discovery that can benefit researchers, tool-builders, and command-line users; for example, researchers can use this knowledge base to discover which commands are frequently used together and how they are combined, while tool-builders can see how their programs are being customized. We also describe the effective mining technique we used to distill this knowledge, which allowed us to capture almost the whole population (94.09%) of relevant shell configuration files.

  • We formulate Implications for Improving Command-Line Experience that go beyond single customization practices to address shortcomings and tie them to existing user experience research. Codifying emergent behavior (Fast et al. 2014) found in our customizations enables learning repair rules and discovering workflows. We are able to uncover conceptual design flaws, where customizations indicate frustrations with underlying command structures, supporting prior research on potential flaws in the conceptual design of certain commands (Perez De Rosso and Jackson 2013). Based on the prevalence of highly variable command redefinitions, we propose contextual defaults, the ability to suggest different command preferences based on user context (Stefanidis et al. 2011). Overall, we find that many customizations deal with the tension of Interactivity vs Scripting: commands being used to interactively navigate systems, while at the same time being used within scripts for batch-processing.

We now describe usage and syntax of aliases as a vehicle for customization. We further describe our data collection and coding process, followed by a presentation of customization practices. Finally, we discuss implications for usability and review related work in the broader context of this study.

2 Background

A shell is a command interpreter allowing the user to interact with an underlying system. The concept of the operating system shell as an independent process executing outside the kernel originated in Multics (Pouzin 1965) and was further developed into the original Unix shell sh and its various descendants (Jones 2011; Seibold 2020). The POSIX family of standards defines a Shell Command Language (IEEE and The Open Group 2018; Greenberg 2017), whose standard implementation is still the sh utility, but there exist a wide variety of popular POSIX-compliant shells like bash or zsh. These implementations are free to extend the functionality of the shell, but all share a common subset of core commands and programming language constructs. In this paper, we focus on the built-in alias command, available on all POSIX shells.

2.1 Usage and Syntax

The alias command allows the user to create alias definitions, defining command substitutions. When the shell processes the command line, it replaces known alias names with their defined string values. For example,


alias ll='ls -l'

defines the alias name ll, which is replaced by the alias value ls -l. In this case, ls is the standard command for listing directory contents, with the argument -l specifying a long-form output format. So the alias ll (present in many system configurations) is used to specify a default argument to a commonly used command under a different name.

Alias values can be arbitrarily complex strings and can substitute not only simple commands and arguments, but whole chains of commands. The definition


alias ducks='du -cksh * | sort -hr | head -n 15'

defines the new command ducks by chaining together three different command-line tools in order to return the 15 largest entries (files and directories) in the current directory.

In general, an alias definition takes the form


alias name=value

where value can optionally be enclosed in single (') or double (") quotes and name can be any identifier that is a valid command name (Footnote 1).

In particular, the alias name can be an existing command, so a re-definition like


alias grep='grep --color=always'

is possible.

In the remainder of this paper, we will use the more compact notation a → b to indicate an alias that replaces the name a with the value b.

2.2 Dotfiles

Aliases can be entered directly on the command line, in which case they are valid until the shell session ends. To make an alias definition permanent, it is common practice to enter it into a file that is read and executed by the shell on startup. The names of these configuration files differ by shell, but common ones are .bashrc, .zshrc, or .profile and their main difference is the order in which they are executed (Footnote 2). Often, aliases are also stored in other files referred to by these startup scripts.

These kinds of files (text-based configuration files that store system or application settings) are also known as dotfiles, because their filenames usually start with a dot (.) so that they are hidden by default on most Unix-based systems. In recent years, people have started sharing their dotfiles on platforms like GitHub (Footnote 3). This has the advantage of being able to sync one’s configurations across different machines, and also enables exchange and discovery of configurations between users.

3 Dataset

Our analysis is based on 2204199 alias definitions found on GitHub, collected over a period of two-and-a-half weeks from December 20th 2019 to January 8th 2020.

3.1 Data Collection

Alias definitions can appear in any Shell script, but we anticipated that they would predominantly be found in personal configuration files (like .bashrc or .bash_profile).

Unfortunately, this rules out using some prominent existing datasets for our study (Mombach and Valente 2018): The public GitHub archive on BigQuery (Footnote 4), while containing over 1.5 TB of source code, only includes “notable projects” (presumably those with a certain number of stars on GitHub) that additionally have an explicit open source license. This leaves out many of the repositories we are interested in, as users sharing configuration scripts for personal use do not usually add a license file and their repositories are generally not “notable”. GHTorrent (Gousios et al. 2014), another popular archive of GitHub data, only contains metadata but not file contents.

Therefore, we found it necessary to write our own tooling to collect the data directly from GitHub. We used the GitHub Code Search API (Footnote 5) to find files written in Shell language (Footnote 6) that contain the string alias.

Alas, the GitHub Code Search API comes with its own set of limitations:

  1. only files smaller than 384 KB are searchable

  2. forks are not included

  3. requests are rate limited at 30 per minute and there are additional opaque abuse detection mechanisms that impose further restrictions in an unforeseeable manner

  4. the number of results is limited to 1000 per search request

The first two limitations do not really affect us, as we are interested in smaller files and do not have to consider forks. The rate limiting, while significantly slowing down the retrieval process, is also not a fatal obstacle. The maximum number of returned search results, however, is a critical limitation. To get around it, we wrote a Python tool called github-searcher (Footnote 7) that uses a clever sampling strategy to vastly increase the number of results we are able to retrieve.

The sampling strategy is based on the GitHub API allowing code search queries to be conditioned on file sizes. For example, the query


alias language:Shell size:101..200

returns up to 1000 Shell language files containing the string “alias” that have a file size between 101 and 200 bytes (inclusive). Repeating the search with


alias language:Shell size:201..300

returns up to 1000 files of a size between 201 and 300 bytes, and so on. Repeatedly searching with the same search term but different non-overlapping file size ranges allows us to significantly increase our sample of the overall population. Another trick further improves on this: the API gives us an option to sort the results by most or least recently indexed; if we run a search using a specific sort order, then we can effectively double the sample size by repeating the same search with the opposite sort order. Thus we can get up to 2000 results per search per file size range.
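The size-partitioned sampling described above can be sketched as follows. This is a minimal illustration and not the github-searcher implementation: the function name build_search_requests and its step parameter are hypothetical, and the sketch only constructs the query parameters (q, sort, order) for each size range, without sending any request or handling rate limits.

```python
def build_search_requests(term, max_size, step):
    """Build code-search parameters for consecutive, non-overlapping size ranges.

    Every size range is requested twice, once per sort order, so that up to
    2000 results (2 x 1000) can be collected for each range.
    """
    requests = []
    for low in range(1, max_size + 1, step):
        high = min(low + step - 1, max_size)
        for order in ("asc", "desc"):
            requests.append({
                # e.g. "alias language:Shell size:101..200"
                "q": f"{term} size:{low}..{high}",
                "sort": "indexed",  # order results by when they were indexed
                "order": order,
            })
    return requests


if __name__ == "__main__":
    for params in build_search_requests("alias language:Shell", 300, 100):
        print(params)
```

With a step of 100 bytes and an upper bound of 300 bytes, the sketch yields six requests: the ranges 1..100, 101..200, and 201..300, each in ascending and descending index order.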

Additionally, while GitHub does not allow us to retrieve more than a limited number of files per query, it does return the total count of files matching the query. While this count is usually very erratic on broad searches, fluctuating wildly between repeated requests, it turns out to be fairly accurate for searches with a small number of results, such as those conditioned on a narrow range of file sizes. This allows us to get a good estimate of the population, and how accurately our sample approximates it.

For this study, using the search term


alias language:Shell

and the sampling strategy described above, we started by sampling all files in increments of 100 bytes and stopped when we reached 29 KB, about ten times the median file size of the estimated population encountered so far. We then re-sampled some high-population areas with smaller size increments in order to get a better sample, in some cases sampling in increments of 1 byte. In total, we collected 844140 files from 304361 GitHub repositories. Our sample represents 94.09% of the estimated population of 897182 files under 29 KB on GitHub written in Shell language and containing the word “alias”. The file contents, together with repository metadata, were stored in an SQLite database. After removing duplicate files based on their SHA-1 hash value, our database contains 372816 unique files from 205126 repositories.
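The coverage figure and the deduplication step can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the input format of deduplicate (pairs of path and raw file contents) is an assumption rather than the layout of the actual SQLite database.

```python
import hashlib


def coverage_percent(collected, estimated_population):
    """Share of the estimated population that the sample covers, in percent."""
    return 100.0 * collected / estimated_population


def deduplicate(files):
    """Keep the first file seen for each distinct SHA-1 digest of its contents.

    `files` is an iterable of (path, content_bytes) pairs (assumed format).
    """
    unique = {}
    for path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        unique.setdefault(digest, (path, content))
    return list(unique.values())


if __name__ == "__main__":
    # Numbers taken from the text: 844140 collected files, 897182 estimated.
    print(round(coverage_percent(844140, 897182), 2))
```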

3.2 Parsing

After collecting files with potential aliases, we ran a parsing script to find actual alias definitions and decompose them into their constituent parts for analysis. The decomposed aliases are stored in the same SQLite database as the raw file contents to facilitate easy cross-referencing. The database schema is given in Fig. 1.

Fig. 1 Relational database schema

The parser is a Haskell script that splits each alias definition into alias name and alias value, and tokenizes the value into commands and arguments. Commands can be delimited by the shell operators for piping (| and |&), logical composition (&& and ||), background execution (&) and simple chaining (;). Arguments are separated by whitespace, but care is taken to handle quoted arguments correctly. For example, echo "hello world" is parsed as one command (echo) with one argument ("hello world"). See Fig. 2 for a more elaborate example.

Fig. 2 Decomposition of alias ips="ifconfig | grep 'inet'| cut -d' ' -f2"

Beyond quoting, which is defined by the Shell Command Language and thus uniform across all commands, the parser cannot make any further considerations as to how arguments are meant to be interpreted. While there are some conventions around command-line argument handling, programs are generally free to do as they wish and there is a wide variety of argument styles in the wild: single-dash short arguments combined with double-dash long-form arguments (e.g., ls -l -a --color=always); combined short arguments without a dash (e.g., tar xvzf archive.tar); dictionary-style arguments (e.g., dd if=/dev/zero of=/dev/sda); subcommands (e.g., git commit -m "wip"); and many more. Since the parser cannot know the intentions of any command, it simply treats each token as a separate argument. There is one exception: if the command is sudo, then its first argument is taken as the real command. For example, sudo apt-get install is parsed as the command apt-get with argument install and the sudo flag set.
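The decomposition described in this section can be sketched in Python. This is an illustration under simplifying assumptions and not the Haskell parser used for the study: all function names are hypothetical, backslash escapes are ignored, and redirections such as 2>&1 are not treated specially (the & would be read as an operator).

```python
# Two-character operators are listed first so that "&&" is not read as "&" twice.
OPERATORS = ("&&", "||", "|&", "|", "&", ";")


def split_commands(value):
    """Split an alias value into command strings at unquoted shell operators."""
    commands, current, quote, i = [], "", None, 0
    while i < len(value):
        ch = value[i]
        if quote:                      # inside quotes: copy verbatim
            current += ch
            if ch == quote:
                quote = None
            i += 1
        elif ch in "'\"":              # opening quote
            quote = ch
            current += ch
            i += 1
        else:
            op = next((o for o in OPERATORS if value.startswith(o, i)), None)
            if op:
                commands.append(current)
                current = ""
                i += len(op)
            else:
                current += ch
                i += 1
    commands.append(current)
    return [c.strip() for c in commands if c.strip()]


def split_arguments(command):
    """Split one command at unquoted whitespace, keeping quotes in the tokens."""
    tokens, current, quote = [], "", None
    for ch in command:
        if quote:
            current += ch
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            current += ch
        elif ch.isspace():
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def parse_alias_value(value):
    """Decompose an alias value into commands, arguments, and a sudo flag."""
    parsed = []
    for command in split_commands(value):
        tokens = split_arguments(command)
        sudo = tokens[0] == "sudo" and len(tokens) > 1
        if sudo:                       # the first argument of sudo is the real command
            tokens = tokens[1:]
        parsed.append({"command": tokens[0], "arguments": tokens[1:], "sudo": sudo})
    return parsed


if __name__ == "__main__":
    for part in parse_alias_value("ifconfig | grep 'inet'| cut -d' ' -f2"):
        print(part)
```

Applied to the value from Fig. 2, the sketch produces three commands (ifconfig, grep, cut), with the quoted space in -d' ' kept inside a single argument.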

After parsing, we ended up with 2204199 alias definitions, broken down into 2534167 commands and 3630423 arguments. Files that did not contain any aliases were removed from the database, as was repository metadata that only referenced files without aliases. 194218 files from 138112 repositories, or 52.09% of the original sample without duplicates, contained aliases.

3.3 Provenance

The majority of aliases in our dataset (85.74%) originate from common startup scripts, like .bashrc, aliases.zsh or .profile (see Table 1). We found another 2.78% of aliases originating from scripts related to Git, with file names like git.plugin.zsh or git.bash. The remaining aliases are more or less evenly distributed among a variety of file names, none of which contributes more than half a percent of aliases, in most cases significantly less. The average number of aliases per file is 11 ± 18, the median is 6.

Table 1 Distribution of common file names

Table 2 shows the most commonly occurring words in repository descriptions on GitHub (excluding stop words), together with the number of aliases found in repositories whose descriptions contain at least one of these words. Counting them all together, repositories mentioning any of the words listed in Table 2, in either their description or their repository name, make up 74.48% of the repositories in our dataset, contributing 82.3% of all aliases. It is notable that more than half of the repositories in our dataset (51.08%) have a name that includes the string dot, as in dotfiles, dot-files, dots, mydotfiles, and so on. Looking at these names and descriptions, we can see a clear bias towards personal configurations and settings management. On average, each repository contributes 16 ± 28 aliases, the median is 8.

Table 2 Most common words in repository descriptions

3.4 Reproducibility

To enable reproducibility and follow-up studies, we have made all data and our entire tool-chain publicly available. Our dataset (1.45 GB of parsed alias definitions, plus 4.3 GB unparsed file contents and metadata) is available on Zenodo (Footnote 8). The parsing script and the executable Jupyter notebooks, containing all SQL queries and additional Python code used during our analysis, are available on GitHub (Footnote 9).

4 Analysis

Table 3 shows the most common alias names, commands, and arguments appearing in alias definitions. The most common alias name we found is ls, appearing a total of 83782 times, which is 3.8% of all alias definitions. Note that this is ls as an alias name, a redefinition of the ls command, which itself appears 260156 times (10.27%). This is a bit less often than git, the most common command, which appears in 327786 aliases (12.93%). The most common argument, across all commands, is --color=auto, appearing 153931 times (4.24%).

Table 3 Top alias names, commands and arguments

Looking at each part of an alias definition in isolation can only get us so far. Arguments only gain meaning in conjunction with commands, and alias names can be identical between users, referring to the same command/argument combination, or can overlap, meaning the same alias name is used differently by different users. Table 4 gives a more informative view for the top two commands, git and ls, showing us the top arguments given with each and the most common alias names by which the command/argument combinations are referred to. Here we can already identify some of the typical alias use cases. Looking at ls, we find that aliases are used to redefine the command with a default argument (ls → ls --color=auto); to shorten a common invocation (ll → ls -alF); and to correct a spelling mistake (sl → ls). We also notice that in the case of git, most aliases are used for shortening git subcommand invocations (e.g. gd → git diff).

Table 4 Top two commands with top arguments and aliases

4.1 Inductive Coding

To capture the range of patterns and use cases for which aliases are defined, we analyzed the dataset using inductive coding, a classic technique for qualitative data analysis (Saldaña 2016; Thomas 2006; Dey 2003). Inductive coding is used when conducting exploratory research without prior expectations on themes in the data. The individual data points—in our case, alias definitions—are labelled with descriptive tags which try to capture the essence of the datum for later purposes of categorization. It is an iterative process between theoretical sampling and comparing data within emerging themes, continuing in cycles until no new themes emerge.

Since manually coding the entire dataset is infeasible, we developed our themes by coding a representative sample. For this sample, we gathered the top three most common aliases for the top ten most common arguments for the top 50 commands (cf. Table 4), resulting in 1381 alias definitions, directly covering 28.77% of the dataset. Additionally, we drew a random sample of 200 alias definitions from the long tail of unique aliases. These are aliases that each occur only once in the entire dataset, making up 27.53% of all aliases. The commands that occur in this long tail are distributed in roughly the same manner as the commands in the whole dataset, the top commands being cd, git, ssh, ls, and vim. Unique aliases often contain user-specific file system paths (e.g. gitbash → source /Users/j/mybin/gitsh), happen to have a unique combination of arguments (e.g. ls → ls -GphF) or are otherwise highly particular (e.g. h23 → history -23000).

In total, we looked at 1581 aliases during the coding process. In order to reason about the intent of any particular alias, we had to take the semantics of each command into account, consulting their man pages and other forms of documentation (Footnote 10). To increase the trustworthiness of our codes, coding was performed independently in parallel by the two authors. After a first iteration, we compared our labels, consolidating different naming conventions. In subsequent iterations, we identified ways of formalizing the emerging categories, i.e. constructing automated mechanisms for classifying alias definitions as belonging to certain categories. The suitability for mechanical classification was an important factor for the viability of any emerging themes. The discussion of these formalizations additionally served to establish a better shared understanding. Ultimately, we reached a saturation point at which further coding and analysis did not lead to further insights.

5 Customization Practices

We identified nine customization practices among three types of aliases: Shortcuts introduce new names and are often used for nicknaming commands, abbreviating subcommands, and bookmarking locations; Modifications change the semantics of commands by substituting commands, overriding defaults, colorizing output, and elevating privilege; and Scripts combine multiple commands, often for the purposes of transforming data or chaining subcommands. We developed automated classification methods for each practice, which can be found in our replication package. Table 5 gives a quantitative overview of the prevalence of each of these practices in the dataset. Any alias can be an expression of multiple customization practices at once, and some practices only occur with certain commands. Table 6 breaks down the customization practices by command, counting the number of aliases that a command is involved in (including aliases that redefine the command).

Table 5 Alias types and customization practices
Table 6 Customization practices broken down by command

We will now discuss the alias types and customization practices in more detail.

5.1 Shortcuts

The most obvious use of an alias is to give a complex expression a short and/or memorable name. The average length of an alias name is 4.3 characters, whereas the average length of an alias value is 23.7 characters. If we divide the length of an alias value by the length of the alias name, we get the compression ratio of the alias. For example, the alias gs → git status has a compression ratio of 5. Figure 3 shows the distribution of compression ratios over all aliases in the dataset. The median compression ratio is 4.25, meaning half of all alias values are at least four times as long as their alias names. A compression ratio less than 1 indicates a name that is longer than the value it aliases.

Fig. 3 Distribution of alias compression ratios
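The compression ratio is simple to compute. The following sketch uses hypothetical function names and a handful of example aliases taken from this paper; it is not the code from the replication package.

```python
from statistics import median


def compression_ratio(name, value):
    """Length of the alias value divided by the length of the alias name."""
    return len(value) / len(name)


if __name__ == "__main__":
    aliases = [
        ("gs", "git status"),
        ("ll", "ls -l"),
        ("grep", "grep --color=always"),
    ]
    ratios = [compression_ratio(name, value) for name, value in aliases]
    print(ratios)
    print(median(ratios))
```

A ratio above 1 means the alias saves typing; a ratio below 1 means the name is longer than the value it stands for.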

There are 26055 aliases (1.18%) with names longer than their values. The two longest alias names we found are from joke definitions. The first is 1772 characters long and consists of the letter ‘f’ repeated 1053 times, followed by the letter ‘u’ repeated 719 times. It is an alias for the cat command with a similarly named file as an argument. The second longest alias name is a Swedish compound word of 131 characters (Footnote 11), aliasing the ls command.

On the other end of the spectrum, an alias named line echoes 23635 dashes, achieving a compression ratio of 5911, the highest among all aliases. The second highest comes from an alias named BEEP, which invokes the Linux beep utility 9 times in succession, with a combined 4471 arguments. When executed, it appears to play Daft Punk’s 2001 instrumental single Aerodynamic.

Beyond just compression and expansion of strings, we can see a few distinct customization practices related to naming.

Nicknaming Commands

There are 244872 aliases in our dataset (11.11%) that merely give a new name to a command, without adding any arguments, and without the name belonging to a different command (that would be a substitution, see below). The most frequent nicknames are g → git, c → clear, h → history, and v → vim. Almost all (93.03%) of these kinds of aliases introduce a nickname that is shorter than the command they are referring to, and about half (50.58%) introduce a name that is only one or two characters long.

A special case of nicknaming occurs when the new name is a common misspelling of the command. In this case, the alias acts like an autocorrect mechanism, as in got → git. To determine instances of these typographical errors, we surveyed and experimented with different string distance measures (Navarro 2001) and decided on using the Damerau-Levenshtein distance (Damerau 1964).

We determined empirically that a distance of 2 is a good threshold for deciding whether or not an alias corrects a misspelling. We found 9195 aliases (0.42%) that serve as autocorrect rules, most commonly involving transposition (grpe → grep), case-sensitivity (Jupyter → jupyter), localization (pluralise → pluralize), and punctuation (docker-build → docker_build).
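To illustrate the kind of distance measure involved, the sketch below implements the optimal string alignment variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and transpositions of adjacent characters. It is an assumption that this variant matches the one used in the study. The helper is_misspelling and its max_distance parameter are hypothetical; the text names 2 as the threshold without stating whether it is inclusive or which additional conditions apply, so the sketch leaves the threshold to the caller.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                    # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                    # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def is_misspelling(name, command, max_distance):
    """Hypothetical check: name differs from command by at most max_distance edits."""
    return name != command and osa_distance(name, command) <= max_distance


if __name__ == "__main__":
    for name, command in [("grpe", "grep"), ("got", "git"), ("Jupyter", "jupyter")]:
        print(name, command, osa_distance(name, command))
```

Note that a short nickname such as g for git is also only two insertions away from its command, so a distance threshold alone cannot separate deliberate abbreviations from typos.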

Abbreviating Subcommands

Many commands can operate in different modes, or act as interfaces to a variety of different subcommands. The subcommand is commonly specified as the first argument to the command, and takes its own set of arguments and flags. For example, git push --tags executes the push subcommand of git with the --tags flag enabled. We identified 67 commands in our dataset that take subcommands, such as git, docker, or systemctl. Notably, we found 194850 aliases (8.84%) that are purely abbreviations of subcommands, without adding any additional arguments beyond the subcommand, for example gs → git status or gd → git diff. The majority of such subcommand abbreviations (58.5%) are for git, with 113980 aliases defined purely for abbreviating git subcommands, accounting for 36.77% of all aliases involving git. The command with the second-most subcommand abbreviations is the package manager pacman, with only 9918 instances (5.09% of subcommand abbreviations, but 68.67% of all aliases involving pacman).

Bookmarking Locations

When an aliased command is called with an argument that references some specific local or remote location, like a file path or a domain, the alias acts as a bookmark to that location. For instance, starwars → telnet towel.blinkenlights.nl and dl → cd ~/Downloads are both bookmark aliases.

To find such bookmarking uses in our dataset, we searched for arguments that are locations, which we take to be any of the following:

  • A string containing a forward slash (/), indicating a path.

  • An IPv4 address, matched by the liberal regular expression [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+

  • A string containing one of the known top-level domains (Footnote 12) preceded by a dot (.) and followed by a slash (/), colon (:) or the end of the string.

To avoid false positives, we sampled the top 300 search results according to the above criteria and determined some exclusion patterns. For instance, /dev/null is not a location for our purposes. Neither is origin/master, and thus gm → git merge origin/master does not count as a bookmark. We also exclude aliases that are merely referencing unnamed relative directories (e.g., ../..).
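The location criteria and exclusion patterns can be sketched as a small predicate. The sketch is illustrative only: KNOWN_TLDS is a tiny hypothetical stand-in for the full list of top-level domains, EXCLUDED contains just the two exclusions named in the text, and the pattern for unnamed relative directories is an assumption about how that exclusion could be expressed.

```python
import re

# Liberal IPv4 pattern as given in the text.
IPV4 = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Hypothetical subset; the study uses a list of known top-level domains.
KNOWN_TLDS = ("com", "io", "net", "nl", "org")
TLD = re.compile(r"\.(?:" + "|".join(KNOWN_TLDS) + r")(?:/|:|$)")

# Exclusions mentioned in the text; the real list is longer.
EXCLUDED = {"/dev/null", "origin/master"}

# Unnamed relative directories such as ".", "..", or "../.." (assumed pattern).
RELATIVE_ONLY = re.compile(r"^\.{1,2}(?:/\.{1,2})*/?$")


def is_location(argument):
    """Return True if the argument looks like a local or remote location."""
    argument = argument.strip("'\"")
    if argument in EXCLUDED or RELATIVE_ONLY.match(argument):
        return False
    return (
        "/" in argument
        or IPV4.search(argument) is not None
        or TLD.search(argument) is not None
    )


if __name__ == "__main__":
    for arg in ["~/Downloads", "towel.blinkenlights.nl", "/dev/null", "../..", "-l"]:
        print(arg, is_location(arg))
```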

By our definition, 321546 aliases (14.59%) are bookmarks. Of these, 59931 are remote bookmarks containing URLs or IP addresses (15.92% of all bookmarks). Bookmarks are used predominantly for file system navigation, and the cd command is featured heavily. Most other uses seem to be development related, like starting services such as web servers or databases with pre-defined locations, opening frequently edited files, or outputting logs, as in onoz → cat /var/log/errors.log.

5.2 Modifications

Aliases are not only used syntactically, for naming purposes, but also in ways that change the semantics of certain commands. We found four customization practices related to command modification.

Substituting Commands

When an alias name is identical to the name of a pre-existing command, the alias defines a substitution for that command. A common example is more → less, replacing a standard Unix utility (more) with a more capable but similar command (less). This can also be used for subterfuge, as in emacs → vim (appearing 132 times in our dataset) or indeed vim → emacs (86 times, alas).

To determine which alias names are also actual command names, we compared them to known Unix commands (Footnote 13) and a curated sample of commands from our dataset (taking care to not include names that appear in a command position but are actually just other aliases). To determine proper substitutions, we only count aliases whose value does not also include the name of the command (which would point to an overriding alias, see below). We find that 100564 aliases (4.56%) are used to substitute one command for another. The top three substitutions are vi → vim, vim → nvim, and vi → nvim.

Overriding Defaults

When an alias has the same name as the command it aliases, as in ls → ls -G, then the alias re-defines the command and effectively overrides its default settings. Any time the command is now executed, it will be with the arguments specified in the alias. There are 319239 aliases in our dataset (14.48%) that are used to override defaults in this way. Aliases to override the defaults of the grep family of commands (grep, egrep, fgrep) occur 96970 times, accounting for 4.4% of all alias definitions (and 68.27% of all grep appearances). The ls command is redefined with new defaults 75374 times, accounting for 3.42% of all aliases (28.99% of ls appearances).
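The distinction between nicknames, substitutions, and overrides can be expressed as a small decision procedure. The sketch below is a simplified reading of the definitions in the text and not the classifier from the replication package: the function name classify and the set KNOWN_COMMANDS are hypothetical, values are split on whitespace only, and multi-command aliases are not considered.

```python
# Hypothetical stand-in for the list of known Unix commands.
KNOWN_COMMANDS = {"ls", "grep", "more", "less", "vi", "vim", "nvim", "emacs", "git", "df"}


def classify(name, value, known_commands=KNOWN_COMMANDS):
    """Classify a single-command alias as override, substitution, nickname, or other."""
    tokens = value.split()
    if not tokens:
        return "other"
    command, arguments = tokens[0], tokens[1:]
    if name == command:
        # Same name as the aliased command: the alias changes its defaults.
        return "override"
    if name in known_commands and name not in tokens:
        # The name belongs to a different command that the value does not mention.
        return "substitution"
    if not arguments:
        # A new name for a command, with no arguments added.
        return "nickname"
    return "other"


if __name__ == "__main__":
    for name, value in [("ls", "ls -G"), ("more", "less"), ("g", "git"), ("ll", "ls -l")]:
        print(name, "->", value, ":", classify(name, value))
```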

Looking at the new defaults of these redefined commands, they reveal a variety of user preferences, especially in the diverse long tail, where we find a lot of unique alias definitions and argument combinations. Two areas of customization stand out, however: formatting output and adding safety. The majority of overrides for file system commands (mv, cp, and rm, but also ln, for creating symbolic links) enable interactive mode (-i and variations), which prompts the user before performing potentially destructive actions. Verbose output (-v) also plays a role here, describing exactly what kind of effects a command execution had or will have. Enabling verbosity can also be seen as a kind of output formatting, although much more common is the wish for human-readable output. For example, the alias df → df -h ensures that the available disk space is displayed in common size units, as opposed to just the raw number of bytes. But by far the most common reason for overriding defaults is to enable colorized output. This behavior is so prevalent that we count it as a customization practice in its own right.

Colorizing Output

Enabling colored output can be done in many different ways: adding an argument (like less -R or grep --color=always), setting an environment variable (as in ssh → TERM=xterm256color ssh), running the command through a tool that colorizes its output (like grcat or pygmentize), or even replacing a command outright (diff → colordiff). Taking all these varieties into account, more than half of all command redefinitions (57.21%) enable colored output by default. This amounts to a surprising 182623 aliases, or 8.29% of all aliases in the dataset. If we extend this count to also include aliases that introduce new names (like ll → ls -l --color=auto), then more than 10% of aliases colorize a command’s output.

Elevating Privilege

The sudo command allows the user to execute another command with superuser privileges. Combining a command with sudo is often necessary if the other command needs to modify critical parts of the system. In our dataset, we found 93683 aliases (4.25%) in which a command is prefixed with sudo. The top sudo-prefixed command is the package manager apt-get, appearing 10467 times with sudo. Remarkably, these are 89.35% of all occurrences of apt-get. In fact, 72.45% of all occurrences of the package managers apt* (Debian and derivatives; including apt, apt-get, apt-cache, aptitude, and $apt_pref), pacman, abs and aur (Arch Linux), yum (RPM), dnf (Fedora), zypper (openSUSE), port (macOS), and gem (Ruby) are together with sudo, and these package managers account for 29.1% of all sudo occurrences. Interestingly, the macOS package manager brew rarely appears with sudo (only 1.07%), even though it is the third most occurring package manager overall, behind apt* and pacman.

Other commands that more often than not demand elevated privileges are system utilities like systemctl, shutdown, lsof or mount.

5.3 Scripts

Aliases that combine multiple commands are basically tiny shell scripts. In our dataset, 204142 aliases (9.26%) compose multiple commands. The most popular composition operator is the pipe (|), used in 39.66% of alias scripts, followed by the operators for simple chaining (;), with 29.61%, and logical conjunction (&&), with 26.88%. Other operators (||, |&) appear in only 3.85% of multi-command aliases.
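As a rough illustration of how such operator counts could be obtained, the sketch below extracts composition operators from alias values with a regular expression. It is an assumption-laden simplification rather than the study's actual tooling: the name `operators_in` is hypothetical, and quoting is ignored, so an operator inside a quoted string would be counted too.

```python
import re
from collections import Counter

# Longer operators come first so that "||" is not read as two pipes.
OPERATOR = re.compile(r"\|\||\|&|&&|\||;")

def operators_in(value):
    """Return the composition operators found in an alias value, in order."""
    return OPERATOR.findall(value)

# Alias values taken from examples in this article.
aliases = {
    "diskspace": "du -S | sort -n -r | more",
    "gitpull": "git stash && git pull && git stash pop",
    "ll": "ls -lh",
}

# Count each operator once per alias that uses it.
usage = Counter(op for value in aliases.values() for op in set(operators_in(value)))
print(usage)
```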

There are two scripting practices that are of particular interest.

Transforming Data

The pipe (|) creates an interface between two otherwise separate programs. It embodies the Unix philosophy of small tools doing one thing well, which can then be connected together to accomplish more complex tasks. There are 74719 aliases (3.39%) combining two or more commands using only the pipe operator. The most common command occurring after a pipe, by far, is grep, which makes an appearance in almost half of all pipelines (46.16%), more than three times as often as xargs and sort. The most common data sources are ps, git, and ls, which are found at the beginning of almost a third (32%) of all pipelines. Figure 4 shows a flow diagram of the top pipelines with three commands.

Fig. 4 Flow diagram of the top 250 pipelines with three commands that make up at least 10% of one participating command’s usage

The names of aliases for such pipelines are varied, speaking to the broad range of tasks that can be accomplished by combining various Unix tools. They range from the descriptive, as in diskspace → du -S | sort -n -r | more or weather → wget -qO - http://wttr.in/ | head -7, to the very terse, as in h → history | uniq | tail -15 or lll → ls -trlh | less. Interestingly, aliases with the same name usually describe pipelines with the same general shape (the same commands in the same order), but slightly different argument combinations:

lsd → ls -l | grep "^d"
lsd → ls -la | grep ^d
lsd → ls -lGFA --color | grep -i "^d.*/"
lsd → ls -lh | grep --color=never '^d'

This highlights the highly personal nature of aliases, each customized for an individual use case.

Chaining Subcommands

An interesting pattern appearing in alias scripts is the chaining of subcommand invocations. For example, the package manager brew has a subcommand update, for updating the package database, and a subcommand upgrade, for upgrading previously installed packages to the latest available versions. 28.08% of all aliases involving the brew command contain the composition brew update && brew upgrade (sometimes with ; instead of &&), with alias names like update, brewup, bup, etc. This pattern of repeated subcommand invocations can be found in 22062 aliases (1%), and it is most prevalent among package managers, like brew, apt-get, npm or gem, mostly for the same purpose as above.

The command with the highest absolute number of aliases showing this pattern is git, however, with 12063 occurrences (3.89% of all aliases using git). Here, the uses are more varied, e.g., commit → git add . && git commit -m, or gitpull → git stash && git pull && git stash pop, or indeed whoops → git reset --hard && git clean -df.
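One way to recognize the subcommand-chaining pattern is to check whether every chained segment of an alias value invokes the same command. The following sketch assumes a hypothetical helper `subcommand_chain`; it splits on && and ; without regard to quoting and treats the word after the command as the subcommand, which is only an approximation.

```python
import re

def subcommand_chain(value):
    """Return (command, [subcommands]) if all chained segments of the alias
    value invoke the same command, otherwise None."""
    segments = [s.split() for s in re.split(r"&&|;", value) if s.strip()]
    # A chain needs at least two segments, each with a command and a subcommand.
    if len(segments) < 2 or any(len(s) < 2 for s in segments):
        return None
    if len({s[0] for s in segments}) != 1:
        return None  # different commands are being combined
    return segments[0][0], [s[1] for s in segments]

print(subcommand_chain("brew update && brew upgrade"))
print(subcommand_chain("git stash && git pull && git stash pop"))
print(subcommand_chain("du -S | sort -n -r | more"))
```

For the first two values the function reports brew and git with their subcommand sequences; the pipeline in the third value contains no chaining operator and yields None.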

6 Implications

Through our large-scale analysis of the collective knowledge of shell customization via aliases, we gained insight into practices detailing how users customize their command-line interface. Based on our observations, we outline discussion points that go beyond single customization practices and identify implications that can address shortcomings in command-line usability and tie them to existing user experience research. Further, while our presented findings already give us an understanding of customization practices over many different kinds of commands, we view our collected dataset as a playground for fine-grained discovery that can benefit researchers, tool builders, and command-line users.

6.1 Learning Repair Rules

The complexity of commands and arguments can cause users to introduce errors when working in a command-line interface. Figuring out specifically how to fix these errors is often a convoluted process. A popular open source project that attempts to navigate this issue (footnote 14) uses a set of rules to suggest possible error corrections for commands. While these rules are all hard-coded, we envision leveraging the global wisdom of customizations in our large-scale dataset to learn rules that form the basis for different kinds of suggestions. This is in line with visions of integrating collective intelligence in software development (Bruch et al. 2010), in particular work in leveraging emergent behavior from corpora (Fast et al. 2014) that we can codify based on our customization data. We can also see approaches similar to work on learning code completions from examples (Bruch et al. 2009), with our dataset of alias definitions serving as an oracle for an automatic software repair system (Monperrus 2018) in the domain of shell commands. Using our dataset of known-good command invocations, it should be possible to train a statistical language model for command repair, akin to related work in code synthesis (Raychev et al. 2014).

As an example, take the following erroneous command execution:


$ apt-get install vim
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?

Without having to consult a hard-coded rule involving knowledge about apt-get, or even looking at the specific error that is produced, a command repair system trained on our dataset of alias definitions could easily suggest the correct fix: sudo apt-get install vim. It is reasonable to assume that this could be inferred as the correct invocation, because in aliases the command sequence apt-get install occurs almost exclusively prefixed with sudo.
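A minimal sketch of how such a rule might be derived from alias data follows. It is not the statistical language model envisioned above but a simple frequency count; the function names, the 0.75 threshold, and the miniature corpus are all assumptions made for illustration.

```python
from collections import Counter

def learn_sudo_rates(alias_values):
    """Map each command to the share of its invocations that follow sudo.

    Simplification: only the first command of each alias value is examined.
    """
    total, with_sudo = Counter(), Counter()
    for value in alias_values:
        words = value.split()
        prefixed = bool(words) and words[0] == "sudo"
        if prefixed:
            words = words[1:]
        if not words:
            continue
        total[words[0]] += 1
        with_sudo[words[0]] += int(prefixed)
    return {cmd: with_sudo[cmd] / total[cmd] for cmd in total}

def suggest_repair(command_line, sudo_rates, threshold=0.75):
    """Suggest a sudo-prefixed command line if the command usually appears with sudo."""
    words = command_line.split()
    if words and words[0] != "sudo" and sudo_rates.get(words[0], 0.0) >= threshold:
        return "sudo " + command_line
    return None

# Hypothetical miniature corpus; real rates would come from the full dataset.
corpus = [
    "sudo apt-get install",
    "sudo apt-get update",
    "sudo apt-get upgrade",
    "sudo apt-get autoremove",
    "apt-get source",
    "ls -G",
]
rates = learn_sudo_rates(corpus)
print(suggest_repair("apt-get install vim", rates))
print(suggest_repair("ls -l", rates))
```

With this corpus, apt-get appears with sudo in four of five aliases, so the failed invocation is rewritten with a sudo prefix, while ls is left alone.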

As another example, the following error is caused by the wrong order of arguments to the systemctl command:


$ systemctl docker status
Unknown command verb docker.

The correct invocation is systemctl status docker. It is again very plausible that a repair rule for this type of error could be learned from our dataset, based on the prevalence of aliases containing the command systemctl together with an argument status that occurs in first position, indicating the latent knowledge that status is in fact a subcommand of systemctl.
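The idea of learning argument positions from aliases can be sketched as follows. The code counts which word directly follows each command and, if a failed invocation starts with an unseen word but contains a known one later, moves the known word to the front. The function names and the small corpus are hypothetical, and the heuristic ignores options and quoting.

```python
from collections import Counter, defaultdict

def learn_first_arguments(alias_values):
    """Count, per command, which word appears directly after the command."""
    first_args = defaultdict(Counter)
    for value in alias_values:
        words = value.split()
        if words and words[0] == "sudo":
            words = words[1:]  # look past a leading sudo
        if len(words) >= 2:
            first_args[words[0]][words[1]] += 1
    return first_args

def suggest_reordering(command_line, first_args):
    """Move a known first argument to the front if an unknown word precedes it."""
    words = command_line.split()
    if len(words) < 2:
        return None
    command, args = words[0], words[1:]
    known = first_args.get(command, Counter())
    if args[0] in known:
        return None  # first argument already looks like a subcommand
    for i, arg in enumerate(args):
        if arg in known:
            return " ".join([command, arg] + args[:i] + args[i + 1:])
    return None

# Hypothetical miniature corpus of alias values.
corpus = [
    "sudo systemctl status",
    "systemctl status nginx",
    "sudo systemctl restart",
    "sudo systemctl start docker",
]
model = learn_first_arguments(corpus)
print(suggest_reordering("systemctl docker status", model))
```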

6.2 Discovering Workflows

Following a different thread of leveraging emergent practices, we can also see how our dataset would enable a world beyond only trying to fix immediate errors, by providing usage hints that could introduce users to common parameters and workflows. For example, as soon as a user tries to sort the output of the ps command, the alias mem10 → ps auxf | sort -nr -k 4 | head -10 can serve as a suggestion for the complex but common data transformation that results in showing the ten most memory-intensive processes.
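A suggestion mechanism of this kind could, in its simplest form, match the commands a user has typed so far against the beginning of known alias pipelines. The sketch below makes that assumption explicit; the helper names are hypothetical, arguments are ignored entirely, and splitting on | does not account for quoting or for the || operator.

```python
def pipeline_commands(value):
    """Return the command names of a pipeline, ignoring arguments."""
    return [seg.split()[0] for seg in value.split("|") if seg.strip()]

def suggest_workflows(partial, aliases):
    """Return aliases whose pipeline starts with the commands typed so far
    and continues beyond them."""
    typed = pipeline_commands(partial)
    suggestions = []
    for name, value in aliases.items():
        commands = pipeline_commands(value)
        if len(commands) > len(typed) and commands[: len(typed)] == typed:
            suggestions.append((name, value))
    return suggestions

# Alias definitions taken from examples in this article.
aliases = {
    "mem10": "ps auxf | sort -nr -k 4 | head -10",
    "diskspace": "du -S | sort -n -r | more",
}
print(suggest_workflows("ps aux | sort", aliases))
```

Typing ps aux | sort matches the first two commands of mem10, so that alias is offered as a possible completion of the workflow.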

Similarly, in the practice of chaining subcommands we can clearly see the prevalence of object protocols (Beckman et al. 2011), which are implicit rules determining the order in which commands have to be executed. We can improve usability by enabling the discovery of these implicit rules and by exposing the dependency structure based on our customization data. For instance, if executing brew upgrade results in a failure, we can suggest using brew update && brew upgrade instead, based on the patterns in our dataset (cf. Section 6.1).

Our findings can also contribute to recent work on the parallelization and distribution of shell scripts. Systems like PaSh (Vasilakis et al. 2021) and POSH (Raghavan et al. 2020) rely on manual annotation of commands and their arguments to effectively parallelize shell scripts. Our data can help focus these annotation efforts by informing the developers of these systems about which groups of commands and arguments are most frequently used together. The KumQuat system (Vasilakis et al. 2020) leverages program synthesis techniques to search a large space of candidate solutions to synthesize parallel shell scripts. The collective knowledge present in alias definitions can guide this search and justify certain intuitions about the latent data parallelism in Unix pipelines (Handa et al. 2021). For example, while a parallel version of the comm command for comparing sorted files line-by-line is not synthesizable in general, it becomes trivially parallelizable if each of its input lines is known to be unique. Evidence that this is indeed the common case can be found in our dataset, where 41.29% of all occurrences of comm follow sort | uniq or sort -u, and the remainder mostly have unique data sources as input, like pacman -Qeq.

6.3 Uncovering Conceptual Design Flaws

Customization can also be an indicator for problems in the underlying conceptual design, manifesting as usability frustrations that require adaptation by the user. In their analysis of Git, Perez De Rosso and Jackson (2013, 2016) describe a number of flaws and operational misfits arising from the conceptual design of the software. The frustrations experienced by users because of these design flaws are evident based on the alias definitions in our dataset.

For example, the difficulties some Git users have with the concept of staging can be seen in aliases that ensure untracked files are included in a commit by explicitly adding them beforehand, like commit → git add . && git commit -m or gac → git add --all && git commit (footnote 15). Another frustration is having to use git stash to temporarily save uncommitted changes and clean the working directory in order to avoid conflicts when using other Git commands. Stashing in itself has no higher purpose in version control; it merely exists as a concept to work around limitations in Git (footnote 16). This can be seen in aliases like gspull → git stash && git pull && git stash pop, which defines a new type of pull command that stashes away ongoing work before pulling in remote changes and finally re-applying the stashed work. The same problem happens when switching branches, hence aliases like gsc → git stash && git checkout $1 && git stash pop.

Church et al. (2014) found that version control systems are generally perceived as being risky to use, and sought explanations for this impression via an analysis of Git using a framework of cognitive dimensions (Green and Petre 1996). One of the dimensions that dominate the command-line interface of Git is Hidden Dependencies. There are many hidden dependencies in Git, a prominent one being the dependency between the local branch and the remote repository. This is revealed by alias definitions like gitstatus → git remote update && git status. Unless one first manually updates Git’s local information about remote branches, the command git status will happily report that the local branch is up-to-date with respect to its remote origin, even if the remote repository is in fact many commits ahead.

We want to emphasize that we are not suggesting that large-scale quantitative data of customization practices can replace qualitative analysis, but rather that the corpus we provide, together with our findings, can support exploration and provide new insights for usability research. Alias definitions can provide evidence for analytic theories based on cognitive or conceptual models of software use, because they codify workarounds for common annoyances and other customizations based in every-day use. According to a recent need-finding study by Zhang et al. (2020), API designers have a strong desire to know more about users’ mental models, and wish to validate design hypotheses with examples of real-world API usage. Existing techniques for mining API usage fall short in this respect, and the study highlights the importance of, among other things, looking at how users deal with unanticipated corner cases and how they apply workarounds. We suspect makers of command-line software are in a similar situation as API designers and could similarly benefit from community usage data that highlights gaps between interface design and users’ expectations.

6.4 Contextual Defaults

Choosing proper defaults in user interfaces is a pillar of user experience design (Nielsen 2005a). The fact that 14.48% of the customizations in our dataset are for overriding defaults suggests that, at least for some groups of users, the default settings of their tools could be improved. We see overriding defaults not necessarily as an indictment of the involved commands, but rather as an indication that the assumed user context does not in all cases match the actual usage profile. This can be the case if the tool assumes a different execution environment than the one it is ultimately used in, e.g. personal notebook vs cloud deployment (where an alias like java → java -ea -server ensures that Java programs are always run on a server-optimized virtual machine) or interactive terminal vs shell script use (cf. Section 6.5), or if the tool assumes a certain type of user with different needs than the actual user.

Indeed, the variety of different defaults in the data indicates what we call contextual defaults, where context could be a reflection of the level of expertise of a command-line user, or a certain persona (e.g., system administrator, data scientist, or software engineer). For example, the top default alias for the ffmpeg command is ffmpeg → ffmpeg -hide_banner, suppressing verbose default output that can be confusing for newcomers but is helpful for the tool developers when providing support and locating errors (footnote 17). We could imagine providing different sets of defaults to different users, effectively alias starter packs, generated from our data. We see parallels to work that investigates contextual preferences and personalization in information systems (De Amo et al. 2015; Stefanidis et al. 2011) and privacy research (Wijesekera et al. 2018; Alom et al. 2019).

6.5 Interactivity vs Scripting

The first “modern” command line, the Bourne shell from 1977, had two primary goals: to provide an interactive command interpreter, and at the same time serve as a scripting system (Jones 2011). There is a natural tension between these two goals, which becomes evident when users are overriding defaults with aliases like mv → mv -i. Here, the mv command is redefined to always run interactively, prompting the user at critical points, i.e. before overwriting existing files. The default operating mode of mv, and most other commands, is to assume that the user is aware of and okay with the possible consequences of running it, and that they have not made any mistakes in its invocation. This is of course a much more useful assumption in a scripting context.

The bias of most command-line tools towards scripting is also evident in their output, which is usually minimal and not tailored for human ease-of-use. We can see this in aliases like mount → mount | column -t, which aligns the output of the mount command for easier reading, or df → df -h or ll → ls -lh, which change the default output of these commands so that file sizes are not shown simply in bytes but rather in much more practical common units like megabytes. The high prevalence of aliases for colorizing output (e.g. grep → grep --color=auto) is also notable, as color only makes sense in an interactive context. In terminals, colorful text is achieved by inserting ANSI escape codes into the text stream. This is a hindrance for scripts, but tools could easily detect whether they are run in an interactive terminal or as part of a script and adjust their output accordingly.
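The detection mentioned above is typically a single check on the output stream. The following Python sketch shows one way a tool might do it; the `render` function and the choice of green are illustrative assumptions, not taken from any particular tool.

```python
import io
import sys

GREEN, RESET = "\033[32m", "\033[0m"  # ANSI escape codes for green text and reset

def render(text, stream):
    """Colorize text only when the stream is an interactive terminal."""
    if stream.isatty():
        return f"{GREEN}{text}{RESET}"
    return text  # plain output for pipes, files, and scripts

# Colored when run in a terminal, plain when stdout is redirected.
print(render("match", sys.stdout))

# An in-memory stream stands in for a pipe or file and never reports a terminal.
print(repr(render("match", io.StringIO())))
```

With this check in place, a script that captures the output sees no escape codes, while an interactive user still gets color without defining an alias.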

Note that the tension between interactivity and scripting is not the same as the divide between “casual” and “power” users. Experts are experiencing the same frustrations as amateurs when using the shell interactively. Recently, there has been a growing movement that sees today’s command line as a human-first text-based UI, rather than a machine-first scripting platform (Prasad et al. 2021). This new generation of command-line users and tool authors embrace the Unix philosophy with its core tenet of simple tools that can be composed well together (Raymond 2003), but they want to modernize those tools to fit current environments, with a more humanistic approach to their interaction design (footnote 18). Emphasizing the conversational nature of the command line, they highlight the need for features such as error correction (cf. Section 6.1) or command suggestions (cf. Section 6.2), and confirming potentially destructive actions before they are executed. They see human-readable output as paramount and suggest tools should be more aware of their environment (cf. Section 6.4).

7 Threats to Validity

We review potential limitations of our study as threats to validity. First, our sample might not be representative. Our dataset only includes aliases by people who publicly shared their dotfiles, we only collected from GitHub, and our sample does not include forks. Nevertheless, our dataset is very exhaustive, as we were able to sample 94.09% of the estimated population of Shell files containing aliases on GitHub. And while mining GitHub can be fraught with perils (Kalliamvakou et al. 2014), we specifically sought out personal repositories, side-stepping many of the typical issues with mining GitHub for software projects.

Second, our parser might not be sophisticated enough to recognize complex real-world aliases or cope with minute platform differences. To mitigate this threat, we ran multiple sanity checks and tested the parser on some hairy examples from the dataset. We did not detect any significant mis-parses and think that we have covered the majority of relevant cases. The raw unparsed database is available in our replication package.

Third, aliases might not reflect intent as much as we assume. En-masse copy-pasting of aliases by users, without them knowing exactly what they are copying, is certainly a realistic scenario. System distributions and configuration frameworks like ohmyzsh ship with numerous aliases by default or as part of easily enabled plugins. Users might not even be aware of the aliases they have on their system. We mitigate this concern by removing all duplicate files from our dataset that would indicate sheer copy/pasting.

Fourth, we might not actually be able to see the true user intent, if it exists, as quantitative measures might hide a long tail of minor variations and individual user preference. Conclusions about common aliases or selected subsets might not be generalizable. To mitigate these summarizing effects, we established customization practices as a vehicle to take a deeper dive into the details of certain alias usage. Since we sampled almost the whole available population, we are confident in the strength of our data and the conclusions we can draw from particular instances. Our replication package includes our whole toolchain and all alias data in a relational format ready for further analysis.

8 Related Work

Related research in the broader context of our work has been conducted on understanding common practices in the software engineering community based on public online data, on software configuration in general, on the use of command-line interfaces and how to improve them, and on the shell as a programming language for both scripting and interactive use.

Empirical studies similar to ours, looking at community knowledge in software engineering to understand practices and distill insights, have been conducted in related domains: Zhong and Su (2015) study real-world bug fixes in Java projects to help guide automatic program repair; Yang et al. (2017) mine Stack Overflow posts and GitHub repositories to find out how programmers use and adapt copy-pasted code snippets in open-source projects, while Baltes and Diehl (2019) investigate to what extent such snippets are copied without proper attribution; Prana et al. (2019) conduct a qualitative study to categorize the content of GitHub README files and build an automated classifier to label README sections, easing information discovery; Barnaby et al. (2020) present a tool that mines code bases for idiomatic usage examples of API methods.

In the context of software configuration, Sayagh et al. (2020) surveyed experts and the literature to identify a number of challenges and recommendations related to configuration practices. Our work reflects some of their findings, insofar as shell aliases are a form of personal configuration that can interact with—and counteract—other system configurations. For example, selecting good out-of-the-box default values is seen as an important issue by experts, and aliases are indeed often used to override defaults. Related to our implications on contextual defaults (Section 6.4), Zheng et al. (2011) present MassConf, a system that proposes optimal software configurations based on a user’s environment and existing configurations. Adjacent work in configuration mining includes the ConfigMiner tool by Sayagh and Hassan (2020), which identifies appropriate configuration options based on related StackOverflow questions.

The earliest study we found on the use of command-line interfaces was by Greenberg (1988), who collected four months of continuous real-life use of the Unix csh shell from 168 users. The data was used in a follow up study to analyze the use of interactive systems by examining the frequency of command invocations for different groups of users (Greenberg and Witten 1988). In later work, Davison and Hirsh (1998) use probabilistic action modeling to predict user action sequences based on the same dataset. Korvemaker and Greiner (2000) similarly predict future action sequences in command lines, but condition on actions of the particular user group with the goal of enabling adaptive user interfaces. Other work in the context of adaptive user interfaces by Jacobs and Blockeel (2001) uses association rule learning on the shell logs to produce scripts to automate common task sequences. Khosmood et al. (2014) use the same corpus and two additional, more recent, corpora to learn a model that can identify user profiles based on their command-line behavior. Bespoke (Vaithilingam and Guo 2019) is a system that synthesizes specialized graphical user interfaces (GUIs) based on command usage. Our work can be viewed as an input to this system that passes common shell workflows in aliases to be generated as GUIs.

There has been other work on enhancing user experience in command-line interfaces. NoFAQ (D’Antoni et al. 2017) provides repair suggestions for failed shell invocations based on a model learned from a curated set of fix patterns. NL2Bash (Lin et al. 2018) implements a system that translates natural language phrases in English to shell commands. Recent work by Greenberg (2017) has been looking into understanding the POSIX shell as a programming language, more specifically into understanding word expansion in the shell to support interactivity (Greenberg 2018b) and concurrency (Greenberg 2018a).

9 Conclusion

We report on a large-scale exploratory study on how command-line users customize user experience by defining shell aliases. Through inductive coding, nine customization practices emerged from our dataset of collective customization knowledge mined from GitHub, providing insight on the characteristics of command-line use. Based on our results, we discuss and formulate a set of implications for command-line tool developers, researchers, and the shell as an interactive environment for experts. We enable further analysis and a basis for learning applications based on our extensive curated dataset.

Aliases often redefine commands with other default arguments, which is a potential indicator for usability problems in these tools. However, we have to also be aware that defaults can be highly contextual depending on user profiles (e.g., expertise level) and environment (e.g., scripting vs. interactive use). We also see our dataset and results as a rich source for learning norms with respect to repair rules, data flows, and descriptive names for complex command structures. We provide a comprehensive replication package and see potential for future work based on our dataset and analyses.