Abstract
Open model developers have emerged as key actors in the political economy of artificial intelligence (AI), but we still have a limited understanding of collaborative practices in the open AI ecosystem. This paper responds to this gap with a three-part quantitative analysis of development activity on the Hugging Face (HF) Hub, a popular platform for building, sharing, and demonstrating models. First, various types of activity across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. Activity is extremely imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Furthermore, licenses matter: there are statistically significant differences in collaboration patterns in model repositories with permissive, restrictive, and no licenses. Second, we analyse a snapshot of the social network structure of collaboration in model repositories, finding that the community has a core-periphery structure, with a core of prolific developers and a majority of isolate developers (89%). Upon removing these isolates from the network, collaboration is characterised by high reciprocity regardless of developers’ network positions. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of companies, are widely used on the HF Hub. Overall, the findings show that various types of activity across the HF Hub are characterised by Pareto distributions, congruent with open source software development patterns on platforms like GitHub. We conclude with recommendations for researchers and practitioners to advance our understanding of open AI development.
Introduction
Open source developers have become central actors in the political economy of artificial intelligence (AI). The rise of open source AI, specifically the emergent practice of releasing, fine-tuning, and openly developing pre-trained models that are freely available,Footnote 1 has extended open science practices crucial to AI advances [3, 4], including the development of open source software (OSS)Footnote 2 and the provision of open access to research [5] and datasets [6,7,8]. Open source AI has attracted attention as a potential challenger to the dominance of a few well-funded startups and Big Tech companies in AI research and development (R&D) [9, 10]. Grassroots initiatives like EleutherAI [11], BigScience [12], and BigCode [13] have shown the feasibility of open model development [14], while the Hugging Face (HF) Hub has emerged as a popular platform used by millions to host, download, and collaborate on a growing number of models, datasets, and spaces (i.e., web applications to demonstrate and try out models) [15].
While the benefits and risks of open source AI have been widely debated [16,17,18,19,20], the practices and processes involved in open model development have received relatively little attention. To date, only a handful of scholars have explored various aspects of open model development, including user contributions to grassroots initiatives [12, 14], commercial participation in model development [14, 21], model maintenance practices [22], and the processes and tools used by open data engineering communities [23].
We contribute to this nascent research agenda with a three-part quantitative analysis of development activity on the HF Hub. First, we investigate typical patterns of various types of activity on the HF Hub in 348,181 model, 65,761 dataset, and 156,642 space repositories (RQ1). Subsequently, we apply social network analysis (SNA) of code contributions to model repositories to investigate the social network structure of the developer community as well as collaboration practices amongst developers (RQ2). We replicate this analysis for models in the sub-fields of natural language processing (NLP), computer vision (CV), and multimodal (MM) for comparative analysis. Finally, we quantify model adoption through the lens of model usage in spaces on the HF Hub (RQ3), providing insights into the widespread use of a minority of models in the HF Hub developer community and the key actors driving their development.
Overall, our analysis reveals that various aspects of development activity on the HF Hub—e.g., interactions in model, dataset, and space repositories; collaboration in model repositories; and model adoption in spaces—exhibit right-skewed, Pareto distributions, which is a well-documented pattern in OSS development [24,25,26,27,28]. While the open model development life-cycle involves unique practices which differ from OSS development [22], such as model training and fine-tuning, the observed similarities in the overall patterns of activity suggest that future research on open source AI can benefit from drawing on the extensive, multidisciplinary literature on the social dynamics of OSS development. Based on our findings, we propose a number of recommendations for researchers, policymakers, and platform providers to facilitate research and evidence-based discussions on open source AI.
The paper has the following structure. First, the literature review provides an overview of prior work on open source AI, as well as prior work on OSS development in order to draw comparisons between OSS and open model development practices. Second, we present the RQs and research design. Third, we introduce the main findings from the three-part analysis. Fourth, we discuss the contributions of the findings and make recommendations for research and practice. We conclude with a discussion of what further clarification of the practices in open model development can offer for (open source) AI researchers, developers, policymakers, and platform providers.
Related work
“We have no moat”: The emergence of open models
Open science practices, from the development of open source software (OSS) to the provision of open access to research (e.g., via arXiv [5]) and datasets (e.g., via Kaggle [6], ImageNet [8], or Common Crawl [7]), have been integral to advances in AI R&D and adoption [4, 29]. The culture and norms of openness in AI have evolved significantly in the last 15 years [30]. For example, in 2007, a coalition of 16 researchers lamented the lack of OSS that standardised the implementation of ML algorithms, highlighting this as a major obstacle to advances and reproducibility in AI research [31]. Yet today AI R&D is simply unimaginable without OSS [3, 32], drawing on a growing commons of over 300 OSS libraries [33], hundreds of thousands of open models [34], and over a million OSS repositories [35].
Following years of debate about the safety of openly releasing AI models [17, 18, 36, 37], recent years have seen the emergence and proliferation of “open” models, which individuals and organisations have shared on an open access basis on platforms such as the HF Hub [4]. Prior to this, AI models, in particular large language models (LLMs), were principally developed and maintained behind closed doors, albeit with open science practices, such as the sharing of publications on arXiv and code on platforms like GitHub. The start of this trend is attributed to EleutherAI, a grassroots research group, which formed on a Discord server with the intention to develop and release an open source variant of OpenAI’s GPT, resulting in The Pile in December 2020 [38], a library of datasets for training LLMs, and GPT-Neo in March 2021 [39]. Subsequently, open models gained more visibility with the release of other state-of-the-art AI models [10], including BLOOM by the BigScience workshop in July 2022 [40], Stable Diffusion by Stability AI in August 2022 [41], and LLaMA 2 by Meta in July 2023 [42], amongst others.
The proliferation of open models, especially foundation models, has ignited heated debate about their potential benefits and risks [16,17,18,19,20, 43]. On the one hand, open models are said to promise benefits for research, innovation, and competition by lowering entry barriers and widening access to state-of-the-art AI [44]. Drawing on Linus’ Law from OSS development that “given enough eyeballs, all bugs are shallow” [45], proponents argue that open model development and auditing offers safety advantages [46]. In addition, open access to models lowers the barriers for adaptability and customisation for diverse language contexts [18, 47]. On the other hand, open models can pose risks of harm by both well-intentioned and malicious actors, including the creation of deepfakes [48,49,50], disinformation [51, 52], and malware [53, 54]. A study by 25 experts concluded that open models have five distinctive properties that present both benefits and risks: broader access, greater customisability, local adaptation and inference ability, the inability to rescind model access, and the inability to monitor or moderate model usage [18].
The development of open models has been described as a potential challenge to the dominance of Big Tech companies in AI R&D [9, 55]. This was underlined by a leaked Google memo that claimed there is “no moat around closed-source AI development” and “open source solutions will out-compete companies like Google or OpenAI” [56]. Venture capitalists have bullishly invested in open source AI startups [57, 58], and world leaders like President Macron of France have pledged public funds to support open source AI [59]. In addition, the Mozilla Foundation has launched mozilla.ai with $30 million in investment to build a trustworthy, independent, and open source AI ecosystem “outside of Big Tech and academia” [60]. While proponents champion open models as good news for innovation and competition, others temper this optimism by pointing to market concentrations at several layers of the AI stack, from chips to cloud compute infrastructure, which remain unchallenged by innovations stemming from open source AI communities [21, 61, 62].
A myriad of meanings are attached to “open models” and “open source AI”. Oftentimes these terms are understood as making pre-trained models, parameters (or “weights”), and documentation available on platforms like the HF Hub. In some cases, they refer to open collaboration on the development of models [14]. The description of open models as “open source” has been fiercely contested for failing to meet OSS standards as defined by the OSI [2, 4, 63, 64]. For example, when Meta imposed limits on use of LLaMA 2, Stefano Maffulli from the OSI commented, “Unfortunately, the tech giant has created the misunderstanding that LLaMA 2 is ‘open source’—it is not. Meta is confusing ‘open source’ with ‘resources available to some users under some conditions,’ [which are] two very different things” [63].
Companies have been criticised for “open-washing” by promoting their models as “open source” models, when they are typically “open weight” models at most, as a commercial strategy to present themselves as patrons of the digital commons, whilst disguising their intent to set open standards and benefit from crowdsourced innovation [21, 62, 65, 66]. A review of the openness of LLMs found that, “[W]hile there is a fast-growing list of projects billing themselves as ‘open source’, many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare” [66].
It remains an open question whether one can or should classify AI models as either open or closed-source. Through a global, multi-stakeholder approach, the OSI is currently developing a definition of open source AI as AI systems that are made available under terms that grant the freedoms to use, study, modify, and share the system [1, 67]. Countering binary approaches, Irene Solaiman [17] makes the case that AI systems are not either fully open or fully closed; rather, the openness of AI systems can be plotted along a gradient with six degrees of openness. Each grade of openness involves trade-offs between concentrating power and mitigating risks [17]. As the field rapidly evolves, developing responsible practices, norms, and regulation around open source AI remains a critical challenge [43, 44].
A nascent research agenda on open source AI
While the benefits and risks of open models have been widely discussed, we still have a limited understanding of the collaborative practices involved. In this section, we review prior work on open model development and motivate our empirical analysis of development activity on the HF Hub to address this research gap.
The HF Hub has emerged as a popular platform used by individuals and organisations to share, download, and collaborate on models, datasets, and spaces [68, 69]. The HF Hub is a “model marketplace,” which is “a new form of user-generated content platform, where users can upload AI systems and AI-related datasets, which in turn can be downloaded, and depending on the business model, queried, tweaked, or built upon by other users” [70]. Much of the activity amongst the emerging developer community on this platform concerns individuals fine-tuning pre-trained models that were released by industry leaders for downstream use in research and applications [21]. In addition to the hosting and fine-tuning of open models, a few grassroots initiatives have embraced open collaboration methods to develop open models. For example, the development of BLOOM, a 176B parameter multilingual LLM, and its training dataset, ROOTS, was the largest “open source” AI collaboration to date, involving over 1,000 volunteers from over 70 countries and over 250 institutions [12]. Such initiatives have demonstrated alternative pathways for AI model development beyond the handful of companies that dominate AI R&D [9, 14]. Prior work has also highlighted the leadership role of companies, such as Hugging Face, in organising “values-driven initiative[s]”, such as the BigScience workshop, and attracting contributors who have diverse motivations, from developing new skills and working on new problems to publishing research and giving back to the ecosystem [12, 14].
Due to the growing popularity of the HF Hub, scholars have examined the suitability of the HF Hub for empirical research on open model development [69, 71].Footnote 3 Castaño et al. [22] provide the most comprehensive empirical insights into maintenance practices in model repositories on the HF Hub.Footnote 4 They find that commit activity follows a right-skewed distribution, with a few models receiving extensive activity while the majority of repositories receive limited activity [22]. While the majority of models are developed by single developers (mean 1.18, median 1.0), some model repositories, such as bigscience/bloom or bigcode/santacoder, are co-developed and co-maintained by up to 20 developers [22]. They also find that developers tend to prioritise “perfective tasks” to enhance model performance and align with technological advances, unlike OSS maintenance that focuses on bug fixes and feature additions [22]. The authors contend this “reveals the need for methods and tools specifically designed for the unique demands of ML model maintenance. Such tools may include advanced version control systems optimized for data and model tracking, as well as automated monitoring tools capable of detecting model drift or degradation” [22]. Prior work has also examined carbon emission reporting in model repositories, finding stagnation in emissions reporting by developers and highlighting the need for improved reporting practices and carbon-efficient model development on the HF Hub [72].
Our research builds on this prior work. Since this is one of the first studies to investigate open model development practices, in the next section we draw on prior work on OSS development, both to enable comparison of our findings with prior research and to lay the groundwork for a more comprehensive understanding of open model development in the future.
Learning from prior work on OSS development
Prior work on OSS development provides an empirical foundation for investigating the social dynamics of open model development. In the early 2000s, a number of metaphors were used to describe the social structure of “the OSS community”. For example, the Linux developer community was described as a “bazaar” that vibrated with the activity of geeks, hackers, and hobbyists, who performed various tasks, from bug-spotting to writing code to “serving the hacker culture itself” [45]. However, prior work illustrates that OSS communities have diverse social structures [73, 74], from “caves” with singular developers [75] to “core-periphery” networks, akin to “layered onions” [76], with uneven activity distributions ranging from core contributors (e.g., project initiators) to users (e.g., bug-spotters) [77,78,79,80].
Numerous studies highlight that various types of activity in OSS development, such as discussions in mailing lists, bug-spotting in issue trackers, and commit activity, exhibit right-skewed, Pareto distributions [24, 26, 28]. Indeed, it is a well-documented observation that OSS development is typically characterised by the Pareto principle, commonly known as the 80/20 rule or the law of the vital few, which states that approximately 80% of effects come from 20% of causes [81]. These findings are congruent with a wide range of Internet phenomena, which similarly exhibit right-skewed distributions that follow power laws [82, 83]. However, there are exceptions to the rule; for example, a study of 2,496 projects on GitHub found that the Pareto principle does not always characterise development activity in OSS repositories, thus highlighting the need to be cautious about generalising the Pareto principle as an incontestable law of OSS development [84]. Furthermore, many activities, such as mentorship and hackathons, take place outside of the repository [32, 85,86,87] and are therefore invisible to quantitative scholars of OSS development practices.
The various social structures of OSS communities are shaped, amongst others, by the diverse incentives of individuals and companies that participate in OSS development [88,89,90]. Individual developers are typically motivated by factors such as personal values, altruism, enjoyment, reputation-building, and career benefits [91,92,93,94]. However, there are also major barriers to participation, including gender disparities [95, 96] and geographic inequalities [86, 87]. Activity tends to be concentrated in the Global North [97] and the English lingua franca is a barrier for many developers [87, 98]. Furthermore, the incentives of OSS developers vary by geography: while developers in the USA show a relatively strong interest in “geek culture”, developers in India and China tend to be motivated primarily by career benefits [99]. Thus, “researchers studying open source should be mindful of geographic variation in what motivates participation and what forms participation may take, particularly outside of the code repository” [86].
Meanwhile, companies primarily participate in OSS development for strategic reasons, such as recruiting developers [100,101,102], reducing costs [101, 103, 104], influencing OSS projects [104, 105], promoting open standards [103, 106], and building a reputation as an OSS patron [32, 89, 107]. Commercial participation has mixed effects on the social structure of OSS communities. Typically, one company or a few companies emerge as dominant contributors in projects [28, 108]. The dominance of a company is negatively associated with the participation of volunteers, while it is positively associated with the productivity of contributors and the quality of issue reports [74, 109]. It is also common for companies, which may be market rivals, to collaborate in OSS ecosystems [108, 110,111,112,113], which has turned many OSS communities “from networks of individuals into networks of companies” [100].
Building on this prior work, this study aims to provide novel insights into the collaborative dynamics of the HF Hub. Specifically, we investigate typical patterns of development activity across model, dataset, and space repositories on the HF Hub (RQ1), the social network structure of its developer community (RQ2), as well as model adoption and the key actors driving the development of the most widely-adopted models (RQ3). The research extends the literature by shedding light on the practices involved in open model development on this increasingly important platform. The findings contribute to a more comprehensive understanding of open model development and lay the groundwork for future research.
Study design
Research aims and research questions
This study extends the nascent research agenda on open model development with a quantitative analysis of development activity on the HF Hub. We adopted a quantitative approach to explore large-scale patterns and trends in development activity on the HF Hub, which is a suitable approach when one seeks to generate baseline insights on a new phenomenon [114]. In particular, we examine different aspects of development activity on the HF Hub via the following RQs:
-
RQ1: What are typical patterns of development activity across the HF Hub?
-
RQ2: What is the social network structure of the HF Hub developer community?
-
RQ3: What is the distribution of model adoption on the HF Hub, and who are the key actors driving the development of the most widely-adopted models?
These RQs examine different aspects of development activity on the HF Hub. RQ1 focuses on identifying common patterns across various types of activity, such as likes, discussions, commits, and downloads, in the repositories of models, datasets, and spaces. Concretely, this analysis expands prior work that focuses on commit activity in model repositories [22]. RQ2 concerns the social network structure of the developer community on the HF Hub. In particular, we analyse a snapshot of collaboration interactions in model repositories amongst around 100,000 developers, building on prior descriptions of collaboration on open models [12, 14] and maintenance practices [22]. Lastly, RQ3 empirically tests a prior observation of uneven model adoption and the influence of Big Tech companies [21] by examining the distribution of model use in spaces and identifying the developers of the most used models. In addition, we examine model co-usage patterns to provide insights into the interdependencies and ecosystems surrounding popular models.
The HF Hub: a new platform and source of research data
The HF Hub was launched in 2021 by Hugging Face, a startup whose mission is to “democratize AI” [68]. The HF Hub is a Git-based social coding platform, widely used by researchers, developers, and hobbyists to share, discover, discuss, and collaborate on open models [115], datasets [116], and spaces [117]. Spaces are interactive web applications that facilitate the creation of demonstrations and make models hosted on the platform more accessible to end-users. The platform provides a number of tools for open model development, such as version control for collaboration and tracking [115], and evaluation and benchmarking of model performance [118]. The HF Hub API allows programmatic access to platform resources as well as metadata from repositories hosted on the platform [15]. In light of its features and data availability, prior work underlines the platform’s suitability for empirical studies on open model development [22, 69]. Building on this prior work, this paper aims to advance the research community’s understanding of the development practices in open model development as well as methodological considerations regarding the HF Hub.
When using data from the HF Hub, it is important to consider the ethical implications and adhere to the platform’s terms of service. In this study, we only collected publicly available data through the official HF Hub API, respecting the privacy settings of users and repositories. For example, we did not attempt to access or include data from private repositories in the analysis. Additionally, we anonymised the collected data by focusing on aggregate measures and avoiding the disclosure of personally identifiable information in the findings. Ethical clearance for this study was obtained from the CUREC institutional review board at the University of Oxford.
Data collection
We collected data via the HF Hub’s API in October 2023 [15], using Python scripts that are available on GitHub [119]. For RQ1, we collected and processed metadata for a number of activities from the public repositories of 348,181 models, 65,761 datasets, and 156,642 spaces, using the list_models(), list_datasets(), and list_spaces() API endpoints. These included: likes (n_likes), downloads (n_downloads),Footnote 5 discussions (n_discussions), commits (n_commits), unique developers who have contributed commits (n_commiters), unique developers who started discussions (n_disc_starters),Footnote 6 and the repository’s community size (n_community), calculated as the cardinality of the set union of n_disc_starters and n_commiters. As per prior work [113, 120, 121], we removed bots and merged multiple developer identities before enumerating n_disc_starters, n_commiters, and n_community. As a result, n_community is recorded as 0 if no user has made a commit or started a discussion in the repository, which ignores the creator of the repository. We acknowledge that alternatively such repositories could have the value 1.
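As a minimal sketch (the function name and usernames below are ours, for illustration only), n_community can be computed as the cardinality of the union of the two contributor sets:

```python
def community_size(disc_starters, commiters):
    """Compute n_community as the cardinality of the set union of
    discussion starters and committers, applied after bot removal
    and identity merging. Repositories with no commits or
    discussions yield 0, per the convention described above."""
    return len(set(disc_starters) | set(commiters))

# Hypothetical usernames for illustration only.
print(community_size(["alice", "bob"], ["bob", "carol"]))  # → 3
```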
For RQ2, we operationalised collaboration on models as instances where a pair of developers contributed commits to the same model repository, with directed edges recorded between developers that were weighted by the number of times a developer contributed a commit to the same repository as the other developer [122]. We operationalised commit activity as acts of collaboration because commits are easily measurable, represent “validated” contributions, and provide an accurate audit trail of collaboration [80, 113]. However, we acknowledge that the fact that two developers commit to the same repository does not necessarily imply direct interaction; for example, it would have been more accurate to focus on developers’ contributions to the same file in a repository, as we discuss in “Threats to validity” section. Formally, we modelled collaboration as a network \(N = (D,E,W)\), where D is the set of developers, \(E = \{(i,j,w_{ij}) \mid i,j \in D, w_{ij} \in {\mathbb {N}}\}\) is the set of directed edges denoting the relationships between developers, and \(W = \{w_{ij} \mid (i,j,w_{ij}) \in E\}\) represents the weights associated with each directed edge. For a developer pair i and j, we denote the directed relationship as \((i,j,w_{ij})\), where \(w_{ij}\) signifies the number of times developer i has committed to the same repository as developer j.
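The edge list of \(N\) can be sketched from a mapping of repositories to their committers. The snippet below, with hypothetical repository and user names, counts shared repositories rather than individual commits, a simplification of the per-commit weighting described above:

```python
from collections import Counter

def build_collaboration_network(repo_committers):
    """Return directed edge weights (i, j) -> w_ij, where w_ij counts
    the repositories in which developer i committed alongside j.
    A single-contributor repository yields a self-loop, mirroring
    the treatment of isolate developers."""
    edges = Counter()
    for devs in repo_committers.values():
        devs = set(devs)
        if len(devs) == 1:
            (d,) = devs
            edges[(d, d)] += 1  # self-loop for an isolate developer
        else:
            for i in devs:
                for j in devs:
                    if i != j:
                        edges[(i, j)] += 1
    return edges

# Hypothetical repositories and usernames for illustration only.
repos = {
    "org/model-a": ["alice", "bob"],
    "org/model-b": ["alice", "bob"],
    "org/model-c": ["carol"],
}
network = build_collaboration_network(repos)
```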
To collect data for the analysis of RQ2, we collected commit data from public model repositories via the HF Hub API. We started by retrieving a list of all available model IDs using the list_models() endpoint. Then, for each model repository, we used the list_repo_commits() endpoint to retrieve the commit data, including the authors associated with each commit. For each commit, we recorded an edge between the developer who made the commit (source_node) and all other developers who had contributed to the repository (target_node). In cases where a repository had only one contributor, we created self-loop edges to capture the isolate contributor’s activity. We did not take temporal dynamics of commit activity into account, which we discuss as a threat to construct validity under “Threats to validity” section. We collected data for collaboration in NLP, CV, and MM model repositories by filtering repositories based on the tags, which developers add to their repositories to aid discoverability on the HF Hub. We used the list of tags per sub-field provided by the HF Hub, including computer-vision and image-classification for CV models; translation and summarization for NLP models; and image-to-text and image-to-video for MM models.
For RQ3, we collected data on model usage in spaces using the list_models() and model_info() API endpoints. We modelled model usage in spaces as a bipartite network, akin to the representation of software dependency networks [123]. The bipartite model usage network is denoted as \(D = (M, S, E)\), where M is the set of models, S is the set of spaces, and \(E = \{(m, s) \mid m \in M, s \in S\}\) is the set of undirected edges signifying that “space” s uses model m. The edges are unweighted, representing the model usage relationship between a “space” and a model. From the bipartite network D, we derived an undirected model co-usage network \(C = (M, E, W)\). In this network, M is the set of models, \(E = \{(m_i, m_j) \mid m_i, m_j \in M\}\) is the set of undirected edges connecting models based on their co-usage in a “space”, and \(W = \{w_{ij} \mid (m_i, m_j) \in E\}\) is the set of weights assigned to the edges, reflecting the frequency of co-usage of models \(m_i\) and \(m_j\) across spaces. This analysis complements the former analysis of model usage with insights into the interdependencies and ecosystems surrounding widely used models on the HF Hub.
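Deriving the co-usage network \(C\) from the bipartite usage network amounts to projecting it onto the model set, with edge weights counting shared spaces. A minimal sketch, with hypothetical space and model IDs:

```python
from collections import Counter
from itertools import combinations

def co_usage_network(space_models):
    """Project a bipartite space -> models mapping onto models.
    The weight of edge (m_i, m_j) is the number of spaces in which
    the two models are used together."""
    weights = Counter()
    for models in space_models.values():
        for m_i, m_j in combinations(sorted(set(models)), 2):
            weights[(m_i, m_j)] += 1
    return weights

# Hypothetical spaces and model IDs for illustration only.
spaces = {
    "demo-1": ["org/llm", "org/tokenizer"],
    "demo-2": ["org/llm", "org/tokenizer", "org/vision"],
}
co_usage = co_usage_network(spaces)
```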
Username merging
Following prior work, before the analysis, we undertook data preprocessing to merge multiple developer identities per unique developer, which can be caused by how Git records usernames based on users’ local repository credentials [28, 77, 121, 124, 125]. We assumed this might be an issue on the HF Hub, too. To ensure the accuracy of the dataset of 101,144 developers, we applied a three-pronged approach. First, we classified username string similarity (threshold=90%) between pairs of developers who contributed to the same repository, accepting 126 out of 180 (70.00%) pairs based on manual username searches on the HF Hub. Second, in light of the presence of potential real names (i.e. usernames with spaces like “Jessica Smith”), we examined string similarity (threshold=90%) between 1,979 potential real names and the remaining 99,041 usernames, accepting 358 out of 403 (87.75%) username pairs after manual searches on the HF Hub. Finally, we inspected the usernames of 700 developers with a network degree of 10 or higher, who represented 0.7% of developers but accounted for 44.78% of edges, via manual searches on the HF Hub. This resulted in the identification of 212 username pairs. In total, we merged 546 usernames after removing duplicates.
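The candidate-pair step of this approach can be sketched with the standard library’s difflib; we assume a SequenceMatcher ratio of 0.9 approximates the 90% similarity threshold, and the usernames below are invented. As described above, such candidates were verified manually before merging:

```python
from difflib import SequenceMatcher

def similar_usernames(usernames, threshold=0.9):
    """Return candidate duplicate-identity pairs whose case-insensitive
    string similarity meets the threshold. Candidates still require
    manual verification before merging."""
    candidates = []
    names = sorted(set(usernames))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            ratio = SequenceMatcher(
                None, names[i].lower(), names[j].lower()
            ).ratio()
            if ratio >= threshold:
                candidates.append((names[i], names[j]))
    return candidates

# Hypothetical usernames for illustration only.
pairs = similar_usernames(["jessica-smith", "Jessica Smith", "bob"])
```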
Data analysis
To investigate development activity on the HF Hub (RQ1), we conducted a descriptive analysis of various types of activity in 348,181 model repositories, 65,761 dataset repositories, and 156,642 space repositories. Pearson correlation coefficients were calculated to assess the pairwise relationships between the activity variables. In addition, we employed the Mann–Whitney U test to compare the levels of activity across repositories with different licenses (Permissive, Restrictive, and No license). The Mann–Whitney U test is a non-parametric test that examines whether two independent samples come from the same distribution, which does not require the data to be normally distributed or to meet the assumption of homogeneity of variance [126]. Given the large sample sizes, the U values are expected to be large, and the salient test statistic is the p-value which indicates the statistical significance of observed differences. Due to capacity constraints in labelling licenses, we limited this analysis to repositories with licenses used in at least 100 repositories (\(n=339{,}502\), 98% of all repositories). Subsequently, we analysed a snapshot of the social network structure of collaboration on the HF Hub (RQ2), using techniques defined in Table 1 in Appendix 1. This analysis provides insights into collaboration patterns in model repositories at this point in time. Furthermore, we analysed collaboration patterns in the three AI sub-fields (NLP, CV, and MM) to enable comparisons. Lastly, we examined model adoption on the HF Hub (RQ3) by calculating the ranked degree of models in the bipartite model usage networks and ranked degree of models in the model co-usage networks to identify the most used models in spaces and their respective developers. These two complementary approaches quantified model popularity (i.e. which models are most frequently used in spaces) and model co-popularity (i.e. which models are most commonly used in conjunction with other models).
We replicated this analysis for spaces with NLP, CV, and MM tags for comparative analysis of the three AI sub-fields.
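As an illustration of the test procedure, the following sketch applies the Mann–Whitney U test to two synthetic right-skewed samples standing in for activity counts under different licenses; the distributions and sample sizes are assumptions for illustration, not HF Hub data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic right-skewed activity counts for two license groups
# (shapes and sizes are assumptions, not the paper's data).
permissive = rng.pareto(2.0, size=5000)
no_license = rng.pareto(3.0, size=5000)

# Two-sided Mann-Whitney U test: no normality or equal-variance
# assumptions, suitable for heavily skewed count data.
u_stat, p_value = mannwhitneyu(permissive, no_license, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```

With samples this large, as in the paper, U is large by construction and the p-value carries the inferential weight.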
Results
We first report results for activity in the 348,181 model, 65,761 dataset, and 156,642 space repositories in “Development activity on the HF Hub” section, relying on the metrics described in “Data collection” section. We then report results on the structure and dynamics of collaboration in “Social network structure and dynamics of collaboration” section, based on the analysis of collaboration interactions between around 100,000 developers in model repositories. Finally, we present the results of our analysis of model adoption in spaces in “Model adoption in spaces on HF Hub” section, where we examine the distribution of model usage in spaces on the HF Hub and identify the developers of the most used models.
Development activity on the HF Hub
In this section, we present the findings on development activity in the repositories of 348,181 models, 65,761 datasets, and 156,642 spaces on the HF Hub. We present three key findings: right-skewed distributions across different types of activity (“Right-skewed distributions in development activity” section), strong correlations between development activities (“Correlation between community size and engagement” section), and a significant lack of licenses in model and dataset repositories (“Impact of licenses on collaboration” section).
Right-skewed distributions in development activity
Activity per repository is extremely imbalanced, with right-skewed distributions of n_likes, n_discussions, n_commits, and n_downloads across model, dataset, and space repositories (see Fig. 1). For example, while the maximum number of likes amongst models is over 9000, the average model only receives 1.14 likes (see Tables 2, 3, 4). The majority of repositories get minimal engagement. For example, 91% of models and 88% of datasets have 0 likes; 84% of models, 91% of datasets, and 96% of spaces have 0 discussions; and 71% of models and 70% of datasets have 0 downloads. Meanwhile, most activity is concentrated in a small number of repositories. For example, \(<1\)% of models account for 80% of likes, 10% for 80% of discussions, 30% for 80% of commits, and \(<1\)% for 80% of downloads. Upon increasing the threshold, 8% of models account for 99% of likes, 15% for 99% of discussions, and 1% for 99% of downloads.
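Concentration figures of this kind can be computed with a simple cumulative-share routine; the sketch below uses a synthetic heavy-tailed download distribution as a stand-in for the actual counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic download counts: ~70% of repositories at zero plus a
# heavy-tailed remainder, mimicking the skew reported above.
downloads = np.concatenate([
    np.zeros(70_000),
    rng.pareto(1.1, size=30_000) * 100,
])

def share_of_top(counts, fraction_of_total=0.80):
    """Smallest share of repositories that accounts for the given
    fraction of total activity."""
    sorted_desc = np.sort(counts)[::-1]
    cumulative = np.cumsum(sorted_desc)
    n_needed = np.searchsorted(cumulative, fraction_of_total * cumulative[-1]) + 1
    return n_needed / len(counts)

print(f"Top {share_of_top(downloads):.1%} of repositories hold 80% of downloads")
```

On a perfectly uniform distribution the routine returns 0.80; on heavy-tailed data like the above it returns a small fraction, matching the Pareto pattern in the results.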
Most repositories have a community size of 1; for example, 87% of model repositories have 1 contributor and the 75th percentile value of n_committers is 1 across repository types (see Table 2). The respective maximum values of n_committers are 18, 100, and 282 across repository types, and the respective maximum values of n_community are 246, 110, and 4,685. The differences between n_committers and n_community are due to large n_disc_starters values, indicating a division of roles in repositories, where many developers participate in discussions but few are involved in model maintenance. The model repositories with the most n_committers are bigscience/bloom (\(n=18\)), bigcode/santacoder (\(n=16\)), and deepset/roberta-base-squad2 (\(n=15\)).
Correlation between community size and engagement
We correlate frequency counts over the different types of activity described in “Data collection” section (see Fig. 2). In model repositories, we find a strong positive correlation between n_community and n_likes (\(\rho = 0.75\), \(p < 0.001\)). In space repositories, we find strong correlations between various activities, especially between n_likes and n_discussions (\(\rho = 0.74\), \(p < 0.001\)), n_disc_starters (\(\rho = 0.76\), \(p < 0.001\)), and n_community (\(\rho = 0.76\), \(p < 0.001\)). However, in general, we observe weak correlations between most activities in model and dataset repositories. Furthermore, we do not find a strong correlation between commit activity (n_commits) and other types of activity, indicating that commit activity is not strongly linked to community engagement.
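A minimal sketch of the pairwise correlation computation, using synthetic activity counts with an engineered association; the variable names mirror those above, but the data are invented:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 10_000

# Synthetic activity counts with a built-in association between
# community size and likes; commits are generated independently.
n_community = rng.poisson(2, size=n)
n_likes = 2 * n_community + rng.poisson(1, size=n)
n_commits = rng.poisson(5, size=n)

for name, series in [("n_likes", n_likes), ("n_commits", n_commits)]:
    rho, p = pearsonr(n_community, series)
    print(f"n_community vs {name}: rho = {rho:.2f}, p = {p:.3g}")
```

The correlated pair yields a large positive coefficient, while the independent pair yields a coefficient near zero, illustrating the contrast between community engagement and commit activity reported above.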
Impact of licenses on collaboration
A significant proportion of model and dataset repositories lack licenses, which can create uncertainty and potential legal issues for users and developers. Specifying a license is not the norm: the majority of model repositories (65%) and datasets (72%) do not have a license. Amongst the licensed models, the most commonly used licenses are Apache v2.0 (37%), MIT (17%), OpenRAIL (14%), and CreativeML OpenRAIL-M (10%). The most used licenses for datasets are MIT (28%), Apache v2.0 (15%), OpenRAIL (9%), and licenses from the family of Creative Commons v4.0 (7%).
The choice of license matters: there is a moderate to strong correlation between the use of a license and the level of activity in model repositories (see Fig. 3). Furthermore, the Mann–Whitney U tests provide strong evidence of statistically significant differences between collaboration dynamics in model repositories with different types of licenses (all tests have \(p < 0.001\)). Specifically, model repositories with permissive licenses consistently have the highest levels of activity compared to model repositories with no license and those with restrictive licenses (see Table 5). However, repositories with restrictive licenses also exhibit significantly higher activity than those with no license. This pattern holds across all activity metrics measured, suggesting that while permissive licenses foster the highest engagement, restrictive licenses also promote more collaboration compared to model repositories that do not have a license.
Social network structure and dynamics of collaboration
In this section, we present findings from our analysis of a snapshot of the social network structure of collaboration in model repositories on the HF Hub. We begin with the structure and dynamics of collaboration in all model repositories (see “Collaboration in model repositories on the HF Hub” section), and then we compare collaboration patterns in Natural Language Processing (NLP), Computer Vision (CV), and Multimodal (MM) model repositories (see “Collaboration in model repositories in AI sub-fields” section).
Collaboration in model repositories on the HF Hub
The HF Hub collaboration network exhibits right-skewed degree and PageRank centrality distributions, which indicates that influence in the HF developer community is concentrated amongst a small subset of developers. The majority of developers (89%) have not collaborated with others. Excluding these isolate developers, the remaining 10,524 developers have an average degree of 4.10 (SD: 32.63) and node degrees range from 1 to 3140. The right-skewed distributions of degree and PageRank centrality (see Fig. 4) suggest that a small group of influential developers plays a central role in driving collaboration on open models on the HF Hub. Specifically, the degree centrality distribution has a mean of 4 and a median of 2, with a maximum of 3140 and a standard deviation of 33, while the PageRank centrality distribution has a mean and median of 0.0001, a maximum of 0.04, and a standard deviation of 0.0005.
The HF Hub developer community exhibits a core-periphery structure, with a tightly interconnected core of prolific developers. The k-core decomposition analysis reveals that as the k-core value increases, the number of distinct communities decreases, ultimately converging into a single densely interconnected core at k = 26 (see Table 6). The high modularity (0.81) at k = 1 suggests that the whole network consists of loosely connected groups of developers. As the k-core value increases, the modularity decreases to 0.00 at k = 26, indicating a transition from a compartmentalised community structure with distinct clusters or modules to an integrated core characterised by high cohesion and a lack of discernible sub-groups. Concurrently, the sub-network density increases, reaching unity at k = 26.
Collaboration is characterised by high reciprocity values, ranging from 0.81 to 1.00 across all k-core levels (see Table 6), indicating the prevalence of mutual relationships amongst developers. The low assortativity values, ranging from − 0.49 to 0.08, suggest that developers collaborate regardless of their centrality in the network, implying that other factors, such as shared interests, skills, or project roles, may be more significant in driving collaboration than their network centrality. Furthermore, the relatively low average rich club coefficients, ranging from 0.04 to 0.41, indicate that highly central developers do not primarily collaborate with each other, suggesting a lack of elitism amongst the most prolific developers.
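The network measures used in this section (reciprocity, assortativity, and k-core decomposition) can be sketched on a toy core-periphery network; the structure below is an assumption for illustration, not the HF Hub snapshot:

```python
import networkx as nx

# Toy directed collaboration network (assumed structure): a fully
# reciprocated core of four developers plus twenty peripheral
# developers with unreciprocated ties into the core.
G = nx.DiGraph()
core = ["a", "b", "c", "d"]
for u in core:
    for v in core:
        if u != v:
            G.add_edge(u, v)                 # mutual core ties
for i in range(20):
    G.add_edge(f"p{i}", core[i % 4])         # one-way periphery ties

U = G.to_undirected()
print("reciprocity:", nx.reciprocity(G))                       # 12/32 = 0.375
print("assortativity:", nx.degree_assortativity_coefficient(U))

# k-core decomposition of the undirected projection, as in Table 6:
# the periphery drops out at k = 2, leaving only the dense core.
for k in (1, 2, 3):
    sub = nx.k_core(U, k)
    print(f"k={k}: nodes={sub.number_of_nodes()}, density={nx.density(sub):.2f}")
```

In this toy example the assortativity is negative (hubs attach to low-degree periphery) and density reaches 1.00 at the maximal core, mirroring the qualitative pattern reported for the HF Hub network.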
Collaboration in model repositories in AI sub-fields
Collaborations on models in sub-fields of natural language processing (NLP), computer vision (CV), and multimodal (MM), despite the different sizes of the respective communities, are similarly characterised by core-periphery structures with high modularity and low density (see Tables 7, 8, 9). At k = 1, all networks are highly modular (CV: 0.80, NLP: 0.82, MM: 0.71) and have very low density (CV: 0.01, NLP: 0.00, MM: 0.00), implying that collaborations in the respective AI sub-fields are clustered into distinct communities of collaborators. As the k threshold increases, the networks undergo a similar transformation process, with modularity decreasing to 0.00 and the number of communities reducing to a single cohesive community at the maximal k values (CV: 10, NLP: 25, MM: 26). Concurrently, density increases, reaching 1.00 for CV and MM and 0.97 for NLP at their respective maximal k values.
Collaboration in sub-fields is also similarly characterised by reciprocity and connectivity in the core. At k = 1, reciprocity values range from 0.84 to 0.93 and increase to 1.00 at the maximal k for CV and MM, while NLP maintains a high reciprocity of 0.98 at its maximal k. The average degree increases with k for all networks, reaching the corresponding maximal k value at the highest threshold. This suggests that as we move towards the core of the collaboration networks, developers become more interconnected and collaborate with a larger number of peers. However, the low average clustering coefficients and low average rich club coefficients across all networks indicate that the more prolific developers in the respective sub-fields tend to collaborate with a diverse set of individuals rather than forming tightly-knit groups.
Model adoption in spaces on HF Hub
In this section, we present the results of the analysis of model usage in spaces on the HF Hub, shedding light on model adoption and key developers in this ecosystem. Specifically, we present two key findings: model adoption in spaces is characterised by a right-skewed distribution (“Right-skewed distribution of model adoption” section), and a small cohort of developers (in particular, Big Tech companies) build the most used models across all spaces as well as in the three AI sub-fields (“Dominance of a few models by a few developers” section).
Right-skewed distribution of model adoption
The bipartite model usage network displays a disparity in model adoption in spaces. The degree distribution of the bipartite network is right-skewed, as shown in Fig. 5. Only three models are used in 1000 or more spaces, namely runwayml/stable-diffusion-v1-5 (\(n=1747\)), skytnt/anime-seg (\(n=1162\)), and gpt2 (\(n=1002\)). The mean degree (6.68) is significantly higher than the median (1.00), and the large standard deviation (34.75) confirms the high variability in model usage. The majority of models have a low degree of usage, with at least 50% being used in only one space, while a small number of highly popular models dominate the usage, with the maximum degree reaching 1747. This suggests that a few key models are widely adopted in AI applications, while many other models have limited use cases. The model co-usage network provides an additional perspective on the uneven interdependencies of models in spaces, complementing insights gained from examining model downloads or individual model usage in spaces. Specifically, the degree distribution of this network exhibits a multi-modal pattern, with five distinct clusters, each exhibiting a right-skewed shape (see Fig. 5). A small cluster at the far-right tail of the distribution represents a few highly interconnected models with significantly higher co-usage degrees compared to the other clusters.
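The bipartite usage network and its co-usage projection can be sketched as follows; the space and model names are illustrative placeholders, not drawn from the dataset:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite model-usage network: an edge means "this space uses
# this model". Space and model names are illustrative only.
B = nx.Graph()
usage = {
    "space1": ["gpt2", "bert"],
    "space2": ["gpt2"],
    "space3": ["gpt2", "clip"],
    "space4": ["clip", "bert"],
}
for space, used in usage.items():
    B.add_node(space, bipartite=0)
    for m in used:
        B.add_node(m, bipartite=1)
        B.add_edge(space, m)

models = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}

# Model popularity: degree of each model in the bipartite network.
popularity = sorted(((B.degree(m), m) for m in models), reverse=True)
print("most used:", popularity[0])  # (3, 'gpt2')

# Model co-usage: project onto the model side; edge weights count how
# many spaces use the two models together.
co_usage = bipartite.weighted_projected_graph(B, models)
print("co-usage edges:", sorted(co_usage.edges(data="weight")))
```

Ranking models by bipartite degree recovers model popularity, while the weighted projection captures co-popularity, the two complementary measures described in the “Data analysis” section.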
Dominance of a few models by a few developers
When we rank the models by their usage in spaces, we observe that major organisations, rather than individual developers or grassroots initiatives, have developed the most used models. Amongst the 100 most used models in spaces, the following organisations have developed the most models: Meta (\(n=8\)), Google (\(n=7\)), StabilityAI (\(n=5\)), OpenAI (\(n=4\)), Microsoft (\(n=4\)), and Fudan University (\(n=4\)). Together, these organisations account for 33% of the 100 most used models in spaces. We note that the individual user nitrosocke (\(n=5\)), an employee at StabilityAI, ranked highly amongst these organisations. With regards to the model co-usage network, the key developers of the 100 most co-used models in all spaces are: EleutherAI (\(n=15\)), Meta (\(n=12\)), h2oai (\(n=11\)), BigScience (\(n=9\)), and lmsys (\(n=9\)). These five organisations account for 56% of the 100 most co-used models in spaces.
The model usage networks in the sub-fields similarly exhibit right-skewed degree distributions, highlighting the dominance of a minority of models in each sub-field. The most used models in spaces with NLP tags (\(n=3995\)) are gpt2 (\(n=1001\)), bert-base-uncased (\(n=621\)), and gpt2-medium (\(n=445\)). The organisations that developed the most models amongst the 100 most used models are Google (\(n=9\)), Meta (\(n=5\)), and Fudan University (\(n=5\)). For comparison, in the NLP model co-usage network, EleutherAI ranks first (\(n=16\)), followed by h2oai (\(n=12\)) and Meta (\(n=11\)). The most used models in spaces with CV tags (\(n=416\)) are saltacc/anime-ai-detect (\(n=500\)), openai/clip-vit-large-patch14 (\(n=454\)), and openai/clip-vit-base-patch32 (\(n=277\)). The most prolific developer of models in spaces with CV tags is the user lllyasviel (\(n=20\)), followed by Meta (\(n=8\)) and the user DucHaiten (\(n=7\)). For comparison, in the CV model co-usage network, LAION AI ranks as the developer of the most models amongst the top 100 (\(n=17\)). Finally, the most used models in spaces with MM tags (\(n=2394\)) are runwayml/stable-diffusion-v1-5 (\(n=1748\)), CompVis/stable-diffusion-v1-4 (\(n=925\)), and stabilityai/stable-diffusion-2-1 (\(n=854\)). Amongst the developers of the 100 most used models, Stability AI ranks first, with 15 of the 100 most used models and 22 of the top-ranked co-used models in such spaces. These findings highlight the key models and players in the NLP, CV, and MM communities.
Correlations between model likes and model usage
We observe a strong positive correlation between n_likes of models and n_usage_spaces (\(\rho = 0.66\), \(p < 0.001\)), and a weak positive correlation between n_downloads and n_usage_spaces (\(\rho = 0.29\), \(p < 0.001\)). These findings suggest that the number of likes is more strongly associated with the usage of models in spaces compared to the number of downloads, and that likes in model repositories are a good indicator of their adoption in applications on the HF platform. However, as mentioned in “Data collection” section, we note that download counts are limited and therefore may only provide a snapshot of correlations between downloads and likes or usage, which may not generalise in all time periods.
Discussion
In this section, we discuss the key implications of our findings for research and practice. We highlight the study’s contributions to the literature in “Contributions to academic literature” section. We then reflect on the methodological considerations of using the HF Hub as a data source for research on open source AI in “HF Hub: a new source of research data” section. Building on these insights, we make five recommendations for future research to advance the research agenda on open source AI in “Recommendations for future research” section. Finally, we discuss the implications for practice and make recommendations to practitioners in “Implications for practice” section.
Implications for research
Contributions to academic literature
Uneven influence in the HF Hub developer community: We extend prior findings of right-skewed distributions of commit activity in model repositories [22] with observations of right-skewed distributions of various development activities on the HF Hub, including interactions in model, dataset, and space repositories; code collaborations between developers; and model usage in spaces. Activity distributions follow power law patterns, with a small fraction of repositories accounting for most interactions (e.g., \(<1\)% for 80% of likes, 10% for 80% of discussions, 30% for 80% of commits, \(<1\)% for 80% of downloads). Similarly, the collaboration networks exhibit right-skewed centrality distributions, indicating that influence is concentrated amongst few developers, congruent with prior observations that OSS development patterns generally follow Pareto distributions [24,25,26,27,28]. Influence also flows across the HF Hub, with likes per model having strong correlations with their usage in spaces (\(\rho = 0.66\), \(p < 0.001\)).
Impact of license on collaboration: The Mann–Whitney U tests show that license choice significantly impacts the level of activity and engagement in repositories, with permissive licenses exhibiting the highest activity levels, followed by repositories with restrictive licenses, and finally ones with no license. Furthermore, the Pearson correlations indicate that the use of a license (permissive or restrictive) is associated with stronger correlations between various types of activity compared to repositories without licenses. These findings highlight the important role of licensing decisions in influencing the collaborative and community dynamics in open model development and open source AI projects.
Core-periphery structure of the HF developer community: To the best of our knowledge, only one prior study has investigated model development practices in the HF developer community, showing that most models only have one contributor and that model maintenance chiefly involves “perfective tasks” to enhance model performance [22]. We extend this finding with three insights. First, we corroborate the finding that most developers (89%) are isolates who have not collaborated with other developers in model repositories on the HF Hub. This is not unique to the HF Hub: the majority of OSS projects are developed by individuals [75]. However, what may be specific about the small community sizes in model development is the nature of the model development life-cycle (“code once, train often”). Second, the social network structure of collaboration amongst developers in model repositories is characterised by a core-periphery structure, with a dense core of highly active developers, akin to the “layered onion” structure common in OSS [76]. Third, collaborations have high reciprocity and low assortativity, signifying the prevalence of mutual relationships amongst developers, regardless of their social positions in the community.
Uneven model adoption in spaces: By examining model adoption in spaces, we empirically tested the observation of uneven model adoption and the disproportionate influence of industry-leading companies in the open source AI ecosystem [21]. We identified the popularity of a relatively small number of models used in spaces as well as the influential role of a few organisations, including Meta, Google, Stability AI, OpenAI, Microsoft, and EleutherAI, who have developed the most widely used models. Some critics of the open-source model of AI development fear that too many unknown actors will introduce distributed safety issues, while advocates of the development model tout democratisation of power as a core benefit. Our findings show that a few organisations possess majority influence in this ecosystem, which challenges both of these narratives. In many cases, the most influential actors in the open source AI ecosystem are one and the same as those in closed-source AI [21].
HF Hub: a new source of research data
This paper contributes to the research effort to use the HF Hub as a data source for empirical studies on open model development [22, 69, 71]. We share two reflections on methodological considerations. First, informed by prior work that underlines the importance of merging usernames for unique developers, we anticipated that this might be an issue on the HF Hub [28, 77, 121, 124, 125]. While our three-pronged approach strikes a balance between the impracticality of manually inspecting over 100,000 developers versus the risk of misclassification through a fully automated approach, it is still imperfect. Future research may consider more sophisticated approaches to this problem. Second, the API is not optimised for research purposes, which makes data collection time-consuming (e.g., one must make a unique API call to retrieve commit histories of each model and handle rate limits) and limited (e.g., user metadata is not available). The lack of user metadata hinders the ability to study the characteristics and behaviours of individual developers, such as their expertise and affiliations, as well as automated approaches to username merging that incorporate user metadata. To overcome these limitations, researchers may explore alternative approaches and tools, such as the HFCOMMUNITY database developed by Ait et al. to facilitate empirical studies of activity on the platform [71].
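One automated step in a username-merging pipeline can be sketched as string-similarity candidate flagging; the threshold and usernames below are assumptions for illustration, and flagged pairs would still require manual review before merging:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical usernames; "alice-hf" and "alice_hf" plausibly belong
# to the same developer, while "bob" and "bobby" may not.
usernames = ["alice-hf", "alice_hf", "bob", "bobby", "carol"]

def candidate_pairs(names, threshold=0.85):
    """Flag username pairs whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

print(candidate_pairs(usernames))
```

A fully automated approach risks misclassification (e.g., "bob" vs "bobby"), which is why a hybrid of automated flagging and manual inspection, as described above, remains imperfect but practical.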
Recommendations for future research
We recommend five research directions that can advance the research agenda on open source AI.
1. Implications of concentrations in the HF Hub developer community: We confirm prior observations that the models of a handful of companies are dominant amongst the HF Hub developer community [21]. We encourage future research to investigate what these concentrations mean in practice, such as the potential benefits that these companies accrue from their open model ecosystems, including increased visibility, crowdsourced contributions (e.g., via commits and discussions), and access to diverse fine-tuned versions shared by other developers on the HF Hub. Furthermore, there is a concern that dominant companies benefit from developers being locked-in to their ecosystems, potentially limiting competition and entrenching their dominance. Future research could investigate the factors contributing to such concentrations, such as the reputation of the companies developing the models, their access to resources and support, or the perceived performance and versatility of their models, as well as the implications of these concentrations for the broader AI community, including the impact on research, innovation, and the distribution of benefits and resources.
2. Incentives and modes of participation: Future research could investigate the incentives of individual developers and companies. A number of companies have released open models on the HF Hub, such as Meta’s LLaMA models [127], Mistral AI’s Mixtral models [128], and OpenAI’s Whisper models [129]. Often these releases are presented as acts of “AI democratisation” [130]. Future research could critically examine the commercial incentives behind these releases. In addition, future research could examine commercial approaches to model governance and maintenance—for example, if and how companies welcome or engage with community contributions—and if and how companies collaborate with each other on open model development, as they do in OSS development [108, 110,111,112,113].
3. Collaboration dynamics in active repository communities: We know that model maintenance focuses on model performance improvements [22]; and in the minority of repositories that have active communities, most developers contribute to discussions rather than commits (see Tables 2, 3, 4). Going further, we encourage researchers to examine collaboration dynamics in repositories with active communities from multiple angles. Given the sizeable differences in n_committers and n_disc_starters, future research could investigate the division of roles between discussion and code contributors, typical topics of discussion (e.g., model performance, new ideas, etc.), how discussions inform model maintenance if at all, and the journeys of developers from discussion contributors to committers, amongst others. In addition, future research could examine the governance approaches (e.g., contribution policies) that repository owners use to encourage collaboration. Future analyses could also take into account temporal dynamics, providing insights into evolving patterns, social structures, and trends of open model developer communities on the HF Hub.
4. Impact of model size on collaboration: Future research should examine the impact of model size (i.e., parameters) on the nature of collaboration in repositories on the HF Hub. For instance, it could examine how resource constraints (e.g., computational power or data availability) influence collaboration for various stakeholders (e.g., individual developers or developers from industry labs) on models of different sizes. By shedding light on facilitators and barriers for collaboration on open models, such research could guide efforts to foster inclusive and diverse communities.
5. Collaboration beyond the HF Hub: While this analysis provides insights into the developer community that shares and fine-tunes models on the HF Hub, we have a limited understanding of how the various components of models are developed [4], a process that largely takes place in proprietary settings or on other platforms like GitHub [14]. We encourage future research to examine how the HF Hub is used in the wider ecosystem of platforms and offline venues for the collaborative development of open models and datasets. This research direction would enable comparisons of the collaboration patterns amongst model developers and model fine-tuners. In addition, researchers could undertake a multi-sited analysis, examining collaboration on the same project across platforms.
Implications for practice
Recommendations for open source practitioners
Beyond our academic research suggestions, we encourage open source researchers and practitioners to develop standardised metrics for studying open model development. Groups like the Linux Foundation’s Community Health Analytics in Open Source Software (CHAOSS) working group [131], which has created metrics to assess the health and sustainability of OSS developer communities, are well-positioned to lead this effort. The lack of empirical data on open model development hinders evidence-based decision-making in this rapidly evolving field, and by working together to establish appropriate metrics, open source practitioners can help to address the data gap regarding open models.
Recommendations for platform providers
We make two recommendations to HF as a platform for open model development. First, HF could work with researchers to identify features and API improvements that would aid research efforts concerning open model development on its platform, building on efforts by members of the HF community, such as Weyaxi/huggingface-leaderboard. This collaboration could include collecting and publishing data on open model development patterns and collaboration, which would help fill the current “data gap” in this area. HF may take inspiration from GitHub’s Innovation Graph [132] or its annual Octoverse reports [133], which provide access to data and insights on development activity on its platform. Second, a concerning proportion of models (64.67%) and datasets (72.13%) lack licenses, which may be due to uncertainty about how or whether they should be licensed [67, 134]. For comparison, the proportion of unlicensed repositories on GitHub is lower, at 46% (53% if including “other licenses”) [135]. In the interest of promoting responsible development, HF should consider developing educational resources on licenses, such as guides or tutorial videos, or developing features, such as a license drop-down menu, which can inform developers of the options available as well as their merits and drawbacks. Such a feature could be considered amongst other recommendations to moderate models on the HF Hub, such as hiring AI safety researchers and proactively red-teaming unsafe models [53, 70].
Recommendations for policymakers
As open models become increasingly widely available and used, policymakers need empirical data to inform discussions about the benefits, risks, and governance of these models. Our analysis provides one empirical lens on the extent of model proliferation and adoption, which can help ground policy decisions. For example, it is illuminating to observe that most models (70.99%) have never been downloaded or that 1% of models account for 99% of downloads. This is a reminder that the availability of a model does not mean it will be (widely) used. Furthermore, while download counts were limited to the past 30 days, the fact that only 86 models had over one million downloads indicates that the number of widely used models is not excessively large and may be governable. What is more, the analysis revealed the impact of models developed by a number of non-profit, grassroots initiatives like EleutherAI, BigScience, and BigCode. Following the charge of the French government to fund the digital commons to support open model development [59], policymakers may use such data to identify non-commercial projects that could be supported. Overall, the data points reported in “Development activity on the HF Hub” section could help policymakers assess the real-world impact of open models and develop appropriate governance frameworks to maximise their benefits while mitigating potential risks of open models.
Threats to validity
We evaluate the validity of our findings by following guidance for empirical software engineering research [114, 136].
Construct validity
Construct validity concerns the extent to which a measurement accurately assesses the theoretical construct it intends to measure. Our study aimed to measure typical patterns of development activity on the HF Hub, but we acknowledge several threats to construct validity. First, our analysis is limited to activity in public repositories and does not account for collaboration in private repositories. Second, download counts have a few limitations: they are limited to the past 30 days, download counts may be incorrectly reported (e.g., if the repository lacks a configuration file or if the model is used on-device versus in continuous integration), and dataset downloads are limited to the count of load_dataset() calls [115, 116]. Third, our operationalisation of collaboration relies on commits to model repositories, assuming that the co-occurrence of commits indicates collaboration. However, this assumption may not always hold true, especially in large repositories where developers may work on independent tasks. Future research could operationalise collaboration on specific files and quantify the relative contribution of developers to specific files [80]. Furthermore, this analysis is limited to a snapshot of the HF Hub developer community in October 2023, which does not capture the dynamics of collaboration and activity over time; future research should consider temporal dynamics, as discussed in “Recommendations for future research” section.
Internal validity
Internal validity concerns the extent to which a study can confidently attribute the observed results to the investigated variables, minimising the influence of confounding factors or alternative explanations. As explained in “Username merging” section, there may be a slight inaccuracy in the enumeration of community size per repository and the number of developers included in the collaboration networks due to discrepancies in username data, such as multiple accounts or usernames per developer. This is a common problem in OSS research, and there is no perfect solution to username merging [77, 121, 125]. API limitations prevent the use of methods that incorporate user metadata for username merging [28, 137]. For example, we rejected 34 username pairs due to insufficient evidence to confirm the match with confidence.
External validity
External validity concerns the generalisability of the findings. While the HF Hub has gained significant popularity, there may be other platforms where open model development takes place, and our findings may not generalise to them. Future research could explore collaboration practices across different platforms to provide a more comprehensive view of the open source AI ecosystem. That being said, we observe that development activity on the HF Hub is characterised by the Pareto principle, conforming with OSS development patterns on platforms like GitHub [24,25,26,27,28]. Another threat to external validity concerns the analysis of model usage. While there were as many as 156,642 spaces at the time of data collection, spaces do not represent the use of open models beyond the HF Hub platform, thus limiting the generalisability of our claims; however, we do find a strong positive correlation between likes of model repositories and their usage in spaces (\(\rho = 0.66\), \(p < 0.001\)). Future research could address this limitation by exploring other sources of data on model adoption, such as academic publications, industry reports, or user surveys, to triangulate the findings.
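The reported correlation between repository likes and usage in spaces is a Spearman rank correlation. As a minimal illustration, the sketch below reimplements the statistic in plain Python on synthetic data (in practice one would use a statistics library such as scipy.stats.spearmanr, which also reports the p-value):

```python
def ranks(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Synthetic example: likes and space-usage counts for six models
likes      = [0, 2, 5, 40, 120, 900]
space_uses = [0, 1, 0, 12, 30, 400]
rho = spearman(likes, space_uses)
```

Because the statistic is computed on ranks rather than raw counts, it is robust to the extreme right skew of activity data described throughout the paper.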
Reliability
Reliability refers to the consistency and reproducibility of the study’s results. To enhance the reliability of our study, we have uploaded the Python scripts used for data collection and processing to a public GitHub repository [119]. Due to privacy and ethical considerations, we do not share the raw data (see Data Availability statement).
Conclusion
The burgeoning open source AI ecosystem has become a focal point of discussion amongst AI researchers, developers, and policymakers. This study offers empirical insights into practices in this emerging ecosystem via a quantitative analysis of development activity on the HF Hub. Concretely, we make three empirical contributions to the nascent research agenda on open source AI. First, we find that various types of development activity, from likes and downloads to discussions and commits, across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. In addition, activity and engagement are highly imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Second, we analyse a snapshot of the social network structure of collaboration in model repositories, finding that the community has a core-periphery structure, with a core of highly prolific developers and a majority of isolate developers (89%) who do not collaborate with others. However, collaboration is characterised by high reciprocity and low levels of assortativity regardless of developers’ social positions in the HF developer community. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of industry-leading companies, are widely used, which signifies the concentrated influence of a few actors in the HF Hub ecosystem. These findings are a timely reminder that open source AI is not immune to the influence of dominant industry leaders [21]. We conclude with a discussion of the implications of our findings and recommendations for AI researchers, practitioners, and policymakers, in the hope that practices in open model development will be investigated more deeply in the future.
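The headline concentration finding (1% of models accounting for 99% of downloads) reduces to a simple top-share computation. The sketch below illustrates it on synthetic, right-skewed download counts; the numbers are hypothetical and chosen only to mimic the qualitative pattern reported above.

```python
def top_share(counts, fraction=0.01):
    """Share of the total held by the top `fraction` of items."""
    ordered = sorted(counts, reverse=True)
    k = max(1, int(len(ordered) * fraction))  # at least one item
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Synthetic right-skewed downloads: one blockbuster, a few modest
# models, and many models with zero downloads (100 "models" in total)
downloads = [100_000] + [10] * 9 + [0] * 90
share = top_share(downloads, fraction=0.01)
```

On such a distribution the top 1% of repositories captures nearly the entire download total, which is the Pareto-style pattern the paper reports for the HF Hub.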
Data availability
The authors may be contacted with enquiries about the research data, which may be shared upon reasonable request and only in compliance with applicable data protection regulations.
Notes
As defined by the OSI [2], OSS is software source code that anyone can inspect, use, modify, or redistribute.
The authors define “suitability for empirical research” as “the amount and adequacy of the features to enable software development practices and the sufficient quantity of data to enable the conduction of empirical studies about such practices” [69].
Model maintenance is defined as “a higher number of commits, regular commit frequency, shorter intervals between commits, fewer days without commits, and a slightly higher number of authors” [22].
N.B. We do not report data for downloads of spaces because spaces cannot be downloaded.
N.B. As per the API, data collection for participation in discussions was limited to users that had started discussions. It was not possible to collect data about users that had made comments in discussion threads.
References
OSI. (2024). The Open Source AI definition—Draft v. 0.0.8. https://opensource.org/deepdive/drafts/the-open-source-ai-definition-draft-v-0-0-8. Accessed 1 May 2024.
OSI. (2007). The Open Source definition (v1.9). https://opensource.org/osd/. Accessed 10 April 2023.
Langenkamp, M., & Yue, D. N. (2022). How open source machine learning software shapes AI. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’22 (pp. 385–395). Association for Computing Machinery. https://doi.org/10.1145/3514094.3534167. Accessed 17 August 2023.
White, M., Haddad, I., Osborne, C., Xiao-Yang, L., Abdelmonsef, A., & Varghese, S. (2024). The model openness framework: Promoting completeness and openness for reproducibility, transparency and usability in AI. https://doi.org/10.48550/arXiv.2403.13784. arXiv:2403.13784 [cs]. Accessed 30 May 2024.
arXiv. (2024). arXiv.org e-Print archive. https://arxiv.org/. Accessed 19 April 2024.
Kaggle. (2024). Find open datasets and machine learning projects | Kaggle. https://www.kaggle.com/datasets. Accessed 19 April 2024.
CommonCrawl. (2024). Common Crawl—Open repository of Web Crawl data. https://commoncrawl.org/. Accessed 1 May 2024.
ImageNet. (2024). ImageNet. https://www.image-net.org/. Accessed 1 May 2024.
Ahmed, N., Wahed, M., & Thompson, N. C. (2023). The growing influence of industry in AI research. Science, 379(6635), 884–886. https://doi.org/10.1126/science.ade2420
Tarkowski, A. (2023). The mirage of open-source AI: Analyzing Meta’s Llama 2 release strategy. https://openfuture.eu/blog/the-mirage-of-open-source-ai-analyzing-metas-llama-2-release-strategy. Accessed 18 September 2023.
EleutherAI. (2021). EleutherAI models. https://www.eleuther.ai/releases. Accessed 18 September 2023.
Akiki, C., Pistilli, G., Mieskes, M., Gallé, M., Wolf, T., Ilić, S., & Jernite, Y. (2022). BigScience: A case study in the social construction of a multilingual large language model. https://doi.org/10.48550/arXiv.2212.04960. arXiv:2212.04960 [cs]. Accessed 6 October 2023.
HuggingFace. (2024). BigCode—Open and responsible development and use of LLMs for code. https://www.bigcode-project.org/. Accessed 19 April 2024.
Ding, J., Akiki, C., Jernite, Y., Steele, A. L., & Popo, T. (2023). Towards openness beyond open access: User journeys through 3 Open AI Collaboratives. https://doi.org/10.48550/arXiv.2301.08488. arXiv:2301.08488 [cs]. Accessed 6 October 2023.
HuggingFace. (2024). Hugging Face Hub API. https://huggingface.co/docs/huggingface_hub/v0.5.1/en/package_reference/hf_api. Accessed 19 April 2024.
Law, H., & Krier, S. (2023). Open-source provisions for large models in the AI Act. Cambridge University Science and Policy Exchange. Accessed 9 August 2023.
Solaiman, I. (2023). The gradient of generative AI release: Methods and considerations. https://doi.org/10.48550/arXiv.2302.04844. arXiv:2302.04844 [cs]. Accessed 9 August 2023.
Kapoor, S., Bommasani, R., Klyman, K., Longpre, S., Ramaswami, A., Cihon, P., Hopkins, A., Bankston, K., Biderman, S., Bogen, M., Chowdhury, R., Engler, A., Henderson, P., Jernite, Y., Lazar, S., Maffulli, S., Nelson, A., Pineau, J., Skowron, A., Song, D., Storchan, V., Zhang, D., Ho, D. E., Liang, P., & Narayanan, A. (2024). On the societal impact of open foundation models. https://crfm.stanford.edu/open-fms/paper.pdf
Seger, E., Dreksler, N., Moulange, R., Dardaman, E., Schuett, J., Wei, K., Winter, C., Arnold, M., Ó hÉigeartaigh, S., Korinek, A., Anderljung, M., Bucknall, B., Chan, A., Stafford, E., Koessler, L., Ovadya, A., Garfinkel, B., Bluemke, E., Aird, M., Levermore, P., Hazell, J., & Gupta, A. (2023). Open-sourcing highly capable foundation models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives. https://doi.org/10.48550/arXiv.2311.09227. arXiv:2311.09227 [cs]. Accessed 12 February 2024.
Eiras, F., Petrov, A., Vidgen, B., Schroeder, C., Pizzati, F., Elkins, K., Mukhopadhyay, S., Bibi, A., Purewal, A., Botos, C., Steibel, F., Keshtkar, F., Barez, F., Smith, G., Guadagni, G., Chun, J., Cabot, J., Imperial, J., Nolazco, J. A., Landay, L., Jackson, M., Torr, P. H. S., Darrell, T., Lee, Y., & Foerster, J. (2024). Risks and opportunities of open-source generative AI. https://doi.org/10.48550/arXiv.2405.08597. arXiv:2405.08597 [cs]. Accessed 28 May 20.
Widder, D. G., West, S., & Whittaker, M. (2023). Open (for business): Big Tech, concentrated power, and the political economy of Open AI. Rochester, NY. https://papers.ssrn.com/abstract=4543807. Accessed 18 August 2023.
Castaño, J., Martínez-Fernández, S., Franch, X., & Bogner, J. (2024). Analyzing the evolution and maintenance of ML models on Hugging Face. https://doi.org/10.48550/arXiv.2311.13380. arXiv:2311.13380 [cs]. Accessed 5 April 2024.
Heltweg, P., & Riehle, D. (2023). A systematic analysis of problems in open collaborative data engineering. ACM Transactions on Social Computing, 6(3–4), 8:1–8:30. https://doi.org/10.1145/3629040.
Goeminne, M., & Mens, T. (2011). Evidence for the Pareto principle in open source software activity. In The joint proceedings of the 1st international workshop on model driven software maintenance and 5th international workshop on software quality and maintainability (pp. 74–82). Citeseer. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=75780c99b5f30e13a7682b2900289cfff75807c4#page=78
Mockus, A., Fielding, R., & Herbsleb, J. (2002). Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3), 309–346.
Szymański, K., & Ochodek, M. (2023). On the applicability of the Pareto principle to source-code growth in open source projects. In 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS) (pp. 781–789). https://doi.org/10.15439/2023F5221. Accessed 19 November 2023.
Xu, J., Christley, S., & Madey, G. (2006). 12–Application of social network analysis to the study of open source software. In J. Bitzer & P. J. H. Schröder (Eds.), The economics of open source software development (pp. 247–269). Elsevier. https://doi.org/10.1016/B978-044452769-1/50012-3
Zhang, Y., Zhou, M., Mockus, A., & Jin, Z. (2021). Companies’ participation in OSS development—An empirical study of OpenStack. IEEE Transactions on Software Engineering, 47(10), 2242–2259. https://doi.org/10.1109/TSE.2019.2946156
PaperswithCode. (2023). Papers with Code. https://paperswithcode.com/trends. Accessed 18 September 2023.
Gururaja, S., Bertsch, A., Na, C., Widder, D. G., & Strubell, E. (2023). To build our future, we must know our past: contextualizing paradigm shifts in natural language processing. https://doi.org/10.48550/arXiv.2310.07715. arXiv:2310.07715 [cs]. Accessed 16 May 2024.
Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K. R., Pereira, F., Rasmussen, C. E., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., & Williamson, R. C. (2007). The need for open source software in machine learning. Journal of Machine Learning Research, 8, 2443–2466.
Osborne, C. (2024). Public-private funding models in open source software development: A case study on scikit-learn. arXiv:2404.06484. Accessed 10 April 2024.
Haddad, I. (2022). Artificial intelligence and data in open source. Technical report, Linux Foundation. https://8112310.fs1.hubspotusercontent-na1.net/hubfs/8112310/LF
HuggingFace. (2023). Transformers. https://huggingface.co/docs/transformers/index. Accessed 26 December 2023.
GitHub. (2023). Machine Learning and Artificial Intelligence repositories on GitHub. https://github.com. Accessed 18 September 2023.
Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, M., Newhouse, A., Blazakis, J., McGuffie, K., & Wang, J. (2019). Release strategies and the social impacts of language models. https://doi.org/10.48550/arXiv.1908.09203. arXiv:1908.09203 [cs]. Accessed 9 August 2023.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’21 (pp. 610–623). Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922. Accessed 16 May 2022.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., & Leahy, C. (2020). The pile: An 800GB dataset of diverse text for language modeling. https://doi.org/10.48550/arXiv.2101.00027. arXiv:2101.00027 [cs]. Accessed 9 August 2023.
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., & Weinbach, S.(2022). GPT-NeoX-20B: An open-source autoregressive language model. https://doi.org/10.48550/arXiv.2204.06745. arXiv:2204.06745 [cs]. Accessed 9 August 2023.
Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., Tow, J., Rush, A. M., Biderman, S., Webson, A., et al. (2023). BLOOM: A 176B-parameter open-access multilingual language model. https://doi.org/10.48550/arXiv.2211.05100. arXiv:2211.05100 [cs]. Accessed 9 August 2023.
Stability AI. (2022). Stable Diffusion public release. https://stability.ai/blog/stable-diffusion-public-release. Accessed 9 August 2023.
Meta. (2023). Meta and Microsoft introduce the next generation of Llama. https://about.fb.com/news/2023/07/llama-2/. Accessed 8 October 2023.
Bdeir, A., & François, C. (2024). Introducing the Columbia convening on openness and AI. https://blog.mozilla.org/en/mozilla/ai/introducing-columbia-convening-openness-and-ai/. Accessed 25 March 2024.
Cihon, P. (2024). Helping policymakers weigh the benefits of open source AI. https://github.blog/2024-04-10-helping-policymakers-weigh-the-benefits-of-open-source-ai/. Accessed 12 April 2024.
Raymond, E. S. (2001). The Cathedral and the Bazaar: Musings on Linux and open source by an accidental revolutionary. O’Reilly Media, Incorporated.
Wladawsky-Berger, I. (2023). Are open AI models safe?. https://www.linuxfoundation.org/blog/are-open-ai-models-safe. Accessed 13 June 2023.
Pipatanakul, K., Jirabovonvisut, P., Manakul, P., Sripaisarnmongkol, S., Patomwong, R., Chokchainant, P., & Tharnpipitchai, K. (2023). Typhoon: Thai large language models. https://doi.org/10.48550/arXiv.2312.13951. arXiv:2312.13951 [cs]. Accessed 29 February 2024.
Nguyen, T. T., Nguyen, Q. V. H., Nguyen, D. T., Nguyen, D. T., Huynh-The, T., Nahavandi, S., Nguyen, T. T., Pham, Q.-V., & Nguyen, C. M. (2022). Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223, 103525. https://doi.org/10.1016/j.cviu.2022.103525. arXiv:1909.11573 [cs, eess].
Lakatos, S. (2023). A revealing picture: AI-generated ‘undressing’ images move from niche pornography discussion forums to a scaled and monetized online business. Technical report (December). https://graphika.com/reports/a-revealing-picture. Accessed 9 February 2024.
Thiel, D., Stroebel, M., & Portnoff, R. (2023). Generative ML and CSAM: Implications and mitigations. Technical report, Stanford University.
Goldstein, J. A., Sastry, G., Musser, M., DiResta, R., Gentzel, M., & Sedova, K. (2023). Generative language models and automated influence operations: Emerging threats and potential mitigations. https://doi.org/10.48550/arXiv.2301.04246. arXiv:2301.04246 [cs]. Accessed 9 August 2023.
Musser, M. (2023). A cost analysis of generative language models and influence operations. https://doi.org/10.48550/arXiv.2308.03740. arXiv:2308.03740 [cs]. Accessed 9 February 2024.
Tsamados, A., Floridi, L., & Taddeo, M. (2023). The cybersecurity crisis of artificial intelligence: Unrestrained adoption and natural language-based attacks. Rochester, NY. https://doi.org/10.2139/ssrn.4578165. Accessed 8 October 2023.
David, C., & Paul, J. (2023). ChatGPT and large language models: What’s the risk?. https://www.ncsc.gov.uk/blog-post/chatgpt-and-large-language-models-whats-the-risk. Accessed 11 August 2023.
Gulson, K. N., & Webb, P. T. (2021). Steering the mind share: Technology companies, policy and AI research in universities. Discourse: Studies in the Cultural Politics of Education. https://doi.org/10.1080/01596306.2021.1981828
Patel, D., & Ahmad, A. (2023). Google “We have no moat, and neither does OpenAI”. https://www.semianalysis.com/p/google-we-have-no-moat-and-neither. Accessed 27 July 2023.
Wiggers, K. (2023). 5 investors on the pros and cons of open source AI business models. https://techcrunch.com/2023/10/18/pros-cons-open-source-ai-business-models/. Accessed 19 April 2024.
Abboud, L., Levingston, I., & Hammond, G. (2024). Mistral in talks to raise €500mn at €5bn valuation. Financial Times. Accessed 19 April 2024.
Chatterjee, M., & Volpicelli, G. (2023). France bets big on open-source AI. https://www.politico.eu/article/open-source-artificial-intelligence-france-bets-big/. Accessed 9 August 2023.
Mozilla Foundation. (2023). Introducing Mozilla.ai: Investing in trustworthy AI | The Mozilla Blog. https://blog.mozilla.org/en/mozilla/introducing-mozilla-ai-investing-in-trustworthy-ai/. Accessed 30 October 2023.
Lehdonvirta, V., Wu, B., & Hawkins, Z. (2023). Cloud empires’ physical footprint: How trade and security politics shape the global expansion of U.S. and Chinese data centre infrastructures. Rochester, NY. https://doi.org/10.2139/ssrn.4670764. Accessed 9 January 2024.
Srnicek, N. (2022). Data, compute, labor. In M. Graham, & F. Ferrari (Eds.), Digital work in the planetary market. https://direct.mit.edu/books/oa-edited-volume/5319/chapter/3800166/Data-Compute-Labor. Accessed 26 May 2022.
Maffulli, S. (2023). Meta’s LLaMa 2 license is not Open Source. https://blog.opensource.org/metas-llama-2-license-is-not-open-source/. Accessed 11 August 2023.
Nolan, M. (2023). Llama and ChatGPT are not open-source. IEEE Spectrum. Accessed 18 August 2023.
Liesenfeld, A., & Dingemanse, M. (2024). Rethinking open source generative AI: Open-washing and the EU AI Act. ACM. https://pure.mpg.de/pubman/faces/ViewItemOverviewPage.jsp?itemId=item_3588217. Accessed 3 June 2024.
Liesenfeld, A., Lopez, A., & Dingemanse, M. (2023). Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators. In Proceedings of the 5th international conference on conversational user interfaces. CUI ’23 (pp. 1–6). Association for Computing Machinery. https://doi.org/10.1145/3571884.3604316. Accessed 18 August 2023.
OSI. (2023). Deep Dive: AI. https://opensource.org/deepdive/webinars/. Accessed 2 November 2023.
HuggingFace. (2024). Hugging Face Hub. https://huggingface.co/. Accessed 19 April 2024.
Ait, A., Izquierdo, J. L. C., & Cabot, J. (2023). On the suitability of Hugging Face Hub for empirical studies. https://doi.org/10.48550/arXiv.2307.14841. arXiv:2307.14841 [cs]. Accessed 5 April 2024.
Gorwa, R., & Veale, M. (2024). Moderating model marketplaces: Platform governance puzzles for AI intermediaries. https://doi.org/10.48550/arXiv.2311.12573. arXiv:2311.12573 [cs]. Accessed 16 May 2024.
Ait, A., Izquierdo, J. L. C., & Cabot, J. (2023). HFCommunity: A tool to analyze the Hugging Face Hub Community. In 2023 IEEE international conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 728–732). ISSN: 2640-7574. https://doi.org/10.1109/SANER56733.2023.00080. Accessed 5 April 2024.
Castaño, J., Martínez-Fernández, S., Franch, X., & Bogner, J. (2023). Exploring the carbon footprint of Hugging Face’s ML models: A repository mining study. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (pp. 1–12). https://doi.org/10.1109/ESEM56168.2023.10304801. Accessed 17 May 2024.
Eghbal, N. (2020). Working in public: The making and maintenance of open source software. Stripe Press.
Zhou, M., Mockus, A., Ma, X., Zhang, L., & Mei, H. (2016). Inflow and retention in OSS communities with commercial involvement: A case study of three hybrid projects. ACM Transactions on Software Engineering and Methodology, 25(2), 13:1–13:29. https://doi.org/10.1145/2876443.
Krishnamurthy, S. (2005). Cave or community? An empirical examination of 100 mature open source projects. First Monday. https://doi.org/10.5210/fm.v0i0.1477.
Crowston, K., Annabi, H., Howison, J., & Masango, C. (2005). Effective work practices for FLOSS development: A model and propositions. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (p. 197). IEEE. ISSN: 1530-1605.
Bird, C., Gourley, A., Devanbu, P., Gertz, M., & Swaminathan, A. (2006). Mining email social networks. In International Conference on Software Engineering: Proceedings of the 2006 International Workshop on Mining Software Repositories; 22–23 May 2006 (pp. 137–143). ACM.
Crowston, K., & Howison, J. (2006). Hierarchy and centralization in free and open source software team communications. Knowledge, Technology & Policy, 18(4), 65–85. https://doi.org/10.1007/s12130-006-1004-8.
Long, Y., & Siau, K. (2007). Social network structures in open source software development teams. Journal of Database Management, 18(2), 25–40. https://doi.org/10.4018/jdm.2007040102
Orucevic-Alagic, A., & Host, M. (2014). Network analysis of a large scale open source project. In 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications (pp. 25–29). IEEE. https://doi.org/10.1109/SEAA.2014.50. Accessed 18 March 2022.
Juran, J. M. (2005). Juran: Critical evaluations in business and management. Psychology Press.
Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999). On power-law relationships of the Internet topology. ACM SIGCOMM Computer Communication Review, 29(4), 251–262. https://doi.org/10.1145/316194.316229.
Mahanti, A., Carlsson, N., Mahanti, A., Arlitt, M., & Williamson, C. (2013). A tale of the tails: Power-laws in internet measurements. IEEE Network, 27(1), 59–64. https://doi.org/10.1109/MNET.2013.6423193.
Yamashita, K., McIntosh, S., Kamei, Y., Hassan, A. E., & Ubayashi, N. (2015). Revisiting the applicability of the Pareto principle to core development teams in open source software projects. In Proceedings of the 14th international workshop on principles of software evolution (pp. 46–55). ACM.
Geiger, R. S., Howard, D., & Irani, L. (2021). The labor of maintaining and scaling free and open-source software projects. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), 175:1–175:28. https://doi.org/10.1145/3449249.
Hossain, A. (2021). Regional OSS communities: The view from Dhaka, Bangladesh. https://www.fordfoundation.org/media/6667/regional-foss-communities_final-report_ahossain-1.pdf
Takhteyev, Y. (2012). Coding places: Software practice in a South American city. Acting with technology. MIT Press.
Feller, J., & Fitzgerald, B. (2002). Understanding open source software development. Pearson Education.
Bonaccorsi, A., & Rossi, C. (2006). Comparing motivations of individual programmers and firms to take part in the open source movement: From community to business. Knowledge, Technology & Policy, 18, 40–64.
Li, X., Zhang, Y., Osborne, C., Zhou, M., Jin, Z., & Liu, H. (2024). Systematic literature review of commercial participation in open source software. arXiv:2405.16880 [cs]. Accessed 28 May 2024.
Krogh, G., Haefliger, S., Spaeth, S., & Wallin, M. W. (2012). Carrots and rainbows: Motivation and social practice in open source software development. MIS Quarterly, 36(2), 649–676. https://doi.org/10.2307/41703471
Shah, S. K. (2006). Motivation, governance, and the viability of hybrid forms in open source software development. Management Science, 52(7), 1000–1014. https://doi.org/10.1287/mnsc.1060.0553
Lakhani, K. R., & Wolf, R. G. (2003). Why hackers do what they do: Understanding motivation and effort in free/open source software projects, Rochester, NY. https://doi.org/10.2139/ssrn.443040. Accessed 4 April 2023.
Ghosh, R. A., Glott, R., Krieger, B., & Robles, G. (2002). Free/Libre and open source software: Survey and study. International Institute of Infonomics.
Brooke, S. (2021). Trouble in programmer’s paradise: Gender-biases in sharing and recognising technical knowledge on Stack Overflow. Information, Communication & Society, 24(14), 2091–2112. https://doi.org/10.1080/1369118X.2021.1962943
Vasilescu, B., Capiluppi, A., & Serebrenik, A. (2014). Gender, representation and online participation: A quantitative study. Interacting with Computers, 26(5), 488–511. https://doi.org/10.1093/iwc/iwt047.
Braesemann, F., Stoehr, N., & Graham, M. (2019). Global networks in collaborative programming. Regional Studies, Regional Science, 6(1), 371–373. https://doi.org/10.1080/21681376.2019.1588155
Williams, A. (2023). Enabling global collaboration. Technical report, Linux Foundation, San Francisco, CA, USA. https://www.linuxfoundation.org/research/open-source-fragmentation. Accessed 31 October 2023.
Subramanyam, R., & Xia, M. (2008). Free/Libre Open Source Software development in developing and developed countries: A conceptual framework with an exploratory study. Decision Support Systems, 46(1), 173–186. https://doi.org/10.1016/j.dss.2008.06.006
Agerfalk, P. J., & Fitzgerald, B. (2008). Outsourcing to an unknown workforce: Exploring opensourcing as a global sourcing strategy. MIS Quarterly, 32(2), 385–409.
Birkinbine, B. (2020). Incorporating the digital commons: Corporate involvement in free and open source software. University of Westminster Press. https://doi.org/10.16997/book39
West, J., & Gallagher, S. (2006). Challenges of open innovation: The paradox of firm investment in open-source software. SSRN Scholarly Paper ID 904436. Social Science Research Network. https://doi.org/10.1111/j.1467-9310.2006.00436.x. Accessed 11 February 2022.
Chesbrough, H. (2023). Measuring the economic value of open source. Technical report, Linux Foundation, San Francisco, CA, USA. https://www.linuxfoundation.org/research/measuring-economic-value-of-os. Accessed 6 March 2023.
Lindman, J., Juutilainen, J.-P., & Rossi, M. (2009). Beyond the business model: Incentives for organizations to publish software source code. In C. Boldyreff, K. Crowston, B. Lundell, & A. I. Wasserman (Eds.), Open source ecosystems: Diverse communities interacting. IFIP Advances in Information and Communication Technology (pp. 47–56). Springer. https://doi.org/10.1007/978-3-642-02032-2_6
Dahlander, L., & Wallin, M. W. (2006). A man on the inside: Unlocking communities as complementary assets. Research Policy, 35(8), 1243–1259. https://doi.org/10.1016/j.respol.2006.09.011.
Lerner, J., & Tirole, J. (2002). Some simple economics of open source. The Journal of Industrial Economics, 50(2), 197–234. https://doi.org/10.1111/1467-6451.00174.
Pitt, L. F., Watson, R. T., Berthon, P., Wynn, D., & Zinkhan, G. (2006). The Penguin’s Window: Corporate brands from an open-source perspective. Journal of the Academy of Marketing Science, 34(2), 115–127. https://doi.org/10.1177/0092070305284972
Nguyen-Duc, A., Cruzes, D. S., Snarby, T., & Abrahamsson, P. (2019). Do software firms collaborate or compete? A model of coopetition in community-initiated OSS projects. e-Informatica (Vol. XIII). https://doi.org/10.5277/e-Inf190102. arXiv:1808.06489 [cs]. Accessed 29 December 2023.
Zhang, Y., Stol, K.-J., Liu, H., & Zhou, M. (2022). Corporate dominance in open source ecosystems: A case study of OpenStack. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE (pp. 1048–1060). Association for Computing Machinery. https://doi.org/10.1145/3540250.3549117. Accessed 18 October 2023.
Germonprez, M., Allen, J. P., Warner, B., Hill, J., & McClements, G. (2013). Open source communities of competitors. Interactions, 20(6), 54–59. https://doi.org/10.1145/2527191.
Linåker, J., Rempel, P., Regnell, B., & Mäder, P. (2016). How firms adapt and interact in open source ecosystems: Analyzing stakeholder influence and collaboration patterns. In M. Daneva & O. Pastor (Eds.), Requirements engineering: Foundation for software quality. Lecture notes in computer science (pp. 63–81). Springer. https://doi.org/10.1007/978-3-319-30282-9_5
Teixeira, J., & Lin, T. (2014). Collaboration in the open-source arena: The WebKit case. In Proceedings of the 52nd ACM conference on Computers and people research—SIGSIM-CPR ’14 (pp. 121–129). https://doi.org/10.1145/2599990.2600009. arXiv:1401.5996. Accessed 21 October 2021.
Zhang, Y., Zhou, M., Stol, K.-J., Wu, J., & Jin, Z. (2020). How do companies collaborate in open source ecosystems? An empirical study of OpenStack. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE) (pp. 1196–1208). ACM. ISSN: 1558-1225.
Easterbrook, S., Singer, J., Storey, M.-A., & Damian, D. (2008). Selecting empirical methods for software engineering research. In F. Shull, J. Singer, & D. I. K. Sjøberg (Eds.), Guide to advanced empirical software engineering (pp. 285–311). Springer. https://doi.org/10.1007/978-1-84800-044-5_11
HuggingFace. (2024). Models—Hugging Face. https://huggingface.co/models. Accessed 19 April 2024.
HuggingFace. (2024). Datasets—Hugging Face Hub. https://huggingface.co/datasets. Accessed 19 April 2024.
HuggingFace. (2024). Spaces—Hugging Face. https://huggingface.co/spaces. Accessed 19 April 2024.
HuggingFace. (2024). Evaluate—Hugging Face. https://huggingface.co/docs/evaluate/index. Accessed 19 April 2024.
Osborne, C. (2024). Python scripts for mining research data from the Hugging Face Hub. https://github.com/ccosborne/hf-hub-mining/tree/main. Accessed 21 April 2024.
Lin, B., Robles, G., & Serebrenik, A. (2017). Developer turnover in global, industrial open source projects: Insights from applying survival analysis. In 2017 IEEE 12th International Conference on Global Software Engineering (ICGSE) (pp. 66–75). IEEE.
Robles, G., & Gonzalez-Barahona, J. (2005). Developer identification methods for integrated data from various sources. In International Conference on Software Engineering: Proceedings of the 2005 International Workshop on Mining Software Repositories: St. Louis, Missouri; 17–17 May 2005 (pp. 1–5). ACM. ISSN: 0163-5948.
Lopez-Fernandez, L. (2004). Applying social network analysis to the information in CVS repositories. In “International Workshop on Mining Software Repositories (MSR 2004)” W17S Workshop—26th International Conference On Software Engineering (Vol. 2004, pp. 101–105). IEE. https://doi.org/10.1049/ic:20040485. Accessed 22 October 2021.
Savić, M., Ivanović, M., & Jain, L. C. (2019). Complex networks in software, knowledge, and social systems. In Intelligent Systems Reference Library (Vol. 148). Springer. https://doi.org/10.1007/978-3-319-91196-0. Accessed 21 October 2019.
Goeminne, M., & Mens, T. (2013). A comparison of identity merge algorithms for software repositories. Science of Computer Programming, 78(8), 971–986.
Kouters, E., Vasilescu, B., Serebrenik, A., & Brand, M. G. J. (2012). Who’s who in Gnome: Using LSA to merge software repository identities. In 2012 28th IEEE International Conference on Software Maintenance (ICSM) (pp. 592–595). IEEE. ISSN: 1063-6773.
McKnight, P. E., & Najab, J. (2010). Mann–Whitney U test. In The Corsini Encyclopedia of Psychology (p. 1). Wiley. https://doi.org/10.1002/9780470479216.corpsy0524. Accessed 16 May 2024.
HuggingFace. (2024). Meta Llama models on the HF Hub. https://huggingface.co/meta-llama. Accessed 23 April 2024.
HuggingFace. (2024). Mistral AI models on the HF Hub. https://huggingface.co/mistralai. Accessed 23 April 2024.
HuggingFace. (2024). OpenAI models on the HF Hub. https://huggingface.co/openai. Accessed 23 April 2024.
Seger, E., Ovadya, A., Garfinkel, B., Siddarth, D., & Dafoe, A. (2023). Democratising AI: Multiple meanings, goals, and methods. https://doi.org/10.48550/arXiv.2303.12642. arXiv:2303.12642 [cs]. Accessed 23 March 2023.
CHAOSS. (2024). Community health analytics in open source software. https://chaoss.community/. Accessed 1 May 2024.
GitHub. (2024). GitHub innovation graph. https://innovationgraph.github.com/. Accessed 6 May 2024.
Daigle, K. (2023). Octoverse: The state of open source and rise of AI in 2023. https://github.blog/2023-11-08-the-state-of-open-source-and-ai/. Accessed 6 May 2024.
Hardy, M. (2023). Should we use open source licenses for ML/AI models?. https://opensource.org/deepdive/webinars/should-we-use-open-source-licenses-for-ml-ai-models/. Accessed 2 November 2023.
Weaver, O. (2020). Beware: Over half of the GitHub public repositories are not open source licensed!. https://openweaver.medium.com/beware-over-half-of-the-github-public-repositories-are-not-open-source-licensed-23c7d2b5b621. Accessed 2 November 2023.
Runeson, P., & Höst, M. (2008). Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering: An International Journal, 14(2), 131–164. https://doi.org/10.1007/s10664-008-9102-8
Amreen, S., Mockus, A., Zaretzki, R., Bogart, C., & Zhang, Y. (2020). ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems. Empirical Software Engineering: An International Journal, 25(2), 1136–1167.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
Batagelj, V., & Zaversnik, M. (2003). An O(m) algorithm for cores decomposition of networks. https://doi.org/10.48550/arXiv.cs/0310049. arXiv:cs/0310049. Accessed 5 October 2023.
Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577–8582. https://doi.org/10.1073/pnas.0601602103.
Clauset, A., Newman, M. E. J., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111. https://doi.org/10.1103/PhysRevE.70.066111. arXiv:cond-mat/0408187.
NetworkX. (2023). Density—NetworkX 3.1 documentation. https://networkx.org/documentation/stable/reference/generated/networkx.classes.function.density.html#density. Accessed 5 October 2023.
NetworkX. (2023). Reciprocity—NetworkX 3.1 documentation. https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.reciprocity.reciprocity.html#networkx.algorithms.reciprocity.reciprocity. Accessed 5 October 2023.
Zhou, S., & Mondragon, R. J. (2004). The rich-club phenomenon in the Internet topology. IEEE Communications Letters, 8(3), 180–182. https://doi.org/10.1109/LCOMM.2004.823426
McAuley, J. J., Costa, L. D. F., & Caetano, T. S. (2007). The rich-club phenomenon across complex network hierarchies. Applied Physics Letters, 91(8), 084103. https://doi.org/10.1063/1.2773951. arXiv:physics/0701290.
Smilkov, D., & Kocarev, L. (2010). Rich-club and page-club coefficients for directed graphs. Physica A: Statistical Mechanics and its Applications, 389(11), 2290–2299. https://doi.org/10.1016/j.physa.2010.02.001.
Newman, M. E. J. (2002). Assortative mixing in networks. Physical Review Letters, 89(20), 208701. https://doi.org/10.1103/PhysRevLett.89.208701
Saramäki, J., Kivelä, M., Onnela, J.-P., Kaski, K., & Kertész, J. (2007). Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 75(2), 027105. https://doi.org/10.1103/PhysRevE.75.027105
Acknowledgements
The authors would like to thank Loubna Ben Allal, Daniel van Strien, Peter Cihon, Mer Joyce, Stefano Maffulli, Matt White, Seb Elmes, David Gray Widder, Alek Tarkowski, Johan Linåker, Sean P. Goggins, and the reviewers at the Journal of Computational Social Science for their generous feedback on previous versions of this manuscript.
Funding
Cailean Osborne was supported by the Economic and Social Research Council Grant for Digital Social Science [ES/P000649/1]; Hannah Rose Kirk was supported by the Economic and Social Research Council Grant for Digital Social Science [ES/P000649/1]. Jennifer Ding was supported by the Ecosystem Leadership Award under the Engineering and Physical Sciences Research Council Grant [EP/X03870X/1] & the Alan Turing Institute.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Definition of network properties
See Table 1.
Appendix 2: Summary statistics of development activity in repositories
Appendix 3: Mann–Whitney U tests for activity in model repositories
See Table 5.
Appendix 4: Social network structure of collaboration on HF Hub
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Osborne, C., Ding, J. & Kirk, H.R. The AI community building the future? A quantitative analysis of development activity on Hugging Face Hub. J Comput Soc Sc (2024). https://doi.org/10.1007/s42001-024-00300-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s42001-024-00300-8