In the following, we outline the data acquisition procedure, describe the dataset’s components in detail, analyze basic statistical properties of the dataset, provide download links, and refer to sample Python code, which is also available for download. Please note that the LFM-1b dataset is considered a derivative work according to paragraph 4.1 of Last.fm’s API Terms of Service (footnote 11).
Data acquisition
We first use the 250 overall top tags (footnote 12) to gather their top artists (footnote 13) via the Last.fm API. For these artists, we fetch the top fans, which results in 465,000 active users. For a randomly chosen subset of 120,322 users, we then obtain their listening histories (footnote 14). For approximately 5,000 users, we cap the fetched listening histories at 20,000 listening events to avoid an extraordinarily uneven user distribution (cf. Sect. 3.3), in which a few users account for an enormous number of listening events. We define a listening event as a quintuple specified by user, artist, album, track, and timestamp. The data were fetched between January 2013 and August 2014.
Dataset availability and content
The LFM-1b dataset of approximately 8 GB can be downloaded from www.cp.jku.at/datasets/LFM-1b. For ease of access and compatibility, the metadata on artists, albums, tracks, users, and listening events are stored in simple text files, encoded in UTF-8, while the user-artist-playcount matrix is provided as a sparse matrix in a Matlab file, which complies with the HDF5 format. This makes the matrix accessible from a wide range of programming languages. For instance, Python code for data import is provided along with the dataset (cf. Sect. 3.4).
Table 1 gives an overview of the dataset’s content, in particular the included files and respective pieces of information. Keys that are linked to each other are depicted in the same emphasis. Files LFM-1b_artists.txt, LFM-1b_albums.txt, and LFM-1b_tracks.txt contain the metadata for artists, albums, and tracks, respectively. File LFM-1b_LEs.txt contains all listening events, described by user, artist, album, and track identifiers. Each event additionally carries a timestamp, which is encoded in Unix time, i.e., seconds since January 1, 1970 (UTC). File LFM-1b_LEs.mat contains the user-artist-playcount matrix (UAM) as a Matlab file in HDF5 format. It comprises three items: (i) a 120,175-dimensional vector (idx_users), each element of which links to the user-ids in files LFM-1b_users.txt, LFM-1b_LEs.txt, and LFM-1b_users_additional.txt, (ii) a 585,095-dimensional vector (idx_artists), whose elements link to the artist-ids in LFM-1b_LEs.txt and the metadata files, and (iii) a \(120{,}175 \times 585{,}095\) sparse matrix (LEs), whose rows correspond to users and columns to artists. User-specific information is given in LFM-1b_users.txt and LFM-1b_users_additional.txt. While the former contains basic demographic information as well as overall playcount and date of registration with Last.fm, the latter provides 43 additional user descriptors that represent a unique feature of LFM-1b. Table 2 describes these user features, which are particularly valuable when creating user-aware music recommender systems.
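As a minimal sketch of how one line of LFM-1b_LEs.txt might be decoded into the listening-event quintuple: the field order follows the description above, but the tab separator, the integer identifiers, the `ListeningEvent` type, and the `parse_listening_event` helper are assumptions made for illustration and are not taken from the dataset's code package.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ListeningEvent(NamedTuple):
    """One listening event: user, artist, album, track, timestamp."""
    user_id: int
    artist_id: int
    album_id: int
    track_id: int
    timestamp: int  # Unix time: seconds since 1970-01-01 00:00:00 UTC

    def when(self) -> datetime:
        """Return the timestamp as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.timestamp, tz=timezone.utc)


def parse_listening_event(line: str, sep: str = "\t") -> ListeningEvent:
    """Parse one line of five integer fields (the separator is an assumption)."""
    user_id, artist_id, album_id, track_id, timestamp = line.rstrip("\n").split(sep)
    return ListeningEvent(int(user_id), int(artist_id), int(album_id),
                          int(track_id), int(timestamp))


# Made-up example line; the identifiers are not taken from the dataset.
event = parse_listening_event("42\t7\t1001\t50012\t1388534400\n")
print(event.when().isoformat())
```

Keeping the timestamp as an integer and converting only on demand avoids storing a datetime object for each of the roughly one billion events.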
Dataset statistics
Table 3 shows basic statistics of the dataset’s composition. The number of unique <user, artist> pairs corresponds to the number of entries in the UAM, which is a \(120{,}175 \times 585{,}095\) sparse matrix. Note that these dimensions are smaller than the total numbers of unique users and artists reported in Table 3 since we discarded users who listened to fewer than 10 unique artists and artists listened to by fewer than 10 users when creating the UAM. We assume that data about these artists and users are too sparse to be informative or contain just noise. In particular, this approach effectively filters artists that are misspelled, which is evidenced by the substantial reduction in their number by \(81.66\%\) (from 3,190,371 to 585,095). The reduction in terms of users is much smaller (by \(0.12\%\), from 120,322 to 120,175), because users with such a narrow artist taste are almost nonexistent on Last.fm. This filtering step yields a UAM that is easily manageable with today’s computers (approximately 200 MB).
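The filtering rule described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the UAM: the text does not say whether the two thresholds were applied in a single pass or repeatedly, so the hypothetical `filter_playcounts` helper evaluates both on the unfiltered data, and it works on a plain dictionary rather than a sparse matrix. The hypothetical `reduction_pct` helper recomputes relative reductions from the counts quoted in the text.

```python
from collections import defaultdict


def filter_playcounts(playcounts, min_artists=10, min_users=10):
    """Drop sparse users and artists from a {(user, artist): count} mapping.

    Single-pass variant (an assumption): both thresholds are evaluated
    on the unfiltered data.
    """
    artists_of = defaultdict(set)
    users_of = defaultdict(set)
    for (user, artist), count in playcounts.items():
        if count > 0:
            artists_of[user].add(artist)
            users_of[artist].add(user)
    keep_users = {u for u, a in artists_of.items() if len(a) >= min_artists}
    keep_artists = {a for a, u in users_of.items() if len(u) >= min_users}
    return {(u, a): c for (u, a), c in playcounts.items()
            if u in keep_users and a in keep_artists}


def reduction_pct(before, after):
    """Relative reduction in percent, rounded to two decimals."""
    return round(100.0 * (before - after) / before, 2)


print(reduction_pct(3_190_371, 585_095))  # artists before and after filtering
print(reduction_pct(120_322, 120_175))    # users before and after filtering
```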
Table 3 Statistics of items in the dataset
Table 4 Statistics on country distribution of users. All countries with more than 1000 users are shown
In the following, we present a more detailed analysis of the demographic coverage, distribution of listening events, and features related to music preference and consumption behavior.
Demographics
We compute and illustrate the distribution of users across country, age, and gender. Table 4 shows the countries from which most users in the dataset originate. We include all countries with more than 1,000 users. As can be seen, a majority of users do not provide country information (\(54.13\%\)). The country-specific percentages in the last column of the table are computed only among those users who provide their country. The distribution of users in the dataset reflects that of Last.fm users in general.
A histogram illustrating the age distribution is given in Fig. 1. Among all users, only \(38.31\%\) provide this piece of information. The age distribution is quite uneven and skewed toward the right (higher ages), but reflects the composition of Last.fm users. In addition, we can spot some seemingly erroneous information: 165 users indicated an age of 6 years or less, and 149 indicated an age of at least 100 years. However, these users represent only \(0.26\%\) of all users in the dataset. The age distribution has an arithmetic mean of 25.4 years, a standard deviation of 9.7, a median of 23, and 25th and 75th percentiles of 20 and 28 years, respectively.
Table 5 depicts the gender distribution of users in the dataset. Among those who provide this information, more than two thirds are male and less than one third are female. The larger share of male users on Last.fm is a known fact. The number of users who provide information on their gender (64,551 or 53.6%) is very close to the number of users who provide country information (65,132 or 54.1%), and considerably higher than the number of users who indicate their age (46,095 or 38.3%). Therefore, users seem to be highly reluctant to reveal their age.
Table 5 Statistics on gender distribution of users
Listening events
To gain an understanding of the distribution of listening events in the dataset, Figs. 2 and 3 illustrate the sorted number of listening events for all artists and for all users, respectively, plotted as red lines. The blue plots indicate the number of listeners each artist has (Fig. 2) and the number of artists each user listens to (Fig. 3). The axes in both figures are logarithmically scaled.
From Fig. 2, we observe that especially in the range of artists with extraordinarily high playcounts (left side of the figure), the playcount decreases considerably faster than the number of listeners. For instance, the top-played artist is on average listened to 78.92 times per user, while the 1,000th most popular artist is listened to only 22.66 times per user, on average. At the other end, the 100,000 least popular artists are played only 1.99 times on average. This provides strong evidence of the “long tail” of artists [3].
From Fig. 3, we see that highly active listeners (in the left half of the figure) tend to have a rather stable relationship between total playcounts and number of artists listened to, whereas the average number of playcounts per artist strongly decreases for less active listeners. Indeed, the 1,000 most active listeners aggregate on average 29.73 listening events per artist, while for the 1,000 least active listeners, this number is only 3.04. Therefore, highly active users tend to listen to tracks by the same artists over and over again, while occasional listeners tend to play only a few tracks by their preferred artists. Furthermore, we can observe in Fig. 3 the considerable number of users for whom we recorded approximately 20,000 listening events, for the reasons given in Sect. 3.1.
Table 6 shows additional statistics of the listening event distribution, both from a user and an artist perspective (second and third column, respectively). The first row shows the average number and standard deviation of playcounts, per user and per artist, computed from the values of the red plots in Figs. 2 and 3. The second row shows the average number of unique artists per user (second column) and the average number of unique users per artist (third column). These numbers are computed from the blue lines in the figures. The third row reveals how often, on average, users play artists they listen to (second column) and how often artists are listened to by users who listen to them at all, on average (third column). The last row is similar to the third one, but uses the median instead of the arithmetic mean to aggregate average playcounts. It shows that there exist strong outliers in the average playcount values, both per user and per artist, because the median values are much smaller than the mean values. For instance, users listen to each of their artists on average about 21 times, but half of all users listen to each of their artists on average only five times or less. Therefore, there are a few users who keep on listening to their artists over and over again, while a large majority do not listen to the same artist more than a few times, on average.
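The aggregates described for Table 6 can be illustrated with a small sketch. It assumes the playcounts are available as a {(user, artist): count} dictionary, and the function name `listening_stats` and its result keys are hypothetical. The toy data are invented to show how a single heavy listener pulls the mean of the average playcounts far above the median.

```python
from collections import defaultdict
from statistics import mean, median


def listening_stats(playcounts, by="user"):
    """Aggregate a {(user, artist): count} mapping per user or per artist."""
    index = 0 if by == "user" else 1
    groups = defaultdict(list)
    for key, count in playcounts.items():
        groups[key[index]].append(count)
    averages = [mean(counts) for counts in groups.values()]
    return {
        "mean_playcount": mean(sum(c) for c in groups.values()),
        "mean_unique_partners": mean(len(c) for c in groups.values()),
        "mean_of_avg_playcount": mean(averages),
        "median_of_avg_playcount": median(averages),
    }


# Invented toy data: u3 replays the same artists far more often than u1 and u2.
toy = {("u1", "a1"): 1, ("u1", "a2"): 1,
       ("u2", "a1"): 2,
       ("u3", "a1"): 100, ("u3", "a2"): 50}
per_user = listening_stats(toy, by="user")
per_artist = listening_stats(toy, by="artist")
print(per_user["mean_of_avg_playcount"], per_user["median_of_avg_playcount"])
```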
Table 6 Statistics of the distribution of listening events among users and artists
Descriptors of preference and consumption behavior
The LFM-1b dataset provides a number of additional user-specific features (cf. Table 2), in particular information about temporal listening habits and music preference in terms of mainstreaminess and novelty [13]. To characterize temporal aspects, we binned the listening events of each user into weekdays and into hours of the day, and computed the share of each user’s listening events over the bins. The distribution of these shares is illustrated in Fig. 4 for weekdays and in Fig. 5 for hours of the day. These box plots illustrate the median of the data by a horizontal red line. The lower and upper horizontal black lines of the box indicate the 25- and 75-percentiles, respectively. The horizontal black lines further above or below represent the furthest points not considered outliers, i.e., points within 1.5 times the interquartile range. Points beyond this range are depicted as blue plus signs. The red squares illustrate the arithmetic mean.
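The binning into weekdays and hours of the day can be sketched as below. The function name `listening_shares` is hypothetical, and the sketch interprets the Unix timestamps in UTC, which is an assumption: the text does not state whether the published features account for each user's local time zone.

```python
from collections import Counter
from datetime import datetime, timezone


def listening_shares(timestamps):
    """Return (weekday_shares, hour_shares) for one user's Unix timestamps.

    weekday_shares has 7 entries (Monday = 0); hour_shares has 24 entries.
    Timestamps are interpreted in UTC (an assumption).
    """
    if not timestamps:
        raise ValueError("at least one listening event is required")
    moments = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]
    by_weekday = Counter(m.weekday() for m in moments)
    by_hour = Counter(m.hour for m in moments)
    n = len(moments)
    return ([by_weekday[d] / n for d in range(7)],
            [by_hour[h] / n for h in range(24)])


# Invented example: two events on a Wednesday, two on the following Saturday.
base = 1388534400  # 2014-01-01 00:00:00 UTC, a Wednesday
weekday_shares, hour_shares = listening_shares(
    [base, base + 3600, base + 3 * 86400, base + 3 * 86400 + 17 * 3600])
print(weekday_shares)
```

Computing shares per user, rather than raw counts, keeps highly active listeners from dominating the box plots.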
We can observe in Fig. 4 that the share of listening events does not differ substantially between working days. On weekends (Saturday and Sunday), however, there is a much larger spread. A majority of people listen less on weekends than on working days (lower median). At the same time, the top \(25\%\) of active listeners consume much more music on weekends (higher 75th percentile for Saturday, and even higher for Sunday). This most likely reflects working and leisure habits.
In Fig. 5, we see that the distribution of listening events varies more over hours of the day than over weekdays. It is particularly low during early morning hours (between 4 and 7 h) and peaks in the afternoon and early evening (between 17 and 22 h), when many people indulge in leisure time activities.
To compute the listener scores for novelty and mainstreaminess, we follow the approach presented in [13]. For novelty, we split user u’s listening history into time windows of fixed length and calculate the percentage of new items listened to, i.e., items appearing for the first time in u’s listening history. The novelty \(N_{ut}\) of u’s listening events in time window t is defined as \(N_{ut} = \frac{|\{l \in L_{ut} \; \wedge \; l \notin L_{ux} \; \forall x<t \}|}{| L_{ut} |}\), where \(L_{ut}\) is the entirety of items u listened to in time window t, including duplicates, and the condition \(l \notin L_{ux} \; \forall x<t\) selects the items that u did not listen to at any time before t. Averaging over all time windows in which user u was active, we obtain u’s overall novelty score \(N_{u}\). In the LFM-1b dataset, we provide novelty scores for time windows of 1, 6, and 12 months.
To quantify the mainstreaminess \(M_{ut}\) of a user u in time window t, we relate u’s distribution of playcounts over artists to the global playcount distribution of all users: \(M_{ut} = \sum _{a \in A}{\sqrt{ \frac{p_{uat}}{p_{ut}} \cdot \frac{p_{at}}{p_{t}}} }\), where \(p_{uat}\) is the number of times user u listens to artist a in time window t, with a ranging over all artists A in the global playcount vector, \(p_{ut}\) and \(p_{at}\) represent the total playcounts of user u and artist a in time window t, respectively, and \(p_t\) denotes the sum of all playcounts in t. We again average over all time windows to compute an aggregate mainstreaminess score \(M_u\) for user u. The scores in the LFM-1b dataset are provided for time windows of 1, 6, and 12 months, as well as on a global scale. The main statistics of the novelty and the mainstreaminess scores (both computed on time windows of 12 months) are given in Table 7. We can see that most users are eager to listen to new music since the average share of new artists listened to every year is approximately \(50\%\). On the other hand, their music taste tends to be quite diverse and far away from the mainstream since the average overlap between the user’s distribution of listening events and the global distribution is only \(5\%\).
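The two scores can be sketched in a few lines, under stated assumptions. For novelty, the sketch counts every listening event in a window whose item did not occur in any earlier window, so repeated plays of a new item within its first window all count as new; this is one reading of the formula, which defines \(L_{ut}\) as including duplicates. For mainstreaminess, the sum has the form of a Bhattacharyya coefficient between the user's and the global playcount distributions, so identical distributions yield 1. The function names and the input layout (lists of items per window, dictionaries of playcounts per artist) are hypothetical.

```python
from math import sqrt


def novelty(windows):
    """Average share of new items over a user's active time windows.

    windows: chronologically ordered lists of items (with duplicates),
    one list per time window; empty windows are skipped as inactive.
    """
    seen = set()
    scores = []
    for items in windows:
        if not items:
            continue
        new_events = sum(1 for item in items if item not in seen)
        scores.append(new_events / len(items))
        seen.update(items)
    if not scores:
        raise ValueError("user has no listening events")
    return sum(scores) / len(scores)


def mainstreaminess(user_counts, global_counts):
    """M_ut for one time window from {artist: playcount} dictionaries.

    Artists the user never played contribute zero, so iterating over the
    user's artists is equivalent to summing over all artists in A.
    """
    p_ut = sum(user_counts.values())
    p_t = sum(global_counts.values())
    return sum(sqrt((p_uat / p_ut) * (global_counts.get(artist, 0) / p_t))
               for artist, p_uat in user_counts.items())


# Invented examples.
n_u = novelty([["a", "a", "b"], ["a", "c"], [], ["c", "d", "d", "d"]])
m_ut = mainstreaminess({"a": 1, "b": 1}, {"a": 1, "b": 1, "c": 2})
print(n_u, m_ut)
```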
Table 7 Statistics of novelty and mainstreaminess scores
Sample source code
To facilitate access to the dataset, we provide Python scripts that show how to load the data and perform simple computations, e.g., basic statistics, as well as how to implement a basic collaborative filtering music recommender. The code package can be found at http://www.cp.jku.at/datasets/LFM-1b. File LFM-1b_stats.py shows how to load the UAM, compute some of the statistics reported in Sect. 3.3, and store them in a text file. Based on this text file, LFM-1b_plot.py demonstrates how to create plots such as those shown in Figs. 2 and 3. In addition, we implement a simple memory-based collaborative filtering approach in LFM-1b_recommend-CF.py, which might serve as a reference implementation and starting point for experimentation with various recommendation models.
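To give an impression of what a memory-based collaborative filtering recommender looks like, the following is a minimal user-based nearest-neighbour sketch. It is not the code of LFM-1b_recommend-CF.py: the cosine similarity on raw playcounts, the similarity-weighted scoring, the function names, and the dictionary-based UAM are all assumptions chosen for brevity.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse {artist: playcount} vectors."""
    dot = sum(count * v[artist] for artist, count in u.items() if artist in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def recommend(uam, target, k=2, n=5):
    """Recommend up to n artists unknown to `target` from its k nearest users."""
    neighbours = sorted(
        ((cosine(uam[target], vec), user)
         for user, vec in uam.items() if user != target),
        reverse=True)[:k]
    scores = {}
    for similarity, user in neighbours:
        for artist, count in uam[user].items():
            if artist not in uam[target]:
                scores[artist] = scores.get(artist, 0.0) + similarity * count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))[:n]


# Invented toy UAM: {user: {artist: playcount}}.
toy_uam = {
    "u1": {"a": 10, "b": 5},
    "u2": {"a": 8, "b": 4, "c": 6},
    "u3": {"d": 9, "e": 1},
    "u4": {"a": 1, "c": 1, "f": 3},
}
print(recommend(toy_uam, "u1", k=2))
```

On the full UAM, a sparse-matrix representation would be preferable to nested dictionaries; the dictionary form is used here only to keep the sketch free of external dependencies.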