Introduction

The internet revolution is changing the way we exchange information. Instead of going to a person to get access to a particular dataset, we now simply download it (Footnote 1). This gain in efficiency comes with increasing anonymity: it becomes increasingly difficult to identify end-users (Morville 2005). The internet revolution has also increased the efficiency with which people can manage their social ties, through community websites and email. However, it is not obvious that people will actually maintain such social ties in response to growing anonymity. If the effort of maintaining these ties is considered too large and/or the benefits too low, it may not happen. Especially in data infrastructures we expect growing anonymity, because:

  1. The social ties (professional, dataset specific, ad-hoc) are weaker than social ties like friendship or kinship;

  2. People and organisations may not yet have become aware of the need to manage these ties more explicitly in the face of growing anonymity;

  3. Users are often scattered over many organisations or different units within a larger organisation; consequently the chances that users of the same dataset meet each other to exchange ideas are limited;

  4. For obvious reasons of privacy and bureaucracy personal details are often not registered at the dataset level, in which case it is impossible to identify users by following the data flow.

Such a set-up in which end-users are largely unknown potentially has some serious disadvantages:

  1. Quality control: Producers cannot improve the quality of their product by asking their end-users for feedback if they don’t know their end-users;

  2. Justification of funding: If we don’t know who is using a dataset then we also don’t know how many people are actually using it. Justifying funding to budget managers becomes problematic in this light;

  3. Data management: Data managers are responsible for a complete coverage of relevant datasets for their organisation. Not knowing the end-users (and their information needs) also makes it difficult to anticipate their needs in terms of securing data availability;

  4. Knowledge transfer: Users cannot share interesting ideas, innovations, ancillary data (Footnote 2) and concerns with regard to a specific dataset with each other if they lack links to other users.

Are the consequences of growing anonymity as listed above really a problem? That depends on two factors: (1) the magnitude of these consequences relative to other problems and (2) the degree to which users of a particular dataset exchange feedback through their social network (of users of the same dataset). For example, if there are no concerns about data quality and no perceived benefit in knowledge transfer, then users may feel less need to exchange feedback. For potential users who lack access, data sharing is the key problem. Improved access, albeit with growing anonymity, may well be acceptable to such users. Indeed, in the past decades spatial data infrastructure (SDI) research and policy making has been very much focussed on (enhancing) the flow of spatial data from producer to user (Clinton 1994; Azad and Wiggins 1995; Campbell and Masser 1995; Nedovic-Budic and Pinto 1999; Nedovic-Budic and Pinto 2000; Groot and McLaughlin 2000; Wehn de Montalvo 2003; Williamson et al. 2003; Masser 2005; Harvey and Tulloch 2006; De Vos 2007; Goodchild et al. 2007; Omran and van Etten 2007; Elwood 2008; www.ec-gis.org/inspire). The novelty in volunteered geographic information (VGI) is that it recognises that not only traditional mapping agencies but also individual citizens can have the role of spatial data producer (Budhathoki et al. 2008; Goodchild 2008). However, the emerging research on VGI also concentrates on spatial data (exchange, quality, etcetera) and not on the related issue of sharing feedback. Sharing of spatial data is often impeded by cultural, financial, organisational and technical barriers. Most research and policy making over the past two decades has gone into understanding and then breaking down these barriers.

With progress being made in spatial data sharing, attention is shifting towards how users exchange feedback. This aspect of data sharing has to date not been studied systematically. We identify three directions of feedback: user-to-user, user-to-producer and producer-to-user. Exchange of feedback has traditionally been through direct contact between users/producers, in meetings or via email. Exchanging feedback can also be mediated through social networking media such as Facebook, Flickr, Twitter, Hyves or LinkedIn, or through discussion fora on a product website (Scharl and Tochtermann 2007). To the best of our knowledge such social networking media have to date not been taken up by users for exchanging dataset-specific feedback, nor are such media supported or introduced by producers of spatial data. We therefore focus on direct contact between users/producers. If users cannot name other users (through their social network) then this implies very limited opportunities for exchanging feedback. The key questions addressed in this research are:

  1. what do the social networks of users of a specific dataset look like? and

  2. in what way is feedback exchanged through these networks?

The questions are answered for two particular datasets. For these two datasets we explore the user population through surveys. Survey non-response is also studied and its consequences for research outcomes are discussed.

Materials and methods

Survey procedure and datasets

Survey procedure

At the start of the survey we had an incomplete list of users and no insight into links between users. The snowball procedure (Hanneman and Riddle 2005) is in this case the appropriate method to obtain a map of the social network. One starts with an incomplete list of users. These users are requested to name new users (making use of their specific links in the network, which are at that stage unknown to us researchers). This iterative procedure stops when no more new names are mentioned. The result is ideally a complete map of all links between all users. We started by mining user registrations provided by the producers and their salesperson, using registrations from January 2004 to August 2007. Survey questions Q3, Q7 and Q8 (appendix) were used as name generators (Marsden 2005). In case of non-response a first reminder was sent after 1 month and if necessary a second one after 2 months. Once the snowball came to a halt, we still did not know whether it had given us information on the complete user population. We do not know whether someone who did not respond is or is not a user. We do not know whether all users were mentioned. Non-response from a single contact in a particular organisation would leave the complete user population in that organisation undetected. We present a discussion on the non-response in section “Survey non-response”.
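The iterative logic of the snowball procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as actually administered: the function `snowball`, the callable `name_generator` (standing in for the survey name-generator questions) and the toy names are all hypothetical, and reminders are not modelled.

```python
from collections import deque


def snowball(seed_users, name_generator):
    """Run a snowball until no new names appear.

    seed_users     -- initial, incomplete list of user names
    name_generator -- hypothetical callable: name -> list of names that
                      person mentions, or None if the person does not respond
    Returns (contacted, links, non_respondents).
    """
    contacted = set(seed_users)       # everyone who has been sent the survey
    queue = deque(contacted)          # people still to be surveyed
    links = set()                     # directed links (respondent, named user)
    non_respondents = set()

    while queue:
        person = queue.popleft()
        named = name_generator(person)
        if named is None:             # non-response: outgoing links stay unknown
            non_respondents.add(person)
            continue
        for other in named:
            links.add((person, other))
            if other not in contacted:    # a new name: survey this person too
                contacted.add(other)
                queue.append(other)

    return contacted, links, non_respondents


# Toy data, invented purely for illustration.
answers = {
    "producer": ["ann", "bob"],
    "ann": ["bob", "carol"],
    "bob": None,        # did not respond
    "carol": [],        # responded, named nobody
    "dave": ["erin"],   # a user nobody in the chain mentions
}
contacted, links, non_respondents = snowball(["producer"], answers.get)
```

In this toy run "dave" is never contacted because nobody reachable from the seed list names him, which mirrors the completeness problem noted above: the snowball halts without any guarantee that the whole user population has been found.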

Social network analysis

We are aware of a whole body of literature on quantitative methods in social network analysis (e.g. Carrington et al. 2005; Hanneman and Riddle 2005). Our choice was to use them only very sparingly because the networks that we discovered were small and clear enough for visual interpretation.

Geo-datasets

The two datasets analysed are actually two product families. LGN is the national land cover database of the Netherlands. The first dataset (LGN1) was delivered in 1986 and updates have since appeared every 3–5 years. LGN is a 25 m resolution raster dataset with 5 main and 39 sub land cover classes. LGN is used in a wide range of environmental applications. User conferences have been organised to receive feedback from users. There is extensive documentation (Thunnissen et al. 1992a, b; Hazeu 2005) and there are several scientific publications (de Bruin et al. 2004; de Wit and Clevers 2004; van Oort et al. 2004). Metadata is delivered to users when they buy the dataset and is available from www.lgn.nl.

HGN is the historical land cover database of the Netherlands. It is a product family, comprising the following datasets: HGN1900, HGN1960 and HGN1990. They are scanned historical topographical maps (georeferenced, 1850–1935) and more recent topographical maps (1:25,000, 1940–1994). The first HGN products were delivered in 2004. In comparison with LGN, the user community is smaller. There have been no user surveys or conferences thus far. People or organisations buying HGN receive a report describing the production method in detail. For further details we refer to Knol (2003, 2004) and www.hgnnederland.nl.

Charged data

LGN and HGN are charged datasets, i.e. individual users or organisations pay for access. One implication of this was that for both datasets the salesperson had kept a record of buyers, which was convenient in getting the snowball procedure started. For spatial data distributed free of charge over the internet with no form of registration, collecting an initial list of users can be more difficult, though not impossible.

Charged data implies a risk of non-response from illegal users. We included in the introduction letter to the survey a subsection raising this issue and giving respondents four instructions: (1) If you are uncertain about whether or not you are an illegal user you might first check internally with your legal officer; (2) major users (such as provinces and ministries and most waterboards) all have a license at organisation level so all their employees are legal users; (3) in some cases private companies get a project license: use of LGN beyond the duration of the project is illegal (LGN is to be deleted when the project ends). If this applies to you and if you had at any point in time such a project license then you can safely fill in; (4) of course you can always choose not to fill in your contact details. With these instructions and under these conditions (major users are all legal users) illegal use was not considered a major issue. It is, however, a relevant point of consideration when replicating this research for other charged datasets. In surveys like this, being open and clear on this issue towards respondents is essential.

User definition

The definition of the term “user” determines the extent of the user population. Many decision makers use reports based on data without being aware of the underlying data (Masser et al. 2008). We refer to this group as unaware users. Unaware users were excluded from the survey. We classified our respondents according to how frequently they produce images/graphs/tables for unaware users (appendix, survey question Q4). We identified 4 user roles (Q2) and acknowledged that one person can have multiple roles:

  1. Intermediary = data manager, buyer/salesperson;

  2. Direct user = I work directly with the geo-dataset;

  3. Indirect user = I do not work directly with the geo-dataset. But I do know the dataset and it is used in processes in which I am involved;

  4. Ex-user = I did at one time use the dataset but not during the past year.

Results

Response

From September 2007 to February 2008 we compiled a list of 339 email addresses of people named as LGN users and 91 names of people who might be HGN users. Table 1 shows their status. We use the term “real” user for all people who did respond and who are not ex-users. According to this definition we have 94 (LGN) and 36 (HGN) real users. Non-response is discussed in section “Survey non-response”.

Table 1 Response status

Results indicate that both datasets are being used in many different organisations (Table 2) and that many people in our list did not respond (LGN response rate: 41%, HGN: 54%; see Footnote 3). Non-response at the organisation level is presented in Table 2: our full list of organisations that possibly host LGN users contained 54 organisations. We got a response from 32 (59%) of these organisations. Both datasets have one single large organisation that hosts ca 45% of all users: Wageningen University and Research Centre. According to Table 2 around 80% of the organisations have three or fewer users and around 65% have only one user.

Table 2 LGN/HGN using organisations

Users

Table 3 describes the user population. LGN has a relatively high percentage of indirect users and HGN has the highest percentage of direct users. LGN users report relatively more frequently to unaware users. This suggests that LGN is more strongly embedded in decision-making processes and/or models (Footnote 4). This could be explained by the age of the datasets: LGN has existed longer, so it has had more time to become embedded. Also, a more specialised product like HGN (historical land cover) may require more time to find its way into applications.

Table 3 Users

Cross-tabulation of the two classifications in Table 3 revealed that the two are uncorrelated. We also checked whether the number of outgoing links differed between respondents according to their user role. When considering all names mentioned (including non-response) one finds both for LGN and HGN that intermediaries, in comparison with other user roles, have more links to other users (Tables 5, 6). However, this difference disappears when accounting for non-response, which is also higher for the intermediaries (section “Accuracy”). If we count only the links to real users (those people who did respond and who are not ex-users) we find no difference between user roles in terms of number of links to other users.

Considering the high number of organisations with fewer than three LGN/HGN users (Table 2), a comparison between individual organisations in terms of user roles was impossible. For LGN the population was large enough to categorise organisations (Table 4). The organisation Wageningen UR, treated as a category of its own, has a fraction of intermediaries of 0.13 as opposed to around 0.33 for most organisation categories. This suggests that this organisation has an effective SDI in place. Possibly this is related to the size of the organisation and the number of GIS users within the organisation. Water boards have relatively many direct users and relatively few indirect users. This is in line with the more operational role of these organisations in Dutch land use planning and water management. Provinces have a more coordinating role, as an intermediary between national policies and implementation at municipal/water board level. The relatively high fractions of intermediaries and indirect users are in line with this role. For the other categories, the number of respondents seems too low to draw conclusions.

Table 4 User roles per organisation category for LGN

User interaction

We studied user interaction in terms of (1) users giving feedback to the producer, (2) whether people needed personal help from others in accessing data and (3) the prime source of metadata: using available metadata versus asking another user. With regard to feedback (1), 34%/42% (LGN/HGN) of the respondents have at one point in time given feedback to the producer. If these respondents give feedback also on behalf of all other users in their organisation then this covers for LGN ca 75% of the user population. Cross-tabulation with user roles revealed that around 50% of the intermediaries and direct users gave feedback, as opposed to 18% of the indirect users. Overall, the results suggest that it is mostly the organisations with a small number (1 or 2) of LGN/HGN users and the indirect users who are not giving feedback to the producer. With regard to help with access (2) we found for both datasets that 52% of the users needed no help with accessing the data. With regard to metadata (3) we found that 79%/89% (LGN/HGN) used written metadata (as opposed to human help) as the prime source of metadata. We think these numbers are high in comparison with what may be found for other datasets, for a number of reasons: the nature of the user population (professional), the state of information systems within their organisation (advanced) and the amount of available written metadata on these two particular datasets, which is extremely high. In the literature, the lack of written metadata as well as of organised user communities for sharing and building up a metadata knowledge base has been noted as a serious problem (Engler and Hall 2007).

Network analysis

Figures 1 and 2 show the social networks of LGN and HGN based on answers to survey questions Q3, Q7 and Q8. Colours are for different organisations, squares are users with the intermediary role and circles are users without the intermediary role. Figures 1, 2 and 3 show only the real users: users who did respond, excluding pure ex-users. Tables 5 and 6 are based on counts of all names generated, including those of people who did not respond.

Fig. 1 Sociogram of the LGN social network, only the real users. Colours represent different organisations. Users with the intermediary role are indicated as squares, others as circles. Except for the blue colour (Wageningen UR), colours and numbers do not correspond with those in Fig. 2. Figure drawn with Netdraw (Borgatti 2002). (Color figure online)

Fig. 2 Sociogram of the HGN social network, only the real users. Colours represent different organisations. Users with the intermediary role are indicated as squares, others as circles. Except for the blue colour (Wageningen UR), colours and numbers do not correspond with those in Fig. 1. Figure drawn with Netdraw (Borgatti 2002). (Color figure online)

Fig. 3 Number of links to other real users (= outgoing links). Recall that we defined “real” users as those who did respond and who are not ex-users

Table 5 Names mentioned by LGN users

Table 6 Names mentioned by HGN users

Similarities in the two networks (Figs. 1, 2):

  • Both networks have two central nodes: the producer (left) and the salesperson (right; see Footnote 5). Together these two have links to 65%/58% (LGN/HGN) of the populations depicted in Figs. 1 and 2. We expect these numbers are exceptionally high when compared with datasets that are distributed free of charge over the internet (Footnote 6);

  • The number of links per user follows a power law distribution (Fig. 3), with few nodes having a high number of links and the majority a very low number of links. Of the respondents 60%/40% (LGN/HGN) mention no other users and 85% mention two or fewer other real users. Albert et al. (2000) have shown that such networks are vulnerable to fragmentation when central nodes are removed. In our case a decision by the producer/salesperson to no longer register users corresponds to a central node removal and would severely fragment the network. Such a decision would make sense in terms of reducing bureaucracy but would at the same time severely limit users’ opportunities to find and interact with other users;

  • Many (46%) of the users have no links to other real users within their own organisation;

  • Most users have no external links (LGN 81%, HGN 50%);

  • In the majority of cases the only external link is between the LGN/HGN using organisation and the producer. There are virtually no links between LGN/HGN users in different organisations;

  • Contrary to expectations, the number of links from intermediaries to real users is not higher than the number of outgoing links of non-intermediaries. This surprising result is further addressed in section “Accuracy”;

  • It may seem strange to find so few linkages internally between users inside a single organisation like Wageningen UR (WUR). However, it is not so strange if we consider that WUR is a very large organisation with a very broad scope of research. The part of the organisation in which people use maps had ca 926 employees at the time of this research. The 94 real LGN users (Table 1) represent only 10% of the organisation; for HGN the percentage is even lower. These people need not work together; they may be active in different scientific disciplines in different sub-units of the organisation.

Notable differences between the two networks:

  • HGN has one extra central node. This is a person who used the HGN producers’ list of users to advertise a report on a topic related to HGN. He has the same number of outgoing links as the other two central nodes, but almost no incoming links (as can be seen from the arrowheads). That is: this person is not mentioned as a user by the other users;

  • Within Wageningen UR there are almost no links between HGN users (only to and from the producer/salesperson). For LGN within Wageningen there are more links between users.
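The fragmentation argument made in this section (a few central nodes hold the network together) can be illustrated with a small sketch. The network below is a toy example invented for illustration, not the LGN or HGN data, and the helper names `out_degrees` and `components` are hypothetical. The sketch counts outgoing links per node and compares the number of connected components before and after removing the two hubs.

```python
from collections import Counter, defaultdict


def out_degrees(nodes, links):
    """Number of outgoing links (names mentioned) per node."""
    counts = Counter(src for src, _ in links)
    return {n: counts.get(n, 0) for n in nodes}


def components(nodes, links):
    """Connected components among `nodes`, ignoring link direction."""
    neighbours = defaultdict(set)
    for a, b in links:
        if a in nodes and b in nodes:     # drop links to removed nodes
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            group.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        result.append(group)
    return result


# Toy network with two hubs, invented for illustration only.
nodes = {"producer", "sales", "u1", "u2", "u3", "u4", "u5"}
links = {
    ("producer", "u1"), ("producer", "u2"), ("producer", "u3"),
    ("producer", "sales"),
    ("sales", "u3"), ("sales", "u4"), ("sales", "u5"),
    ("u1", "u2"),
}

degree_histogram = Counter(out_degrees(nodes, links).values())
before = components(nodes, links)
after = components(nodes - {"producer", "sales"}, links)
```

In this toy case the network is a single component with the hubs and falls apart into four components without them, while most nodes have no outgoing links at all. The same comparison could be run on a real edge list to quantify how much a network depends on its central nodes.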

Survey non-response

For the large fraction of people who did not respond (Table 1) we simply do not know whether they are or are not real users. Accuracy, recall and non-response are important methodological concerns in social network analysis (Brewer 2000; Marsden 2005). We will look into two extreme possibilities:

  • Accuracy: assume that all non-respondents are false recalls, i.e. the non-respondents are not real users. This means that our respondents often replied inaccurately to our request for names;

  • Completeness: assume that all persons in our database are real users. In that case non-response results in an incomplete map of the social network. If a real user is not recalled by any of the users in our network we will remain unaware of the existence of this user. Especially where there are few social ties (as in Figs. 1, 2), poor recall can contribute to an incomplete mapping of the network.

We explore these two extremes in the following subsections.

Accuracy

Assuming that those who did not respond are not real users, we calculated for each respondent his/her accuracy. Imagine respondent A who mentioned 4 names. To these 4 we also sent our survey and only 1 responded. In that case the accuracy is set to 1/4 = 25%. The rightmost column in Tables 5 and 6 shows the median of all individual respondents’ accuracies. Table 5 shows that intermediaries mention more names than other users (LGN median (μ) 5 vs. 2), but their accuracy is also lower (LGN median μ(%) 19% vs. 50%). Consequently, the number of links to real users is not much different between the user roles. Similarly for HGN (Table 6) we find that intermediaries generate more names and have a lower accuracy.
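The accuracy measure described above can be written out as a short calculation. The sketch below uses invented toy data and hypothetical names (`accuracy`, `names_given`, `responded`). How respondents who mentioned no names at all were treated is not stated in the text, so leaving their accuracy undefined and excluding them from the median is an assumption.

```python
from statistics import median


def accuracy(named, responded):
    """Share of the names given by one respondent that turned out to respond."""
    if not named:
        return None               # assumption: undefined when no names were given
    hits = sum(1 for name in named if name in responded)
    return hits / len(named)


# Toy data, invented for illustration only.
responded = {"A", "B", "C", "p1"}
names_given = {
    "A": ["p1", "p2", "p3", "p4"],   # the example from the text: 1 of 4 responded
    "B": ["A", "C"],                 # both named people responded
    "C": [],                         # named nobody
}

per_respondent = {r: accuracy(n, responded) for r, n in names_given.items()}
defined = [a for a in per_respondent.values() if a is not None]
median_accuracy = median(defined)
```

Respondent A reproduces the 1/4 = 25% example from the text. The median is then taken over respondents with a defined accuracy, which corresponds to the rightmost column of Tables 5 and 6 only under the assumption stated above.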

We hypothesise that the low accuracies are at least partially attributable to the level of recall: organisation (or unit within) versus individual (cf. Hansen 1999). While the people who actually use LGN/HGN may change (change of job or tasks) the organisation (unit) may be a constant user. In that case it makes more sense for users to recall the organisation (or unit) and less sense to recall the individuals. This hypothesis is supported by closer analysis of the results of our request for names (Table 7): despite the explicit request for personal details, many respondents returned names of organisations (or units within). Another eligible unit of analysis could be models that use LGN/HGN as input. Discussion of our survey results with the user community suggested that people discuss with each other and recall what models they are working on, without exchanging specifics on input data. Just like organisational units, models may act as an entry point to identifying user communities that we shall pursue in further research.

Table 7 Request for names of other users

Completeness

The other extreme possibility is that all those who did not respond are in fact real users. Under this assumption, we have captured only 59%/67% of the organisations using LGN/HGN (Table 2) and only 41%/54% of the people using LGN/HGN. Results, however, may be even worse. A detailed list of the response was reviewed by the data producers and they pointed to certain organisations that were missing from the list and organisations for which a higher number of users was expected. The general methodological problems in identifying users as noted in the introduction of this paper remain: without any lead into an organisation, or with just one lead who cannot or will not cooperate, finding more users is like looking for a needle in a haystack. For the network topology (Figs. 1, 2, 3) that we found, with very few linkages to users, we are very much dependent on individual response. Networks with a more homogeneous topology (Albert et al. 2000) are less sensitive to non-response.

Figures 1 and 2 suggest that if users would like to contact other users, then finding them is almost impossible without help from the central nodes. In fact, such an opportunity arose shortly before the survey started. In June 2007, an LGN user conference was organised. One would expect that if users saw a need or benefit in user interaction, then this would be their chance. However, in our survey results, we found that apart from one exception conference attendees gave no names or only a fraction of the names of other conference attendees. The single exception was the LGN producer who organised the conference and did clearly recognise the importance of feedback from the users. It is unclear at this stage why other users did not recall names of other conference attendees when filling in our survey. Possibly users rely on the producer in continuing this role of mediator so that no additional effort is required from the users.

Discussion

We studied three categories of linkages between users: (1) producer to user, (2) user to producer and (3) user to user. The energy of SDI programmes (Clinton 1994; Groot and McLaughlin 2000; Williamson et al. 2003; Masser 2005; www.ec-gis.org/inspire) seems to be going mainly to increasing the efficiency and effectiveness of (1), with little interest in the role of users in SDIs (Budhathoki et al. 2008). SDI technologies, standards and policies are improving the efficiency of the transfer of data/metadata, so that users find it increasingly easy to find and access data and metadata. Progress in these fields in the Dutch spatial data infrastructure (VROM 2008) is reflected in our results, where a large fraction of users required no help in accessing data (52%) or metadata (around 80%). The flow from user to producer is also present: around 35% of the users did at one point in time give feedback to the data producer; if they did so on behalf of other staff in their organisation, this covers 75% of the user population. This is probably quite high in comparison with datasets that are exchanged freely over the internet.

What is really absent is user-to-user interaction. For the two datasets we found that, discarding links to the producer, around 50% cannot name a single other user; the majority of users (88%) can only directly name two or fewer other real users. There were virtually no links between users in different organisations. Users often have an idea of which organisations or sub-departments are likely users, but lack direct links to other users at the dataset level. For geo-datasets distributed freely over the world wide web we anticipate it to be even worse, with users having virtually no ties to other users of the same dataset, nor opportunities to identify other users. Technology could help in this respect but is currently not used for this purpose.

Apart from the study by Omran and van Etten (2007), there are no other social network analyses specifically for geo-data with which we can compare our results. There are two marked differences between our work and the study by Omran and van Etten:

  1. Their study was focused on understanding motivations for data sharing and how this was related to network topology. We were interested in how the network can be used for the more innocent purpose of sharing metadata, requests for help, feedback on product quality, innovative ideas, and so on. Also, we took a greater interest in sharing across organisational boundaries, including more organisations in our analysis.

  2. The people who handle the external links are different: only indirect users (Omran and van Etten) vs. mostly intermediaries and direct users (Tables 5, 6). Possibly this difference can be explained by organisational and cultural variables.

Understanding the almost inevitable non-response is an important topic in social network analysis (Brewer 2000; Marsden 2005). To our knowledge the work presented here is the first social network analysis on spatial data infrastructures. As well as being empirically interesting, it leads to considerations for future improvement. Possibly data acquisition can be refined by taking into account that users seem to recall at the organisational level or at the level of models rather than datasets.

Conclusions and further research

The internet revolution and evolving data infrastructures result in more efficient data sharing. Without additional efforts to strengthen feedback (user-to-user, user-to-producer and producer-to-user), these developments will result in greater anonymity and increased difficulties in finding out who is actually using a particular dataset. As a consequence, sharing of feedback, concerns, ancillary data and innovations among users and from users to producer may be impeded. SDI development to date has mostly been driven by users’ demand and producers’ efforts to make spatial data more accessible. Additional technologies (like social networking software or community websites) serving to facilitate the exchange of feedback between users have, to the best of our knowledge, not been taken up. Lacking such support, the only thing users can do is fall back on their social network (of users of the same dataset).

We have analysed the social networks of two datasets commonly exchanged through the Dutch national SDI. We found that the majority of users have 0, 1 or 2 links to other users of the same dataset and that there are virtually no links between users in different organisations. This seems very low and the intriguing question is why. Is it an artefact of inaccurate or incomplete response? We have presented suggestions for improvement in this paper. Are users unaware of the possible benefits of more user interaction? Or are users aware but quite happy with the current set-up? Or is the number of links low because no-one is taking the lead in promoting exchange of feedback (Footnote 7)? Understanding motivations and impediments for exchange of feedback is a major objective for further research that follows from the outcomes of our research.

The work presented in this paper can contribute to methodological research on monitoring SDI programmes (Crompvoets et al. 2007; Georgiadou et al. 2006; Harvey and Tulloch 2006; Masser 2006). Working with a clearly demarcated population (the common denominator is the use of a particular dataset) makes the study reproducible and repeatable, thus fit for monitoring purposes. At the same time, the population definition allows individual users to exit and enter, which is in line with the dynamic nature of SDIs and organisations. Furthermore working with a larger population reduces the risk that results are affected by bias in the response of one or few informants. This is relevant in the field of data infrastructures because there may be concerns about the independence of informants (Rhind 2000). A survey such as presented here reduces the risk of bias and focuses on the actual users of data.

For further research we recommend mapping more social networks for datasets shared through data infrastructures. We hypothesise that the network topology will differ depending on cultural and organisational factors (De Vos 2007; Omran and van Etten 2007; Harvey and Tulloch 2006) and that it will differ depending on whether the data are shared completely anonymously over the world wide web with users all over the world (e.g. as in Engler and Hall 2007) or less anonymously, such as for commercial datasets and/or datasets with the user community confined to the borders of a country. We recommend research into extensions of the current method that may yield a more complete mapping of the social network. For monitoring SDI, we recommend selecting a number of datasets and mapping their networks at regular intervals in time. And finally, we need to find out more about users’ perceived benefits of user-to-user linkages and perceived obstacles to strengthening such linkages.