In this section, we present how event metadata is scraped from the Web, including event homepages and Twitter account statistics. Furthermore, we present a metadata analysis of this data and show which knowledge can be derived from it.
Data collection
The data collection task focuses mainly on event homepages because they are the main source of information about an event. Step 1. Homepages provide unstructured data; therefore, the first step is to scrape and clean the data. Further channels were processed while gathering event metadata, such as crawling WikiCFP, which provides metadata in a well-structured way, and Twitter account statistics. Step 2. Store the data in a format that can be easily processed in large batches and analyzed, i.e., CSV. Step 3. Share the collected data in an accessible way by importing it into OpenResearch.org using its bulk import service. Surprisingly, we found that some important conferences do not archive old editions; for example, SEMANTiCS events before 2013 are not archived. The collected data are fully available online through the OpenResearch.org platform, which also provides LOD features and lets others further improve and enrich our collected data.
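To illustrate Steps 1 and 2, the following is a minimal sketch in Python. The homepage URLs and the title-based extraction are hypothetical placeholders; in practice, each homepage layout requires its own extraction rules.

```python
# Minimal sketch of Steps 1-2: scrape an event homepage and store the
# cleaned metadata as CSV. URLs and the extracted fields are hypothetical
# examples; real homepages each need their own extraction logic.
import csv

import requests
from bs4 import BeautifulSoup

EVENT_PAGES = {  # hypothetical example URLs
    "SEMANTICS2017": "https://2017.semantics.cc/",
    "VLDB2017": "http://www.vldb.org/2017/",
}

def scrape_event(acronym: str, url: str) -> dict:
    """Fetch one homepage and extract a few metadata fields."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Clean whitespace artifacts, which are common in scraped text.
    return {"acronym": acronym, "url": url, "title": " ".join(title.split())}

rows = [scrape_event(acr, url) for acr, url in EVENT_PAGES.items()]

# Step 2: store as CSV so the records can be batch-processed and analyzed.
with open("events.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["acronym", "url", "title"])
    writer.writeheader()
    writer.writerows(rows)
```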
Data analysis
We create metadata-based metrics that allow statements about the quality of the considered events and conclusions about the scholarly communication of the whole community. The selected metrics were chosen by observing successful events, as they provide an indication of event quality. Due to a lack of data, parts of our analysis were not possible for some recent years, such as when studying sponsorship packages for 2018, 2019, and 2020 (see Table 1). In addition, the COVID-19 pandemic that began in early 2020 has affected scholarly communication in general Subramanya et al. (2020), leading, for instance, to the cancellation of SEMANTiCS 2020 or to several events changing from physical to virtual conferences, such as ESWC 2020. Therefore, some metadata, such as keynote speakers, is not available.
In this analysis, we use four personas to represent the needs and interests of different stakeholders of scientific events. A single metric is not meant to fit all personas at once, but to address different interests and requirements of one or more of the personas. As the metrics address individual requirements of a persona, they are meant as a tool to match events to individual needs and interests, not as a global ranking. For each metric, the collected metadata is described first; then, an analysis of this metric based on some event series is presented to test the collected data. Sponsors. One characteristic of events is the existence of sponsors. Event homepages list their sponsors and also advertise additional sponsorship opportunities; we will refer to the latter as “sponsor benefits”. Here we base quality metrics on the willingness of sponsors to pay a certain amount of money for certain benefits. Events provide so-called “packages” with titles such as “Gold Sponsorship” or “Bronze Sponsorship”. These packages have different monetary values; as a real-world example, VLDB2017 charges $10,000 for a Gold Sponsorship and $3,000 for a Bronze Sponsorship. Common benefit classes can be identified, such as a “logo on the website” or an “advertisement in the conference brochure”, which are purchasable at several event series. Events can thus be compared by their benefits and the minimal price a sponsor must pay to obtain a given benefit. Table 1 shows four conference series with their offered options for a set of benefits over the past six years.
Table 1 Some benefits and their minimum price over different events
Before we compare event series, we look at a single series and how its benefit prices developed over the last six years. Each benefit in a single event series, with its price over the years, forms a single set of data points. For each set of data points, the gradient was calculated. We group the trend lines by event series and draw the family of trend lines in a single trend chart. With x being the year and y the monetary value, we calculated the gradient m of the trend line for N data points with the following formula:
$$\begin{aligned} m = \frac{ N \sum {(xy)} - \sum {(x)} \cdot \sum {(y)}}{ N \sum {(x^2)} - \left(\sum {x}\right)^2 } \end{aligned}$$
In the next step, we calculate the intercept b with the y axis as
$$\begin{aligned} b = \frac{\sum {(y)} - m \cdot \sum {(x)}}{N} \end{aligned}$$
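The two formulas above translate directly into code. The following sketch computes the gradient and intercept for one benefit; the intermediate price values are illustrative, and only the 2012 and 2017 prices correspond to the SEMANTiCS “booth at the conference” benefit discussed below.

```python
# Sketch of the trend-line fit from the two formulas above: for each
# benefit, the (x, y) pairs are (year, minimum price in EUR).

def trend_line(points):
    """Return gradient m and intercept b of the least-squares line."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# "Booth at the conference": the 2013-2016 values are illustrative.
booth = [(2012, 2200), (2014, 3000), (2016, 4000), (2017, 4750)]
m, b = trend_line(booth)
print(f"gradient {m:.1f} EUR/year, intercept {b:.1f}")  # positive trend
```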
Each data point for a single common benefit of a single event in a series is given as a 2D vector: the year in the first dimension and the monetary value in the second. Figure 5 shows such a trend chart for the SEMANTiCS conference series for the years 2012 to 2017. In this period, sponsors could obtain the following benefit types: acknowledgment in press releases, free conference registrations, advertisements in the conference brochures, advertisement via social media, advertisement inside the conference material, proceedings, and participant bags, an article on the conference website, banners at the conference venue (physical conferences), a booth at the conference, a logo on the conference website, a logo in the conference brochure, an own workshop or co-occurring event, speeches at the conference, sub-pages on the website, tweets with specific hashtags, and Twitter followers gained through the conference itself or its participants. Each benefit forms a single set of data points, with the monetary value of the benefit along the y axis. As the gradients of the trend lines are not always easy to distinguish, we colored trend lines with a positive gradient in half-opaque green and those with a negative gradient in half-opaque orange. Each trend line starts at the first year the benefit is available and ends at the last year it is available. For SEMANTiCS, we observed nine positive and five negative trends overall. The strongest positive gradient among the long-term benefits belongs to the benefit “booth at the conference”, which cost a minimum of 2200€ in 2012 and 4750€ in 2017. The only higher gradient for SEMANTiCS is that of “acknowledgment in press releases”, which developed from 3500€ in 2012 to 4750€ in 2017. The two trends spanning the whole period from 2012 to 2017 are “logo on website” and “logo in conference brochure”. They started quite high but reduced their minimal price to a lower value in the last years, as can also be seen in Table 1. Another interesting observation in the trends is that when SEMANTiCS changed from a sister event, i-SEMANTiCS, in 2014 to a standalone event in 2015, many new benefits became available to sponsors.
Organizers’ origin The term “origin” refers to the current home location or workplace of a person, not their place of birth. Figure 6 shows the origin of the persons involved in organizing one of the events in the VLDB series from 2012 to 2017.
It can be noticed that, for VLDB, there are not many different countries per year, but some countries appear repeatedly every year. We therefore queried the data again, this time counting how many events in this period are associated with each country (via the persons involved in organizing the event). Table 2 shows the total number of persons per country from 2012 to 2017. In this ranking, Canada is only number eight, while Italy, which is associated with only two of the six events, is in the top five.
Table 2 Summed country participation in the number of organizing persons from VLDB2012 to VLDB2017
The key question here is: is there a trend for each country over the years? For readability, we only include the top ten countries and split them into two groups of five. Figures 7 and 8 show the number of persons from each country over the event series. We observed peaks in a country’s participation in organizing an event whenever the event was located in that country or a neighboring one. For example, Turkey is highly involved in the VLDB event of 2012, and India is highly involved in 2016. It seems that VLDB events rely on local organizers where possible.
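The aggregation behind Table 2 amounts to a simple count of organizing persons per country across all events of a series. A minimal sketch follows; the records are illustrative, not the actual VLDB data.

```python
# Sketch of the country aggregation behind Table 2: count, per country,
# how many organizing persons are associated with it across the series.
from collections import Counter

# (event, organizer country) pairs, one per organizing person (made up)
organizers = [
    ("VLDB2012", "Turkey"), ("VLDB2012", "USA"),
    ("VLDB2016", "India"), ("VLDB2016", "USA"),
    ("VLDB2017", "Germany"),
]

per_country = Counter(country for _, country in organizers)
for country, persons in per_country.most_common(10):  # top ten countries
    print(country, persons)
```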
Event duration A metric to match events to individual preferences regarding event duration and program structure can easily be derived from the event start and end dates. The event program structures of VLDB, SEMANTiCS, and WIMS were collected manually, as these data are not available in a structured way across all events in our sample. Figure 9 shows the average number of parallel sessions, the average number of presentations (rounded values) per session, and the event duration for VLDB, SEMANTiCS, and WIMS over the last decade. For VLDB2012, no program information is available, so the cells describing the program structure remain empty. Assume a researcher prefers events with a single track and no parallel sessions: they can use this metric to find matching events, such as the latest WIMS iterations. Conversely, if they want multiple parallel sessions, they can schedule the presentations they want to attend.
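A minimal sketch of the duration metric, with illustrative start and end dates rather than values from our dataset:

```python
# Sketch of the duration metric: days between event start and end date,
# derived directly from the collected metadata. Dates are illustrative.
from datetime import date

events = {
    "WIMS2019": (date(2019, 6, 26), date(2019, 6, 28)),
    "VLDB2017": (date(2017, 8, 28), date(2017, 9, 1)),
}

for acronym, (start, end) in events.items():
    duration = (end - start).days + 1  # inclusive of both days
    print(acronym, f"{duration} days")
```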
Acceptance Rate The acceptance rate of a conference in a particular year is defined as the ratio between the number of accepted articles and the number of submitted ones. The average acceptance rate (AAR) has been calculated over all editions of a particular series to get an overview of the overall acceptance rate of this series since its beginning. Figure 10 shows the average number of accepted and rejected papers of SEMANTiCS, ISWC, ESWC, and VLDB in the last decade (i.e., 2010–2020).
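A minimal sketch of the per-edition acceptance rate and the AAR, using illustrative placeholder counts rather than the published figures:

```python
# Sketch of the acceptance-rate metrics: per-edition acceptance rate
# (accepted / submitted) and the average acceptance rate (AAR) over
# all editions of a series. The counts below are made-up placeholders.
submissions = {  # edition year -> (accepted, submitted)
    2018: (40, 160),
    2019: (45, 150),
    2020: (50, 170),
}

rates = {year: acc / sub for year, (acc, sub) in submissions.items()}
aar = sum(rates.values()) / len(rates)
print({y: f"{r:.1%}" for y, r in rates.items()}, f"AAR {aar:.1%}")
```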
Events Co-location Many scientific events have co-located events, often categorized as conferences, workshops, tutorials, presentations, or exhibitions. The latter is often connected to a special sponsorship model. We reviewed the events co-located with SEMANTiCS and VLDB in the years 2012 to 2017. Figure 11 shows the number of co-located events and tutorials at SEMANTiCS, VLDB, ISWC, and ESWC in the period 2010–2020. ISWC has a very strong standing, with an average of 17 workshops over the whole period. In comparison, SEMANTiCS has the lowest average, with 5 co-located workshops per event.
Keynote Speaker All events in our dataset have keynote speeches in their program. Renowned keynote speakers, selected for their expertise in a special field, their accomplishments, or their affiliation, are an option to raise interest in attending the event. At the moment, author-level metrics are widely used to assess the reputation of a scientist. These include the widely used h-index Hirsch (2005) and the i10 index created by Google Scholar. All authorship statistics for this work were obtained from the respective Google Scholar profiles. Table 3 shows all keynote speakers of SEMANTiCS and ESWC, their affiliations, and the average author-level metrics of all speakers in the period 2012–2020. The data collected over the past seven years shows that some events tend toward industry, while others tend toward academia, based on the affiliation of the keynote speakers. Each individual SEMANTiCS event has at least three keynote speakers with an industrial affiliation; in 2014, there was no keynote speaker from academia at all, and only in 2018 did speakers from academia exceed those from industry. At ESWC, the number of speakers from academia exceeds the number from industry in most years. On average, four keynotes from industry and two from academia were observed for the SEMANTiCS series from 2012 to 2018, while an average of two keynotes each from industry and academia were given at the ESWC series in the same period.
Table 3 The average h-index and i10 of the keynote speakers at SEMANTiCS and ESWC in the period 2012–2020
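Both author-level metrics used in Table 3 are straightforward to compute from a list of per-paper citation counts. The following sketch uses made-up counts rather than actual Google Scholar profiles.

```python
# Sketch of the h-index (Hirsch 2005) and the i10 index used in Table 3:
# h is the largest number such that the author has h papers with at
# least h citations each; i10 counts papers with at least 10 citations.
# The citation counts below are made-up examples, not real profiles.

def h_index(citations):
    cites = sorted(citations, reverse=True)
    # For descending counts, the qualifying ranks form a prefix, so a
    # simple count of ranks with cites[rank-1] >= rank yields h.
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def i10_index(citations):
    return sum(1 for c in citations if c >= 10)

speaker_citations = [120, 85, 40, 12, 9, 3]  # per-paper citation counts
print(h_index(speaker_citations), i10_index(speaker_citations))  # 5 4
```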