Indices

Abstract

Usually, applications work only with a subset of records at a time. Therefore, before such a subset can be processed, it must be located within the database. Hence, records should be stored in a manner that makes it possible to locate them efficiently whenever they are needed. The process of locating a specific set of records is determined by the predicates used to characterize these records.

18.1 Indices: A Query Optimization Approach

Usually, applications work only with a subset of records at a time. Therefore, before such a subset can be processed, it must be located within the database. Hence, records should be stored in a manner that makes it possible to locate them efficiently whenever they are needed. The process of locating a specific set of records is determined by the predicates used to characterize these records.

SanssouciDB organizes its records in columns (see Chap. 8). To determine the records that match a query, it is necessary to scan all columns that are used as filter criteria. In main memory column-oriented databases, which store column values contiguously, i.e. in adjacent memory blocks, searching for a value with a full scan (iterating through all items placed sequentially in memory) can be orders of magnitude faster than in row-oriented databases. Therefore, the usefulness of index structures in such databases is limited. Nevertheless, because the complexity of a full column scan is linear in the number of records, it is only a matter of data volume until the speed advantage of indices becomes relevant for main memory column-oriented databases as well.
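To make the full column scan concrete, the following minimal Python sketch filters a small dictionary-encoded column by iterating over the complete attribute vector. The column content and names are made up for illustration only and do not correspond to any figure in this chapter.

# Minimal sketch of a full scan on a dictionary-encoded column (hypothetical data).
dictionary = ["athens", "berlin", "dresden", "lyon"]   # sorted distinct values
attribute_vector = [3, 0, 2, 1, 2, 0, 1, 3]            # one valueID per record

def full_scan(value):
    """Return all record positions whose value equals `value`."""
    # Resolve the search value to its valueID (linear here for brevity;
    # a binary search would be used on the sorted dictionary in practice).
    value_id = dictionary.index(value)
    # The full scan reads every single entry of the attribute vector.
    return [pos for pos, vid in enumerate(attribute_vector) if vid == value_id]

print(full_scan("berlin"))   # -> [3, 6]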

In this chapter, we discuss the topic of inverted indices in the context of main memory databases in more detail.

18.2 Technical Considerations

Let us look at the world_population table from Chap. 11 again. For the reader's convenience, it is repeated in Fig. 18.1. Let us assume that we want to locate the records of all people from Berlin. The dictionary and the attribute vector of the city column are shown in Fig. 18.2.
Fig. 18.1

Example database table named world_population

Fig. 18.2

City column of the world_population table

Fig. 18.3

Index offset and index positions

Fig. 18.4

Query processing using indices: Step 1

To determine the set of berlin records, we need to set a filter criterion on the city attribute of the table. The respective SQL query could look like the one depicted in Listing 18.1.

Now, let us do a back-of-the-envelope calculation. Assume the table contains 8 billion records, our CPU is able to process 2 GB per second per core, and the city column is encoded using 20 bits per value. The memory footprint of the city column’s attribute vector can be calculated using
$$ 8\,\text{billion} \cdot 20\,\text{bit} = 160\, \text{billion\;bit}= 20\, \text{billion}\; \text{Byte}\approx 18.6\,\rm{GB} $$
The time it takes a single core to process that amount of data can be calculated using
$$ 18.6\,\rm{GB} \div 2\,\rm{GB}/\rm{sec} \approx 9.3\,\rm{sec} $$
This calculation shows that scanning the whole column with 40 cores takes \({\approx}230\,\rm{ms}\). Although such a scan speed is unthinkable for a row-oriented database, it might still not be sufficient for all applications.
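For readers who want to replay the calculation, a few lines of Python reproduce it. The numbers (8 billion rows, 20 bits per value, 2 GB per second per core, 40 cores) are the assumptions stated above.

# Back-of-the-envelope calculation with the assumptions from the text.
rows = 8 * 10**9            # records in the world_population table
bits_per_value = 20         # encoded width of the city column
bytes_per_sec = 2 * 2**30   # scan speed per core (2 GB per second)
cores = 40

column_bytes = rows * bits_per_value / 8
print(column_bytes / 2**30)                   # ~18.6 GB attribute vector
print(column_bytes / bytes_per_sec)           # ~9.3 s on a single core
print(column_bytes / bytes_per_sec / cores)   # ~0.23 s on 40 cores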

18.3 Inverted Index

Now, let us investigate a more complicated, but more efficient algorithm presented in detail in [FSKP12]. We consider an inverted index for the attribute city. An inverted index maps each distinct value to a position list, which contains all positions where the distinct value can be found in the column. The index for the dictionary-encoded column consists of the following two parts (Fig. 18.3):
  • Index offsets (IO): this vector stores, for each dictionary entry (or, in other words, for each distinct value of the attribute vector), the offset of its position list in the index positions vector. This means that the offset vector references the first occurrence of the respective valueID in the positions vector.

  • Index positions (IP): this vector contains the position lists of all distinct values of the attribute vector, sorted by the integer valueID. In contrast, the attribute vector stores valueIDs ordered by position. A small code sketch of both structures is given below.
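The following Python sketch builds these two structures for a small, hypothetical dictionary-encoded column; the values are made up and do not correspond to the data in Fig. 18.3.

# Hypothetical dictionary-encoded column (not the data from the figures).
dictionary = ["athens", "berlin", "dresden", "lyon"]   # sorted distinct values
attribute_vector = [3, 0, 2, 1, 2, 0, 1, 3]            # valueIDs stored by position

def build_index(attribute_vector, num_distinct_values):
    """Build the inverted index: index offsets (IO) and index positions (IP)."""
    # IP: all attribute vector positions, grouped and sorted by valueID.
    ip = sorted(range(len(attribute_vector)), key=lambda pos: attribute_vector[pos])
    # IO: for each valueID, the offset of its first entry in IP.
    counts = [0] * num_distinct_values
    for value_id in attribute_vector:
        counts[value_id] += 1
    io, offset = [], 0
    for count in counts:
        io.append(offset)
        offset += count
    return io, ip

io, ip = build_index(attribute_vector, len(dictionary))
print(io)   # [0, 2, 4, 6] -> first IP offset for each valueID
print(ip)   # [1, 5, 3, 6, 2, 4, 0, 7] -> positions sorted by valueID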

Let us see how much data the CPU has to read when using an index. We continue with the query shown in Listing 18.1. The following steps need to be executed to determine the positions of berlin in the attribute vector (a code sketch of the complete lookup follows the list):
  1. We need to perform a binary search on the dictionary to determine the dictionary position related to berlin. As depicted in Fig. 18.4, berlin is at position 1.
  2. The dictionary position of berlin corresponds directly to a position in the index offset vector shown in Fig. 18.5. In this example, the dictionary position of berlin is 1, so the corresponding index offset vector position is 1.
  3. Since the attribute of the respective search criterion is not necessarily a primary key, the same value may occur in many records. Consequently, more than one attribute vector entry can be filled with that value. As explained in the beginning of this chapter, the index positions vector holds the attribute vector positions sorted by valueID. To determine the range of values to read from the index positions vector, we simply read the value of the index offset vector at the position we determined in Step 1 and the value at the next higher position (see Fig. 18.6).
  4. Since 3 is already the offset of the next value in the index positions vector, we only need to read positions 1 and 2. As shown in Fig. 18.7, IP vector position 1 contains the value 4 and position 2 contains the value 6, which are the exact positions of the dictionary code for berlin in the attribute vector. By retrieving the offsets of berlin and dresden from the IO vector, we are able to determine the exact range of values we need to read in order to resolve the respective attribute vector positions of berlin.
  5. With the positions resolved in Step 4, we can jump directly to the respective attribute vector positions of all other columns of that table in order to materialize the complete records of all the people that live in Berlin (Fig. 18.8).
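Condensed into code, the lookup described above might look as follows. The dictionary, IO, and IP vectors are the hypothetical toy values from the previous sketch, not those shown in Figs. 18.4–18.8.

from bisect import bisect_left

# Hypothetical toy data (structure as in Fig. 18.3, values made up).
dictionary = ["athens", "berlin", "dresden", "lyon"]    # sorted distinct values
io = [0, 2, 4, 6]                                       # index offsets
ip = [1, 5, 3, 6, 2, 4, 0, 7]                           # index positions

def index_lookup(value):
    """Return all attribute vector positions holding `value`, using the index.
    Assumes `value` is present in the dictionary."""
    value_id = bisect_left(dictionary, value)             # Step 1: binary search on the dictionary
    lower = io[value_id]                                   # Steps 2 and 3: offsets from IO
    upper = io[value_id + 1] if value_id + 1 < len(io) else len(ip)
    return ip[lower:upper]                                 # Step 4: read the position range from IP

positions = index_lookup("berlin")
print(positions)   # -> [3, 6]
# Step 5: these positions can now be used to access the other columns of the
# table directly and materialize the complete records.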

     
Using this approach, we reduce the data volume a CPU has to read from main memory by providing a data structure that does not require scanning the entire attribute vector. Investigations regarding the influence of indices on memory traffic and performance are presented in Sect. 18.4.

18.4 Discussion

In the previous section, we explained the idea of using an inverted index on a dictionary-encoded column to improve the response time of lookup requests. An index increases the memory consumption per column. In this section, we compare data lookup using a full column scan against lookup via an index with regard to memory consumption and performance. The symbols used in the following are summarized in the table below.

18.4.1 Memory Consumption

In the beginning of this chapter, we explained that an index consists of an IO vector and an IP vector. To determine the overall size of the index, we need to calculate the size of these two structures.
$$ I_m = IO_m + IP_m $$
The allocated memory of a vector can simply be calculated by multiplying its length (number of entries) by its width (size of a single entry).
$$ \begin{aligned}IO_{m}&= IO_{l} \cdot IO_{w}\\IP_{m} &= IP_{l} \cdot IP_{w}\end{aligned}$$
The length of IP directly corresponds to the length of the attribute vector \(AV_l\), since IP contains exactly one entry for each attribute vector position. The width of IP is determined by the number of bits required to address a position in the attribute vector, since IP stores direct positions of values in the attribute vector.
$$\begin{aligned}IP_{l}&= AV_{l}\\IP_{w}&= \lceil \log_2(AV_l)\rceil \,\text{bits}\end{aligned}$$
The length of IO directly corresponds to the length of the dictionary \(D_l\), which in turn is determined by the number of distinct values in the respective column. The width of IO is derived from the biggest offset into IP, because IO contains the bit-encoded offsets used to determine the position ranges in IP. As the maximum offset stored in IO can be the length of IP, the resulting width of IO is \(\lceil log_2(IP_l) \rceil \).
$$\begin{aligned}IO_l&= D_l\\IO_w&= \lceil \log_2(IP_l) \rceil \,\text{bits}\end{aligned}$$
Summarizing, we combine the above formulas into a single equation for calculating the size of an index structure.
$$\begin{aligned}I_m &= D_l \cdot \lceil \log_2(IP_l) \rceil + AV_l \cdot \lceil \log_2(AV_l) \rceil \,\text{bits}\\I_m &= (D_l + AV_l) \cdot \lceil \log_2(AV_l) \rceil \,\text{bits}\end{aligned}$$
Let us now calculate the actual size of an index for the city column of our world_population table from Fig. 18.1. We need to determine \(D_l\), \(IP_l\), and \(AV_l\). Based on the assumption that there are about 1 million cities around the world and that the world population is 8 billion, we just need to insert these numbers into our formula.
$$ I_m = 10^6 \cdot \lceil \log_2(8 \cdot 10^9) \rceil + 8 \cdot 10^9 \cdot \lceil \log_2(8 \cdot 10^9) \rceil \,\text{bits} $$
So from this formula, we get an index size of about 31 GB for the city column.
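The same number can be obtained by evaluating the formula directly, for example with a few lines of Python; the inputs are the assumed 1 million distinct cities and 8 billion rows.

import math

D_l  = 10**6          # dictionary length (distinct cities)
AV_l = 8 * 10**9      # attribute vector length (rows); IP has the same length

index_bits = D_l * math.ceil(math.log2(AV_l)) + AV_l * math.ceil(math.log2(AV_l))
print(index_bits / 8 / 2**30)   # ~31 GB index size for the city column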

18.4.2 Lookup Performance

Regardless of whether an index is used, we need to perform a binary search on the dictionary to determine the encoded value for the respective search term. Let us assume that we need to read \(\log_2(D_l)\) entries to perform the binary search. Since the binary search on the dictionary has to be done for both access methods, we can ignore it when comparing them.

Description                                                      Unit    Symbol
Memory consumption of the index                                  bits    \(I_m\)
Length of the index offset vector                                –       \(IO_l\)
Width of the index offset vector                                 bits    \(IO_w\)
Memory consumption of the index offset vector                    bits    \(IO_m\)
Length of the index positions vector                             –       \(IP_l\)
Width of the index positions vector                              bits    \(IP_w\)
Memory consumption of the index positions vector                 bits    \(IP_m\)
Length of the dictionary (number of distinct values in column)   –       \(D_l\)
Length of the attribute vector                                   –       \(AV_l\)
Width of the attribute vector                                    bits    \(AV_w\)

In the case of a full column scan, we need to traverse the attribute vector sequentially, reading \(AV_l\) entries, each with a size of \(\lceil \log_2(D_l) \rceil\) bits. Again, assuming 8 billion rows in the table and 1 million cities, we need to read
$$ 8 \cdot 10^9 \cdot \lceil \log_2(10^6) \rceil \,\text{bits} = 160{,}000{,}000{,}000\,\text{bits} $$
for a full attribute vector scan.
Now, when using an index the situation is different. After the dictionary lookup, we directly read the upper and lower limit from the index offset vector (see Fig. 18.6). We neglect this step in our performance consideration, since the few bytes read here do not noticeably impact the overall scan performance. Having determined the upper and lower limit for the index positions vector, we need to traverse it (see Fig. 18.7). The number of entries to read from IP depends on the distribution of values in the column: for an attribute value that occurs frequently, we need to read many entries; for a less frequent value, fewer. Assuming a uniform distribution of values, we need to read \(AV_l \div D_l\) entries. The width of each entry is \( \lceil \log_2(AV_l) \rceil \) bits. Combining both equations, we get
$$ IndexPositions = \frac{AV_l \cdot \lceil \log_2(AV_l)\rceil }{D_l} $$
for the number of bits to read from the index positions vector. Taking our world population example, where we look for all people living in Berlin, we come up with
$$ \frac{8 \cdot 10^9 \cdot \lceil \log_2(8 \cdot 10^9)\rceil \,\text{bits}}{10^6} = 264{,}000\,\text{bits} $$
to read using an index. Assuming a CPU performance of 2 MB/ms per core, a single core needs about 9 s to scan the complete attribute vector. Accessing the column using the index, the CPU needs about 0.0157 ms to read the attribute vector positions of the people living in Berlin. Thus, in this example, the index improves performance by a factor of \({\approx }573{,}248\) compared to a sequential attribute vector scan.
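These figures can be replayed with a short Python calculation. It uses the same assumptions as above (8 billion rows, 1 million cities, uniform distribution, 2 MB/ms per core); the exact millisecond values depend on how MB and GB are interpreted.

import math

AV_l = 8 * 10**9                     # attribute vector length (rows)
D_l  = 10**6                         # dictionary length (distinct cities)
bits_per_ms = 2 * 2**20 * 8          # scan speed per core (2 MB/ms)

scan_bits  = AV_l * math.ceil(math.log2(D_l))            # full attribute vector scan
index_bits = AV_l * math.ceil(math.log2(AV_l)) / D_l     # IP entries read for one city

print(scan_bits, scan_bits / bits_per_ms)     # 160,000,000,000 bits, roughly 9,500 ms
print(index_bits, index_bits / bits_per_ms)   # 264,000 bits, roughly 0.016 ms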
Fig. 18.5

Query processing using indices: Step 2

Fig. 18.6

Query processing using indices: Step 3

Fig. 18.7

Query processing using indices: Step 4

Fig. 18.8

Query processing using indices: Step 5

Fig. 18.9

Attribute vector scan versus index position list read for a column with 30 million entries (note the log-log scale)

We compare the theoretical memory traffic of the attribute vector scan and the index position read in Fig. 18.9 for different dictionary sizes on a column with 30 million entries. With a uniform distribution, the index leads to less memory traffic if at least 8 distinct values are present.

Reference

[FSKP12] M. Faust, D. Schwalb, J. Krueger, H. Plattner: Fast lookups for in-memory column stores: group-key indices, lookup and maintenance. In: ADMS '12: Proceedings of the 3rd International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures at VLDB '12 (2012)
