
Stateless model checking of the Linux kernel’s read–copy update (RCU)


Read–copy update (RCU) is a synchronization mechanism used heavily in key components of the Linux kernel, such as the virtual filesystem (VFS), to achieve scalability by exploiting RCU’s ability to allow concurrent reads and updates. RCU’s design is non-trivial and requires significant effort to fully understand, let alone to become convinced that its implementation is faithful to its specification and provides its claimed properties. The fact that Linux kernels are becoming increasingly complex and are employed in machines with ever more cores and weak memory does not make the situation any easier. This article presents an approach to systematically test the code of the main RCU implementation used in the Linux kernel (Tree RCU) for concurrency errors, both under sequential consistency and under weak memory. Our modeling allows Nidhugg, a stateless model checking tool, to reproduce, within seconds, safety and liveness bugs that have been reported for RCU. Additionally, we present the real cause behind some failures that had been observed in production systems in the past. More importantly, we were able to verify both the publish–subscribe guarantee and the grace-period guarantee, the latter being the basic and most important guarantee that RCU offers, on several Linux kernel versions, for particular configurations. Our approach is effective, both in dealing with the increased complexity of recent Linux kernels and in terms of the time that the process requires. We hold that our effort constitutes a good first step toward making tools such as Nidhugg part of the standard testing infrastructure of the Linux kernel.


The Linux kernel is used in a surprisingly large number of devices: from PCs and servers to routers and smart TVs. For example, in 2015, more than one billion smart phones used a modified version of the Linux kernel [1] and, in 2017, all modern supercomputers used Linux as well [2]. Now, with many IoT devices shifting their operating system to a Linux-based one [3], this number is only bound to increase. Therefore, it is self-evident that the correct and reliable operation of the Linux kernel is of great importance, which renders thorough testing and verification of its components a necessity.

Naturally, this process needs to span all of the kernel’s components and subsystems. One particular subsystem with a non-trivial implementation is the read–copy update (RCU) mechanism [4, 5]. RCU is a synchronization mechanism that provides excellent scalability by enabling concurrent reads and updates. RCU’s implementation is quite involved, as RCU interacts with many other subsystems of the Linux kernel, making the precise modeling of RCU’s environment arduous. Moreover, the lockless design of its fastpaths, and the fact that it needs to operate in heavily concurrent environments, render the modeling and verification process even more challenging. The relatively short release cycle of the Linux kernel (currently, there is a new release approximately every 2 months), the number of changes that are involved in each release, and the increasing complexity of the kernel’s code call for thorough and automatic testing. In fact, the Linux code base already contains a fair number of regression test suites, including a so-called torture test suite for its RCU component [6]. Still, the fact that concurrency bugs manage to survive—maybe only under particular configurations, architectures and memory models—even after heavy stress testing underlines the need for employing more powerful bug-finding techniques, such as software model checking, that are able to operate on as big a percentage of the actual code of the Linux kernel as possible.

This article reports on the use of stateless model checking (also known as systematic concurrency testing) for testing the core of Tiny RCU and Tree RCU, both being RCU implementations used in the Linux kernel.

First, after a brief introduction to RCU (Sect. 2) and stateless model checking (Sect. 3), we demonstrate how we used the tool Nidhugg [7] to verify the grace-period guarantee, which is the basic guarantee that RCU offers, for Tiny RCU, an implementation of RCU for uniprocessor systems (Sect. 4).

Next, after describing the implementation of Tree RCU (Sect. 5), we show how we obtained a reusable model for the kernel’s environment (Sect. 6), necessary for the verification procedure. Using this model, as well as the source code from five different kernel versions directly, we verified both a part of the publish–subscribe guarantee (Sect. 7) and the grace-period guarantee (Sect. 8) for Tree RCU, the main RCU implementation used in the Linux kernel. Our effort concentrated on particular kernel configurations (non-preemptible builds), but we also investigated the effects that weak memory models (TSO and PSO) may have on RCU’s operation.

In order to strengthen our verification claim, we injected concurrency bugs similar to ones that have been reported throughout the development of RCU and, in all cases, our tool was able to come up with scenarios in which they occur. In particular, we were also able to demonstrate that a submitted patch, intended to impose a locking design, in reality fixed a much more serious bug that was responsible for failures observed in production systems some years back, a fact that was previously unknown. We report on this issue and present the exact conditions under which this bug occurs (Sect. 9).

Finally, we discuss some limitations of our approach, as well as the main lessons learned from the verification procedure (Sect. 10). As demonstrated by our results, our technique handles real code employed in today’s production systems in an efficient and scalable way, especially compared to other tools (Sect. 11), raising hopes regarding the inclusion of a stateless model checking tool such as Nidhugg in the standard testing infrastructure of the Linux kernel.

This article is the journal version of a conference paper [8] that has appeared in the proceedings of SPIN 2017. Compared with that paper, this article contains the following additional material:

  • Stateless model checking is described in more detail, giving insights on how tools such as Nidhugg operate.

  • More intuition on the kernel’s environment modeling is provided, via the case study of Tiny RCU.

  • The publish–subscribe guarantee is described and verified.

  • The grace-period guarantee is verified also under the PSO memory model.

  • More details regarding the reproduced bug are given.

  • The limitations of our approach are discussed, and more details regarding the model have been added.

Read–copy update (RCU)

Read–copy update is a synchronization mechanism invented by McKenney and Slingwine [4, 5] that has been part of the Linux kernel since 2002. The key feature of RCU is the good scalability it provides by allowing concurrent reads and updates. While this may seem counter-intuitive or even impossible at first, RCU achieves it in a very simple yet extremely efficient way: by maintaining multiple data versions. RCU is carefully orchestrated in a way that not only ensures that reads are coherent and that no data will be deleted until it is certain that no one holds references to them, but also uses efficient and scalable mechanisms which make read paths extremely fast. Most notably, in non-preemptible kernels, which are the ones we focus on in this work, RCU imposes zero overhead on readers.

How RCU works

The basic idea behind RCU is to split updates into two phases: the removal phase and the reclamation phase. During the removal phase, an updater removes references to data either by destroying them (i.e., setting them to NULL) or by replacing them with references to newer versions of these data. This phase can run concurrently with reads because modern microprocessors guarantee that a reader will see either the old or the new reference to an object, and not a weird mash-up of the two or a partially updated reference. During the reclamation phase, the updater frees the items removed in the removal phase, i.e., these items are reclaimed. Of course, since RCU allows concurrent reads and updates, the reclamation phase must begin after the removal phase and, more specifically, only when it is certain that there are no readers accessing or holding references to the data being reclaimed.

The typical update procedure using RCU looks as follows [4].

  1. Ensure that all readers accessing RCU-protected data structures carry out their references from within an RCU read-side critical section.

  2. Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it.

  3. Wait until all pre-existing readers complete their RCU read-side critical sections, so that no one holds a reference to the item being removed.

  4. At this point, there cannot be any readers still holding references to the data structure, which may now be safely freed.

Note that steps 2 (the removal phase) and 4 (the reclamation phase) in the above procedure are not necessarily performed by the same thread.

Waiting for pre-existing readers can be achieved either by blocking (via synchronize_rcu()) or by registering a callback that will be invoked after all pre-existing readers have completed their RCU read-side critical sections (via call_rcu()).

In order to formalize some of the aspects presented above, we provide some definitions.

Definition 1

(Quiescent state) Any statement that is not within an RCU read-side critical section is said to be in a quiescent state.

Statements in quiescent states are not permitted to hold references to RCU-protected data structures. (In the Linux kernel, this is checked with the tool sparse [9].) Note that different RCU flavors have different sets of quiescent states.

Definition 2

(Grace period) Any time period during which each CPU resides at least once in a quiescent state is called a grace period.

Consequently, if an RCU read-side critical section started before the beginning of a specified grace period GP, it would have to complete before the end of GP. This means that the reclamation phase has to wait for at least one grace period to elapse before it begins. Once a grace period has elapsed, there can no longer be any readers holding references to the old version of a newly updated data structure (since each CPU has passed through a quiescent state) and the reclamation phase can safely begin.

RCU specifications

Let us now present some requirements that every RCU implementation must fulfill. We do not attempt to present a formal or a complete specification for RCU.Footnote 1 Instead, we only present the basic guarantees of RCU.

Grace-period guarantee In RCU, the fact that updaters wait for all pre-existing readers to complete their read-side critical sections constitutes the only interaction between the readers and the updaters. The grace-period guarantee is what allows updaters to wait for all pre-existing RCU read-side critical sections to complete. Such critical sections start with the function rcu_read_lock() and end with rcu_read_unlock(). These functions do not block or spin, and in non-preemptible kernels, they are effectively no-ops.

What this guarantee means is that the RCU implementation must ensure that any read-side critical sections in progress at the start of a given grace period will have completely finished (including memory operations) before that grace period ends. This very fact allows RCU verification to be focused; every correct implementation has to adhere to the following rule:

If any statement in a given RCU read-side critical section CS precedes a grace period GP, then all statements (including memory operations) in CS must complete before GP ends.

Memory operations are included here in order to prevent the compiler or the CPU from undoing work done by RCU.

Fig. 1 RCU’s grace-period guarantee litmus test

In order to see what this guarantee really implies, consider the code fragment in Fig. 1. In this code, since synchronize_rcu() has to wait for all pre-existing readers to complete their RCU read-side critical sections, the outcome:

$$\begin{aligned} \texttt{\small r\_x == 0 \&\& r\_y == 1} \end{aligned}$$

should be impossible. This is what the grace-period guarantee is all about. It is the most important guarantee that RCU provides; in effect, it constitutes the core of RCU. The description of how this guarantee is achieved though is deferred to Sects. 4 and 5.

Publish–subscribe guarantee This guarantee is used in order to coordinate read-side accesses to data structures. The publish–subscribe mechanism is used for data insertion into data structures (e.g., lists), without disrupting possible concurrent readers. Since updaters run concurrently with readers, this mechanism should ensure two things: first, that updaters will have completed all initialization operations before publishing a data structure, and second, that readers will not see uninitialized data. Note that the latter may occur even if the updaters publish a data structure after all initializations have completed and are visible to other CPUs (e.g., by a compiler that does value-speculation optimizations, or by a CPU that reorders dependent loads).

In order to achieve this, RCU offers two primitives: rcu_assign_pointer() and rcu_dereference(). The primitive rcu_assign_pointer() has semantics similar to C11’s memory_order_release operation. In effect, it is similar to an assignment but also provides additional ordering guarantees. The second primitive, rcu_dereference(), on the other hand, can be considered a subscription to the value of a specified pointer and guarantees that subsequent dereference operations will see any initialization that took place before the rcu_assign_pointer() (publish) operation. The rcu_dereference() primitive has semantics similar to C11’s memory_order_consume load and uses both volatile casts and memory barriers in order to provide the aforementioned guarantee.

Consider the code fragment in Fig. 2, which constitutes a classic publish–subscribe scenario. In this example, it is guaranteed that the subscriber will not see uninitialized values for the field values of p, i.e., that the outcome:

$$\begin{aligned} \texttt {\small p->a != 42 || p->b != 42} \end{aligned}$$

is impossible.

Fig. 2 RCU’s publish–subscribe guarantee litmus test

Stateless model checking

Stateless model checking (SMC) [10], also known as systematic concurrency testing, is a testing and verification technique with low memory requirements that is applicable to programs with executions of finite length. Stateless model checking tools explore the executions of a program without explicitly storing the executions they have previously visited. The technique has been successfully implemented in tools such as VeriSoft [11], CHESS [12], Concuerror [13], Nidhugg [7], rInspect [14], CDSChecker [15], and RCMC [16].

At least conceptually, the technique that SMC tools employ is very simple. Given an entry point to a concurrent program, whose code possibly contains some assertions that express its correctness properties, an SMC tool takes control of the scheduler and systematically explores all different ways that its threads can be interleaved.

Fig. 3 A concurrent program (its correctness property as an assertion) and two of its interleavings. Shaded nodes are states that are forgotten when exploring the second interleaving. The blue edge shows the first step of the next interleaving that could be explored, if needed

For example, consider the program shown in Fig. 3 in which a \({ main}\) thread spawns two concurrent threads, p and q, which issue write operations on two different shared variables x and y whose initial value is 0. From the initial state (0, 0), an SMC tool could start by exploring the interleaving where the two steps of thread p are performed first, followed by the two steps of thread q, thereby reaching state (2, 2) that satisfies the assertion. The second interleaving which is explored could be the one in which thread p is preempted after its first step, and thread q executes at that point. In this interleaving, execution reaches final state (2, 1) in which the assertion is violated. At this point, the exploration has detected a concurrency error and can stop. In contrast, had the assertion been e.g., \(\texttt {\small assert(abs(x - y) < 2)}\), which would not be violated by the program, then the exploration would need to examine all interleavings that reach different states. This is a general phenomenon in stateless model checking: errors are typically detected relatively fast, but verification needs to explore the complete search space and thus can take considerably longer than bug finding.

To combat the combinatorial explosion in the number of interleavings that need to be examined in order to maintain full coverage of all program behaviors, SMC tools use partial-order reduction [17,18,19] techniques. Partial-order reduction is based on the observation that two interleavings can be considered equivalent if one can be obtained from the other by swapping adjacent, independent execution steps. Dynamic partial-order reduction (DPOR) algorithms capture dependencies (conflicts) between steps of concurrent threads, while the program is running [20, 21]. Each interleaving which is explored is used to identify dependent operations and program points where alternative interleavings need to be explored in order to capture all program behaviors.

Stateless model checking and DPOR techniques have been extended to handle effects of architectural or programming-language memory models [7, 14,15,16] in addition to scheduling non-determinism. Nidhugg [7], for example, the tool we have employed, is a stateless model checker for C/C++ programs that use pthreads, which incorporates extensions for finding bugs caused by weak memory models such as TSO, PSO and, partially, POWER. Nidhugg works on the level of the LLVM intermediate representation and, at the time this work was performed, employed an effective dynamic partial-order reduction algorithm called source-DPOR [21, 22]. We note in passing that Nidhugg has since been extended with a wider selection of DPOR algorithms: optimal-DPOR [22] and optimal-DPOR with observers [23].

However, in stateless model checking, all tests need to be data-deterministic and finite. Data determinism means that, in a given state, a given execution step must always lead the system to the same new state, i.e., the test case cannot depend on some unknown input or on timing properties (e.g., take some action depending on the value of the clock). Finiteness means that, for all test cases, there must be a bound \(n \in \mathbb {N}\) such that all executions of the program terminate within n execution steps. In Nidhugg, loops that may in principle execute for an unbounded number of times (e.g., spin loops for locks) are either automatically transformed to assume() statements or need to become bounded by using an appropriate unroll=n option, which makes Nidhugg not consider executions in which any loop performs more than n iterations.

Stateless model checking Tiny RCU

Now that we have laid down the basic foundations of RCU and stateless model checking, let us proceed with the verification of Tiny RCU [24], an implementation of RCU designed to run on uniprocessor systems, targeting mostly embedded applications. Although this is not the first verification attempt for Tiny RCU (cf. Sect. 11), it serves as a good first step toward understanding what a simple RCU implementation looks like and how it can be verified, while also providing very good intuition on how the Linux kernel’s environment can be modeled.

Tiny RCU implementation

Tiny RCU is only offered for non-preemptible uniprocessor systems.Footnote 2 This means that the definition of a grace period for Tiny RCU can be formulated as follows [24]:

Whenever the sole CPU of the system passes through a quiescent state, a grace period has elapsed.

This greatly simplifies the design of Tiny RCU’s implementation and renders the memory footprint of Tiny RCU much smaller than Tree RCU’s, since Tiny RCU requires much simpler RCU-related data structures.

Now the only question that needs to be answered is how RCU knows that a processor has passed through a quiescent state. For that, Tiny RCU relies on context switches, scheduling-clock interrupts and idle mode, just like Tree RCU (albeit in a slightly different way). For example, every time the sole CPU takes a scheduling-clock interrupt, RCU checks whether the CPU is in a quiescent state (e.g., in user-mode execution) and, if that is the case, marks the pending callbacks of the CPU (that have been registered via call_rcu()) as ready to invoke. These will be executed at some later point (not within the interrupt handler), when it is safe to do so. Of course, this is just a high-level view of how callbacks are handled, and a lot of details are omitted (e.g., how RCU knows which callbacks are ready to be invoked). We will refrain from presenting these details here and defer the discussion regarding the callback handling mechanism to Sect. 5.

Instead, let us focus on synchronize_rcu(), the implementation of which, unlike Tree RCU, is not based on callbacks. First, it is important to note that, in general, it is illegal to invoke synchronize_rcu() within RCU read-side critical sections, as in that case an updater would wait for itself to finish its read-side critical section, which would in turn lead to a deadlock. Consequently, and by also taking into account the fact that read-side critical sections cannot be preempted, every time synchronize_rcu() is called, the CPU is in a quiescent state. This in turn means that synchronize_rcu() can actually return immediately. In fact, this is the way this function is implementedFootnote 3 for Tiny RCU in recent kernels (following release v4.9.6).

In a similar way, it is easy to see why rcu_read_lock() and rcu_read_unlock() are effectively no-ops for Tiny RCU, since no particular actions are needed from the readers’ side, apart from using rcu_assign_pointer() and rcu_dereference(), of course.

Kernel environment modeling

Now that we have laid out the basic details of Tiny RCU’s implementation, a question that naturally arises is how one can model the various subsystems that RCU interacts with, in order to verify that implementation.

Suppose that we want to use a litmus test similar to the one of Fig. 1 as part of the verification of the grace-period guarantee for Tiny RCU. For that, we need at least one reader and one updater (two distinct threads) that will run on the sole CPU of the system. However, Nidhugg, given two threads, will systematically explore all possible interleavings between these two threads, disregarding the mutual exclusion the processor of the system imposes. Thus, we need to somehow restore this mutual exclusion between the threads. The solution is, of course, to use a mutex in order to emulate a CPU: a thread has to acquire the CPU’s lock in order to run on the CPU, and it releases the lock when it (voluntarily) yields. Similarly, functions that call the scheduler (e.g., cond_resched()) can be modeled as having a thread drop the CPU’s lock and then (possibly) re-acquire it. Then, Nidhugg will take care of exploring all the possible schedulings between the two threads, while respecting the mutual exclusion enforced by the CPU. As far as other kernel primitives are concerned, their definitions were either copied directly from the kernel or emulated where copying was not possible (e.g., cond_resched()).

Of course, there are other things that one should take care of (e.g., interaction with dyntick-idle mode,Footnote 4 interrupts). However, the purpose of this section is only to establish a basic intuition regarding how the kernel’s environment can be modeled, and not to describe that modeling in full detail; this is deferred to Sect. 6.


Using the model we constructed, we verified the grace-period guarantee for Tiny RCU, based on litmus tests similar to the one of Fig. 1. Due to the simplicity of Tiny RCU’s implementation, Nidhugg only needed 0.08 s to run our tests. We only used kernel v3.19 for our tests, since Tiny RCU is not so interesting compared to Tree RCU, and its implementation does not change so often.

In addition, we tried other scenarios where bugs were injected in the code (e.g., having the reader yield in the middle of its critical section), to determine whether Nidhugg would be able to detect those bugs, and, as expected, the answer was affirmative. Nidhugg provided us with the relevant traces that triggered these bugs.

As a last note, although we did not perform any tests that involved callback handling for Tiny RCU, doing so would not be hard; in fact, we did so for Tree RCU, but more on that in the next sections.

Tree RCU implementation

The Linux kernel offers many different RCU implementations, each one serving a different purpose. The first Linux kernel RCU implementation was Classic RCU. A problem with Classic RCU was lock contention due to the presence of one global lock that had to be acquired by each CPU wishing to report a quiescent state to RCU. In addition, Classic RCU had to wake up every CPU (even idle ones) at least once per grace period, thus increasing power consumption.

Tree RCU offers a solution to both these problems since it reduces lock contention and avoids awakening dyntick-idle [26] CPUs. It can easily scale to thousands of CPUs, while Classic RCU could only scale to several hundred. Apart from the original Tree RCU implementation, several flavors of Tree RCU are provided [27], for example:

  • RCU-sched, where anything that disables preemption acts as an RCU read-side critical section. This is useful if code segments with preemption disabled need to be treated as explicit RCU readers.

  • RCU-bh, where RCU read-side critical sections disable softirq processing. This is useful if grace periods need to complete even when softirqs monopolize one or more of the CPUs (e.g., if the code is subject to network-based denial-of-service attacks).

  • Sleepable RCU (SRCU), which is a specialized RCU version that permits general sleeping in RCU read-side critical sections.

In this article, we focus on the original Tree RCU implementation, which is the same as RCU-sched in non-preemptible builds.

Below we present a high-level explanation of Tree RCU along with some implementation details, a brief overview of its data structures, and some use cases that are helpful in understanding how RCU’s fundamental mechanisms are actually implemented.

High-level explanation

In Classic RCU, each CPU had to clear its bit in a field of a global data structure after passing through a quiescent state. Since CPUs operated concurrently on this data structure, a spinlock was used to protect the mask, and this design could potentially suffer from extreme contention.

Tree RCU avoids this performance and scalability bottleneck by creating a heap-like node hierarchy. The key here is that CPUs will not try to acquire the same node’s lock when trying to report a quiescent state to RCU; in contrast, CPUs are split into groups and each group will contend for a different node’s lock. Each CPU has to clear its bit in the corresponding node’s mask once per grace period. The last CPU to check in (i.e., to report a quiescent state to RCU) for each group will try to acquire the lock of the node’s parent, until the root node’s mask is cleared. This is when a grace period can end. A simple node hierarchy for a 6-CPU system is presented in Fig. 4.

Fig. 4
figure 4

Tree RCU node hierarchy (adapted from [28])

As can be seen in the figure, CPUs 0 and 1 will acquire the lower-left node’s lock, CPUs 2 and 3 will acquire the lower-middle node’s lock, and CPUs 4 and 5 will acquire the lower-right node’s lock. The last CPU reporting a quiescent state for each of the lower nodes will try to acquire the root node’s lock, and this procedure happens once per grace period.

The node hierarchy created by Tree RCU is tunable and is controlled, among others, by two Kconfig options, namely:

  • CONFIG_RCU_FANOUT_LEAF: Controls the maximum number of CPUs contending for a leaf-node’s lock. Default value is 16.

  • CONFIG_RCU_FANOUT: Controls the maximum number of CPUs contending for an inner-node’s lock. Default value is 32 for 32-bit systems and 64 for 64-bit systems.

More information can be found at the init/Kconfig file.

Data structures

Let us now present three major data structures (rcu_data, rcu_node, and rcu_state) of Tree RCU’s implementation.

Suppose that a CPU registers a callback that will eventually be invoked. Tree RCU needs to store some information regarding this callback. For this, the implementation maintains some data organized in the per-CPU rcu_data structure, which includes, among others:

  • the last completed grace-period number this CPU has seen; used for grace-period ending detection (completed);

  • the highest grace-period number this CPU is aware of having started (gpnum);

  • a Boolean variable indicating whether this CPU has passed through a quiescent state for this grace period;

  • a pointer to this CPU’s leaf of hierarchy; and

  • the mask that will be applied to the leaf’s mask (grpmask).

Of course, when a CPU registers a callback, this is also stored in the respective per-CPU data structure.

Then, when a CPU passes through a quiescent state, it has to report it to RCU by clearing its bit in the respective leaf node. The node hierarchy consists of rcu_node structures which include:

  • a lock protecting the respective node;

  • the current grace-period number for this node;

  • the last completed grace-period number for this node;

  • a bit-mask indicating CPUs or groups that need to check in in order for this grace period to proceed (qsmask);

  • a pointer to the node’s parent;

  • the mask that will be applied to parent node’s mask (grpmask); and

  • the number of the lowest and the highest CPU or group for this node.

Lastly, the RCU global state and the node hierarchy are included in an rcu_state structure. The node hierarchy is represented in heap form in a linear array, which is allocated statically at compile time based on the values of NR_CPUS and other Kconfig options. (Note that small systems have a hierarchy consisting of a single rcu_node.) The rcu_state structure contains, among others:

  • the node hierarchy;

  • a pointer to the per-CPU rcu_data variable;

  • the current grace-period number; and

  • the number of the last completed grace period.

There are several values that are propagated through these different structures, e.g., the grace-period number. However, this was not always the case, and it was often the discovery of bugs that led to such changes in the source code.

Finally, we have already mentioned that Classic RCU had a suboptimal dynticks interface, and that one of the main reasons for the creation of Tree RCU was to let sleeping CPUs lie, in order to conserve energy. Tree RCU avoids awakening low-power-state dynticks-idle CPUs using a per-CPU data structure called rcu_dynticks. This structure contains, among others:

  • a counter tracking the irq/process nesting level; and

  • a counter containing an even value for dynticks-idle mode, else containing an odd value.

These counters enable Tree RCU to wait only for CPUs that are not sleeping, and to let sleeping CPUs lie. How this is achieved is described below.

Use cases

The common usage of RCU involves registering a callback, waiting for all pre-existing readers to complete, and finally, invoking the callback. During all these, special care is taken to accommodate sleeping CPUs, offline CPUs and CPU hotplugs [29], CPUs in user-land, and CPUs that fail to report a quiescent state to RCU within a reasonable amount of time. In the next subsections, we will discuss some use cases of RCU, as well as the interaction of RCU with the described data structures, and the functions involved.

Registering a callback

A CPU registers a callback by invoking call_rcu(). This function queues an RCU callback that will be invoked after a specified grace period. The callback is placed in the callback list of the respective CPU’s rcu_data structure. This list is partitioned in four segments:

  1. The first segment contains entries that are ready to be invoked (DONE segment).

  2. The second segment contains entries that are waiting for the current grace period (WAIT segment).

  3. The third segment contains entries that are known to have arrived before the current grace period ended (NEXT_READY segment).

  4. The fourth segment contains entries that might have arrived after the current grace period ended (NEXT segment).

When a new callback is added to the list, it is inserted at the end of the fourth segment. More information regarding the callback list and its structure can be found in RCU’s documentation [30].
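The segmented list above can be sketched as a single linked list with one tail pointer per segment. The following is our own simplified illustration of the idea (types and names are ours, not the kernel's rcu_data fields); note how a new callback always lands at the end of the NEXT segment:

```c
#include <assert.h>
#include <stddef.h>

struct rcu_cb {
    struct rcu_cb *next;
    void (*func)(struct rcu_cb *);
};

enum { SEG_DONE, SEG_WAIT, SEG_NEXT_READY, SEG_NEXT, NSEGS };

struct cblist {
    struct rcu_cb  *head;
    struct rcu_cb **tails[NSEGS];   /* tails[i] points past the end of segment i */
};

void cblist_init(struct cblist *l)
{
    l->head = NULL;
    for (int i = 0; i < NSEGS; i++)
        l->tails[i] = &l->head;     /* all four segments start out empty */
}

/* New callbacks are inserted at the end of the fourth (NEXT) segment. */
void cblist_enqueue(struct cblist *l, struct rcu_cb *cb)
{
    cb->next = NULL;
    *l->tails[SEG_NEXT] = cb;
    l->tails[SEG_NEXT] = &cb->next;
}
```

Advancing callbacks between segments then amounts to moving these tail pointers, which is how the kernel's list re-arrangement works conceptually.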

In older kernels (e.g., v2.6.x), call_rcu() could start a new grace period directly, but this is no longer the case. In newer Linux kernels, the only way a grace period can be started directly by call_rcu() is if there are too many callbacks queued and no grace period in progress. Otherwise, a grace period will start from softirq context.
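A hedged sketch of this "start directly" condition (the helper function is ours; qhimark is the kernel's tunable for "too many callbacks", whose default we assume here to be 10000):

```c
#include <assert.h>

#define QHIMARK 10000   /* assumed default for "too many callbacks" */

/* Illustrative: call_rcu() may start a grace period directly only when
 * the callback count exceeds the threshold and no grace period is in
 * progress; otherwise the grace period starts from softirq context. */
int should_start_gp_directly(long n_callbacks, int gp_in_progress)
{
    return n_callbacks > QHIMARK && !gp_in_progress;
}
```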

Every softirq is associated with a function that will be invoked when this type of softirq is executed. For Tree RCU, this function is called rcu_process_callbacks(). So, when an RCU softirq is raised, this function will eventually be invoked (either at the exit from an interrupt handler or from a ksoftirqd/n kthread) and will start a grace period if there is need for one (e.g., if there is no grace period in progress and this CPU has newly registered callbacks, or there are callbacks that require an additional grace period). RCU softirqs are raised from rcu_check_callbacks(), which is invoked from scheduling-clock interrupts. If there is RCU-related work (e.g., if this CPU needs a new grace period), rcu_check_callbacks() raises a softirq.

The synchronize_rcu() function, which is implemented on top of call_rcu() in Tree RCU, registers a callback that will awake the caller after a grace period has elapsed. The caller waits on a completion variable and is consequently put on a wait queue.

Starting a grace period

The rcu_start_gp() function is responsible for starting a new grace period; it is normally invoked from softirq context, via an rcu_process_callbacks() call. However, in newer kernels, rcu_start_gp() neither directly starts a new grace period nor initializes the necessary data structures. Rather, it advances the CPU’s callbacks (i.e., properly re-arranges the segments) and then sets a flag in the rcu_state structure to indicate that a CPU requires a new grace period. The grace-period kthread is the one that will initialize the node hierarchy and the rcu_state structure and, by extension, start the new grace period.

The RCU grace-period kthread first excludes concurrent CPU-hotplug operations and then sets the quiescent-state-needed bits in all the rcu_node structures in the hierarchy corresponding to online CPUs. It also copies the grace-period number and the number of the last completed grace period in all the rcu_node structures. Concurrent CPU accesses will check only the leaves of the hierarchy, and other CPUs may or may not see their respective node initialized. However, each CPU has to enter the RCU core in order to acknowledge that a grace period has started and initialize its rcu_data structure. This means that each CPU (except for the one on which the grace-period kthread runs) needs to enter softirq context in order to see the new grace-period beginning (via rcu_process_callbacks()).

The grace-period kthread resolved many races present in older kernels, for example, races that occurred when CPUs requiring a new grace period tried to initialize the node hierarchy directly, something that could potentially lead to bugs; see Sect. 9.

Passing through a quiescent state

Quiescent states for Tree RCU (RCU-sched) include: (i) context switch, (ii) idle mode (idle loop or dynticks idle), and (iii) user-mode execution. When a CPU passes through a quiescent state, it updates its rcu_data structure by invoking rcu_sched_qs(). This function is invoked from scheduling-related functions, from the function rcu_check_callbacks(), and from the ksoftirqd/n kthreads. However, the fact that a CPU has passed through a quiescent state does not mean that RCU knows about it. After all, this fact has been recorded in the respective per-CPU rcu_data structure and not in the node hierarchy. Thus, a CPU has to report to RCU that it has passed through a quiescent state, and this will happen—again—from softirq context, via the rcu_process_callbacks() function; see below.

Reporting a quiescent state to RCU

After a CPU has passed through a quiescent state, it has to report it to RCU via rcu_process_callbacks(), a function whose duties include:

  • Awakening the RCU grace-period kthread (by invoking the rcu_start_gp() function), in order to initialize and start a new grace period, if there is need for one.

  • Acknowledging that a new grace period has started/ended. Every CPU except for the one on which the RCU grace-period kthread runs has to enter the RCU core and see that a new grace period has started/ended. This is done by invoking the function rcu_check_quiescent_state(), which in turn invokes note_gp_changes(). The latter advances this CPU’s callbacks and records to the respective rcu_data structure all the necessary information regarding the grace-period beginning/end.

  • Reporting that the current CPU has passed through a quiescent state (via rcu_report_qs_rdp(), which is invoked from rcu_check_quiescent_state()). If the current CPU is the last one to report a quiescent state, the RCU grace-period kthread is awakened once again in order to clean up after the old grace period and propagate the new ->completed value to the rcu_node structures of the hierarchy.

  • Invoking any callbacks whose grace period has ended.

As can be seen, the RCU grace-period kthread is used heavily to coordinate grace-period beginnings and ends. Apart from this, the locks of the nodes in the hierarchy are used to prevent concurrent accesses which might lead to problems; see Sect. 9.

Entering/exiting dynticks-idle mode

When a CPU enters dynticks-idle mode, rcu_idle_enter() is invoked. This function decrements a per-CPU nesting variable (dynticks_nesting) and increments a per-CPU counter (dynticks), both of which are located in the per-CPU rcu_dynticks structure. The dynticks counter must have an even value when entering dynticks-idle mode. When a CPU exits dynticks-idle mode, rcu_idle_exit() is invoked, which increments dynticks_nesting and the dynticks counter (which must now have an odd value).

Dynticks-idle mode is, however, a quiescent state for Tree RCU. The reason these two variables are needed is that they can be sampled by other CPUs, so that it can be safely determined whether a CPU is (or has been, at some point) in a quiescent state for this grace period. The sampling process is performed when a CPU has not reported a quiescent state for a long time and the grace period needs to end (see Sect. 5.3.7).
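The idle-entry/exit protocol above can be sketched as follows (field names follow the kernel's rcu_dynticks structure; the initial values and assertions are our own illustration):

```c
#include <assert.h>

struct rcu_dynticks {
    int dynticks_nesting;   /* irq/process nesting level */
    int dynticks;           /* even in dynticks-idle mode, odd otherwise */
};

/* Sketch of rcu_idle_enter(): decrement the nesting level and bump the
 * counter, which must end up even (dynticks-idle). */
void idle_enter(struct rcu_dynticks *d)
{
    d->dynticks_nesting--;
    d->dynticks++;
    assert(d->dynticks % 2 == 0);   /* idle => even */
}

/* Sketch of rcu_idle_exit(): increment both; the counter must end up odd. */
void idle_exit(struct rcu_dynticks *d)
{
    d->dynticks_nesting++;
    d->dynticks++;
    assert(d->dynticks % 2 == 1);   /* non-idle => odd */
}
```

A sampling CPU can thus conclude "this CPU is idle right now" from an even counter value alone, without interrupting it.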

Interrupts and dynticks-idle mode

When a CPU enters an interrupt handler, the function rcu_irq_enter() is invoked. This function increments the value of dynticks_nesting and, if the prior value was zero (i.e., the CPU was in dynticks-idle mode), also increments the dynticks counter. When a CPU exits an interrupt handler, rcu_irq_exit() decrements dynticks_nesting, and if the new value is zero (i.e., the CPU is entering dynticks-idle mode), also increments the dynticks counter. It is self-evident that entering an interrupt handler from dynticks-idle mode means exiting the dynticks-idle mode. Conversely, exiting an interrupt handler might mean entrance into dynticks-idle mode.
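The interrupt-side counter manipulation described above can be sketched like this (a simplification; the real kernel also handles further nesting corner cases):

```c
#include <assert.h>

struct dynticks_state {
    int dynticks_nesting;
    int dynticks;           /* even = dynticks-idle, odd = non-idle */
};

/* Sketch of rcu_irq_enter(): if the prior nesting level was zero, the CPU
 * was in dynticks-idle mode and is now leaving it. */
void irq_enter_model(struct dynticks_state *d)
{
    if (d->dynticks_nesting++ == 0)
        d->dynticks++;               /* even -> odd */
}

/* Sketch of rcu_irq_exit(): if the new nesting level is zero, the CPU is
 * re-entering dynticks-idle mode. */
void irq_exit_model(struct dynticks_state *d)
{
    if (--d->dynticks_nesting == 0)
        d->dynticks++;               /* odd -> even */
}
```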

Forcing quiescent states

If not all CPUs have reported a quiescent state and several jiffies have passed, then the grace-period kthread is awakened and will try to force quiescent states on CPUs that have yet to report one. More specifically, the grace-period kthread will invoke rcu_gp_fqs(), which works in two phases. In the first phase, snapshots of the dynticks counters of all CPUs are collected, in order to credit them with implicit quiescent states. In the second phase, CPUs that have yet to report a quiescent state are scanned again, in order to determine whether they have passed through a quiescent state from the moment their snapshots were collected. If there are still CPUs that have not checked in, they are forced into the scheduler in order for them to report a quiescent state to RCU.
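The two-phase scan can be sketched over a snapshot array (our own simplification of the rcu_gp_fqs() logic; names are illustrative):

```c
#include <assert.h>

#define NCPUS 2
static int snap[NCPUS];

/* Phase 1: snapshot a CPU's dynticks counter. An even value means the CPU
 * is idle right now, which already credits it with an implicit quiescent
 * state. */
int fqs_snapshot(int cpu, int dynticks)
{
    snap[cpu] = dynticks;
    return dynticks % 2 == 0;
}

/* Phase 2: a CPU that had not checked in has passed through a quiescent
 * state if it is idle now, or if its counter moved since the snapshot. */
int fqs_recheck(int cpu, int dynticks)
{
    return dynticks % 2 == 0 || dynticks != snap[cpu];
}
```

CPUs for which fqs_recheck() still returns false are the ones that must be forced into the scheduler.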

Kernel environment modeling

In this section, we present the way we scaffolded a non-preemptible Linux kernel symmetric multiprocessing (SMP) environment. For this, we had to disable some timing-based warnings and stub out some primitives used in functions that were not included in our tests (e.g., RCU-expediting related functions). However, we note that the only changes we made in the source code of Tree RCU involved the replacement of per-CPU variables with arrays; the rest of the source code remains untouched.

Modeling an SMP platform

Modeling CPUs Since we emulate an SMP system, we need some kind of mutual exclusion between threads running on the same CPU, for each CPU of the system. Thus, we provide an array of locks (namely cpu_lock), with each array entry corresponding to a CPU. When one of these locks is held by a thread, then this thread is running on the respective CPU.
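The cpu_lock scheme can be sketched as follows (wrapper names are ours; holding cpu_lock[c] means "this thread is currently running on CPU c"):

```c
#include <pthread.h>
#include <assert.h>

#ifndef CONFIG_NR_CPUS
#define CONFIG_NR_CPUS 2
#endif

/* One pthread mutex per modeled CPU. */
pthread_mutex_t cpu_lock[CONFIG_NR_CPUS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

void acquire_cpu(int cpu) { pthread_mutex_lock(&cpu_lock[cpu]); }
void release_cpu(int cpu) { pthread_mutex_unlock(&cpu_lock[cpu]); }
```

Mutual exclusion between threads that want the same CPU then falls out of ordinary mutex semantics, which Nidhugg handles natively.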

We assume that all CPUs are online, that there are no CPU hotplugs, and that the full dynticks system (tickless operation) is disabled (CONFIG_NO_HZ_FULL=n). All CPUs are initially idle, and when a thread wishes to acquire/release a CPU, it acquires/releases the CPU’s lock and exits/enters idle mode (if necessary).

We also need to emulate per-CPU variables. In the kernel, these variables are created using special compiler/linker directives, along with some preprocessor directives. However, since these variables require significant runtime support, we used arrays to emulate them, with each array entry representing the respective CPU’s copy of a per-CPU variable.

Lastly, since a thread needs to have knowledge regarding the CPU it runs on, we implemented two macros (set_cpu() and get_cpu()), which manipulate a thread-local variable indicating the CPU on which a thread runs. The CPU on which a thread runs has to be manually set, via set_cpu(). The total number of CPUs can be manipulated by setting the -DCONFIG_NR_CPUS preprocessor option.
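The per-CPU-variable emulation and the CPU-tracking macros can be sketched together as follows (the macro bodies are our guess at the approach described above, not the authors' exact code):

```c
#include <assert.h>

#ifndef CONFIG_NR_CPUS
#define CONFIG_NR_CPUS 2
#endif

/* A per-CPU variable becomes an array indexed by CPU number. */
int rcu_qs_pending[CONFIG_NR_CPUS];
#define per_cpu(var, cpu) ((var)[(cpu)])

/* Each thread records the CPU it runs on in a thread-local variable. */
static __thread int current_cpu = -1;
#define set_cpu(c) (current_cpu = (c))
#define get_cpu()  (current_cpu)
```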

Emulating interrupts and softirqs In order to emulate interrupts and softirqs, we used an array of locks (irq_lock), with each lock corresponding to a CPU. An entry’s lock must be held across an interrupt handler by the thread servicing the interrupt on the respective CPU. Of course, the CPU’s lock must be already held. In a similar manner, when a thread disables interrupts on a CPU, the same lock has to be acquired. Since we are dealing with non-preemptible kernels, this lock is not contended.

We also need to model scheduling-clock interrupts (on which RCU relies heavily) and the rcu_check_callbacks() function. But, as mentioned, stateless model checking is performed on deterministic programs, meaning that timing-based actions cannot be included in our tests. However, the exact time an interrupt occurs is not so important; what interests us is the implications a timing interrupt might have at a certain point of a program’s execution, given a concurrency context. Consequently, our version of the interrupt handler only invokes rcu_check_callbacks() and then, if an RCU softirq is raised, the rcu_process_callbacks() function is called. Of course, we could have just called rcu_process_callbacks() directly, but in the Linux kernel this function is not invoked unconditionally, and we wanted our model to be as precise as possible.

Scheduling The cond_resched() function is modeled by having the running thread drop the CPU’s lock and then possibly re-acquire it, with rcu_note_context_switch() being invoked before the lock is released, to mark the passing through a quiescent state.

A better way to model this function would probably have been to drop the current CPU’s lock, acquire the lock of a random CPU, and then check that no assertion is violated for every possible CPU choice. However, doing this requires support for data non-determinism, at least in the form of some suitable built-in (like, for example, the VS_toss(n) built-in that the VeriSoft tool provided). Alas, Nidhugg currently does not provide such support. This also explains why we have not modeled a preemptible kernel’s environment.

This simplification, however, does not affect the correctness of our modeling for non-preemptible builds. The threads in our tests represent the CPUs of a system and not kthreads. Indeed, since a kthread cannot be preempted within its RCU read-side critical section (and resume on a different CPU), we do not care about the specific kthread that enters/exits a critical section or services a softirq; we merely care about the CPU on which these actions are performed. In that sense, our tests are CPU-centered and not kthread-centered.

Lastly, we note that we stubbed the resched_cpu() function, since it is used in scenarios we did not include in our tests (e.g., RCU CPU stalls).

Kernel definitions

Many kernel definitions were copied directly from the Linux kernel. These include data types like u8, u16, etc., compiler directives like offsetof(), macros like ACCESS_ONCE(), list data types and functions, memory barriers, as well as various other kernel primitives.

On the other hand, many primitives had to be replaced or stubbed; we supplied empty files for some #include directives and also provided some definitions based on specific Kconfig options. These include CPU-related definitions (e.g., NR_CPUS), RCU-related definitions (e.g., CONFIG_RCU_BOOST) that are normally configured at compile time, special compiler directives, tracing functions, etc. Some debugging facilities in the code, like the BUG_ON() macro (which panics the kernel) and its relatives (e.g., WARN_ON(), which conditionally logs a message in the kernel logs) were replaced by assert() statements. Note that we only stubbed primitives irrelevant to our tests (e.g., primitives used in grace-period expediting functions) and provided our own definitions for some other primitives in order for them to work with our modeling of the CPUs and interrupts. Memory barriers are provided for a TSO (x86), a PSO-like, and a POWER configuration. However, the POWER configuration was not used in the verification of the grace-period guarantee, due to Nidhugg’s lack of support for atomic operations under POWER.

All of the definitions we used reside in separate files; these can be copied and reused across multiple kernel versions.

Synchronization mechanisms

The emulation of the Linux kernel’s synchronization mechanisms used in Tree RCU’s implementation is as follows:

Atomic operations While we copied the atomic_t data type definition directly from the Linux kernel, this is not the case for atomic operations like atomic_read() and atomic_set(), since their implementation is architecture dependent. In order to emulate those, we used GCC language extensions [31] supported by clang [32], the compiler that produces the LLVM IR code that Nidhugg analyzes.
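Such an emulation might look like the following sketch built on the GCC/clang __atomic builtins (our choice of builtins and memory orders; the paper does not list the exact ones used):

```c
#include <assert.h>

typedef struct { int counter; } atomic_t;   /* copied kernel-style layout */

int atomic_read(const atomic_t *v)
{
    return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}

void atomic_set(atomic_t *v, int i)
{
    __atomic_store_n(&v->counter, i, __ATOMIC_RELAXED);
}

int atomic_add_return(int i, atomic_t *v)
{
    /* kernel atomic_add_return() is fully ordered, hence SEQ_CST here */
    return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
}
```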

Spinlocks and mutexes Because we wanted an architecture-independent implementation that is supported by Nidhugg, we used pthread_mutexes for the emulation of kernel spinlocks and mutexes. Since many spinlocks and mutexes are initialized statically in the kernel, the Nidhugg option --disable-mutex-init-requirement is required for most tests to run.
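A sketch of this spinlock emulation (type layout and wrapper names are illustrative, not the authors' exact definitions):

```c
#include <pthread.h>
#include <assert.h>

typedef struct {
    pthread_mutex_t m;
} spinlock_t;

/* Static initialization, mirroring the kernel's DEFINE_SPINLOCK(). */
#define DEFINE_SPINLOCK(x) spinlock_t x = { PTHREAD_MUTEX_INITIALIZER }

void spin_lock(spinlock_t *l)   { pthread_mutex_lock(&l->m); }
void spin_unlock(spinlock_t *l) { pthread_mutex_unlock(&l->m); }

DEFINE_SPINLOCK(my_lock);
```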

Completions In order to emulate completion variables, we copied the data type definition directly from the Linux kernel, but we also had to model wait queues.

Since a thread waiting on a completion is put on a wait queue until some condition is satisfied, we used spin loops in order to emulate this waiting behavior. Nidhugg automatically transforms all spin loops to __VERIFIER_assume() statements where, if the condition does not hold, the execution blocks indefinitely [33]. Before waiting on a spin loop, the thread drops the corresponding CPU’s lock; it will try to re-acquire it after the condition has been satisfied. Since this is a quiescent state for RCU, the function rcu_note_context_switch() (and possibly also the do_IRQ() function, in order to report a quiescent state to RCU) could have been invoked before the thread released the CPU’s lock. However, if the thread waiting on the completion variable is not the only thread running on the specific CPU, this is unnecessary; these functions can be called from other threads running on the same CPU as well.
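The waiting behavior can be sketched as follows (names are illustrative; the spin loop is what Nidhugg transforms into a __VERIFIER_assume() statement):

```c
#include <pthread.h>
#include <assert.h>

int done;                                   /* the completion flag */
pthread_mutex_t cpu0_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sketch of waiting on a completion in the model: drop the CPU's lock,
 * spin until the completion is signalled, then re-acquire the CPU. */
void wait_for_completion_model(void)
{
    pthread_mutex_unlock(&cpu0_lock);       /* give up the CPU */
    while (!done)                           /* ~ __VERIFIER_assume(done) */
        ;                                   /* spin */
    pthread_mutex_lock(&cpu0_lock);         /* re-acquire the CPU */
}

void complete_model(void)
{
    done = 1;
}
```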

Verifying the publish–subscribe guarantee

Now that we have explained how the kernel’s environment can be modeled, let us first discuss the verification of the publish–subscribe guarantee. We tackled only a part of this guarantee, namely the part involving rcu_assign_pointer().

Since the primitives relevant for this guarantee are part of RCU’s API and are used by different RCU flavors, their implementation does not change as often as the core of RCU. We thus verified the guarantee only for Linux kernel v3.19.

The publish–subscribe guarantee can be undermined by both the compiler and the CPU. Since Nidhugg cannot detect compiler-induced errors, we verified the guarantees provided by rcu_assign_pointer() only as far as the CPU is concerned. To do so, we used the model we constructed in Sect. 6.

The litmus test we used is based on the test of Fig. 2 and involves only a subscriber thread and a publisher thread. The test is configured to run under SC, TSO (x86), and POWER, depending on the preprocessing options used. The actual definition of rcu_assign_pointer() is copied from the kernel for the respective architecture.

To generate a buggy test, we pass to Nidhugg the -DORDERING_BUG preprocessor option, which replaces rcu_assign_pointer() with a plain assignment. Under a memory model that reorders stores (e.g., POWER), this can lead to bugs. Thus, using a simple BUG_ON() statement, we were able to check whether the subscriber can see uninitialized values.
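A minimal publisher/subscriber pair in the spirit of this litmus test might look as follows (ours, not the paper's exact code; we model rcu_assign_pointer() as a release store, which is essentially what it is in recent kernels, and the ORDERING_BUG variant as a plain store):

```c
#include <stddef.h>
#include <assert.h>

struct obj { int a; };
struct obj o;
struct obj *gp;                       /* the published global pointer */

void publisher(void)
{
    o.a = 42;                         /* initialize first ... */
#ifdef ORDERING_BUG
    gp = &o;                          /* plain store: stores may reorder */
#else
    __atomic_store_n(&gp, &o, __ATOMIC_RELEASE);   /* ... then publish */
#endif
}

int subscriber(void)
{
    struct obj *p = __atomic_load_n(&gp, __ATOMIC_ACQUIRE);
    return p ? p->a : -1;             /* must never see an uninitialized a */
}
```

Under a store-reordering model, the ORDERING_BUG variant allows the subscriber to observe the pointer before the initialization of o.a, which is exactly the failure the BUG_ON() check detects.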

The results we got are not surprising. In the case of POWER, the rcu_assign_pointer() primitive was absolutely necessary, whereas it was not required under SC or TSO (at least from a hardware perspective) because store-store reordering is not allowed in these memory models. This is why rcu_assign_pointer() boils down to a plain compiler barrier in the case of x86 in the kernel. The respective tests finished in just 0.08 s (compilation and transformation time included) for all configurations.

As a final comment, note that we have verified (from a hardware perspective) that rcu_assign_pointer() has a correct implementation in the Linux kernel for TSO and POWER. To fully verify the publish–subscribe guarantee (again from a hardware perspective), we would also need to verify rcu_dereference()’s implementation, but that would require support from Nidhugg for an architecture that allows dependent-load reordering (e.g., DEC Alpha).

Verifying the grace-period guarantee

Next, we will verify the grace-period guarantee of Tree RCU for a non-preemptible Linux kernel environment, using the model we created in Sect. 6. We have applied this model to five different Linux kernels (v3.0, v3.19, v4.3, v4.7, and v4.9.6) and were able to verify that the actual RCU code satisfies the GP guarantee under SC, TSO, and PSO, using a litmus test similar to that of Fig. 1.

All experiments have been run on a 64-bit desktop with an Intel Core i7-3770 processor @ 3.40 GHz and 16 GB of RAM running Arch Linux 4.10.13-1-ARCH. We used Nidhugg’s --unroll option with an appropriate value in order to put a bound on loops (e.g., server loops in RCU’s grace-period kthread) that are unbounded.

Test configuration

Let us first briefly discuss our modeling of the Linux kernel. All our experiments focused on the RCU-sched flavor of Tree RCU (see Sect. 5).

First of all, we model a system with two CPUs, represented by two mutexes. We also have three basic threads: the updater, the reader, and the RCU grace-period kthread. The RCU-bh grace-period kthread is disabled in order to reduce the state space, but it can be re-enabled by setting the -DENABLE_RCU_BH preprocessor option. We assume that the updater and the RCU grace-period kthread run on the same CPU (e.g., CPU0), and that the reader runs on the other CPU (e.g., CPU1). For RCU initialization, the rcu_init() function is called. Since there are only two CPUs in our modeling, a single-node hierarchy is created. All CPUs start out idle (rcu_idle_enter() is called for each CPU), and rcu_spawn_gp_kthread() is called in order to spawn the RCU grace-period kthread.

Of course, interrupt context needs to be emulated as well. In general, even though we do not care about the exact timing of interrupts, it is the occurrence of an interrupt within a specific context that causes a grace period to advance. Thus, we have sprinkled calls to do_IRQ() at various points of the test code, which enable the advancement of a grace period. A grace period may still fail to end in some explored executions, but in fact we want both of these scenarios to be possible.

Table 1 Results for Tree RCU litmus test on five Linux kernel versions (time in seconds) under SC
Table 2 Results for Tree RCU litmus test on five Linux kernel versions (time in seconds) under TSO
Table 3 Results for Tree RCU litmus test on five Linux kernel versions (time in seconds) under PSO

Test runs

After running the tests with Nidhugg, the tool reports that the verification procedure is successful for all five kernel versions. Moreover, the process is quite fast. As shown in the first row of Table 1, the verification of the GP guarantee under SC requires less than 18 min (for kernels v4.3 and v4.9.6) and less than 10 min for each of the three other kernels. Another set of runs, shown in Table 2, verifying this guarantee under the TSO memory model does not require considerably more time. In contrast, verification of the GP guarantee under Nidhugg’s PSO model takes considerably longer (up to 2.5 h for kernels v4.3 and v4.9.6; cf. Table 3), for reasons we will explain later. Still, for all five kernel versions, Nidhugg tells us that there is no possible thread interleaving or memory-model-induced reordering that violates the GP guarantee in Tree RCU’s implementation.

But, can we really trust these results? After all, there might be a bug in our scaffolding of the Linux kernel’s environment, or there might be a bug in Nidhugg itself. In order to increase our confidence, we injected a number of bugs similar to ones that have occurred in real systems in production over the years. These bugs were added both in the test and the RCU source code. More specifically, we injected two kinds of bugs:

  1. Bugs that make the grace period too short, thus permitting an RCU read-side critical section to span the grace period.

  2. Bugs that prevent the grace period from ending.

Both kinds of bug injections represent RCU failures. Injections of the first kind result in a test failure, since the GP guarantee is violated. Injections of the second kind have to be used with an assert(0) statement after synchronize_rcu(). If this assertion does not trigger for any execution of the litmus test, then the grace period does not end for any execution, which in turn signifies that a successful—as opposed to a failed—completion of the test is a liveness violation.

Fig. 5: Description of the bug injections we used, identified by the preprocessor option that enables them

Figure 5 contains information about the bug injections keyed by the define macro that enables each test. Note that, for the FORCE_FAILURE_6 test, since a multi-level hierarchy for a system with more than 16 CPUs is created, an --unroll=19 option has been used and CONFIG_NR_CPUS has been set appropriately, while for all other tests an --unroll=5 option has been used. All tests had the desired outcome, something that increases our confidence in our modeling and the verification result for the GP guarantee of Tree RCU’s implementation that we report.

Results and discussion

In Tables 1, 2, and 3, the “Time” columns represent the total wall-clock time in seconds (compilation and transformation time are included). As can be seen in Tables 1 and 2, the number of traces explored under SC and TSO is exactly the same and there is very little overhead when going from SC to TSO, which shows the power of stateless model checking with source-DPOR [21] and chronological traces [7]. The reason for the same total number of explored executions is twofold. First, RCU’s source code does not have many opportunities for store-load reorderings. Second, it contains a lot of memory fences which prevent these reorderings from happening. But we also note that even if such reorderings were possible, all bug injections here do not rely on the employed memory model; instead, they violate the assertions algorithmically. As expected, since the model checking is stateless, the memory requirements are very low, especially considering the size of the source code under test: ~35 MB for SC and TSO and ~105 MB for PSO. In the case of FORCE_FAILURE_6, approximately three times more memory is required for all memory models, due to the higher unroll value.

The most interesting row in all tables is the first one, shown in bold. In all these cases, Nidhugg needs to explore the complete set of traces in order to verify that the GP guarantee indeed holds for Tree RCU’s implementation. In rows with failure injections, exploration stops as soon as the failure is detected. How fast this happens depends on the order in which traces are explored. In some cases failures are detected immediately (in the first few traces and in less than 3 s) and in other cases only after many traces have been explored.

It can be observed that the number of explored traces varies between different kernel versions. With the exception of PSO, fewer traces are explored in kernel v3.0 than in v3.19 and v4.3, due to the absence of the grace-period kthread in the former; this thread contains infinite loops which generate many races that Nidhugg tries to reverse. Note that in v3.0 the -DFORCE_FAILURE_3 and -DFORCE_FAILURE_5 injections are liveness checks due to the absence of the grace-period kthread. In kernel v4.7 the number of explored traces decreases dramatically due to the replacement of rcu_gp_kthread_wake() with swake_up() in rcu_report_qs_rsp(). The former performed a check which read a variable that was written by the grace-period kthread (among others) and generated far too many races. In kernel v4.9.6, however, this change was reverted and the explored traces are the same as those of kernel v4.3. Overall, it is clear that irrespective of the kernel’s growth in size, Nidhugg provides an efficient and scalable way to test such a big code base, since the number of traces that need to be explored depends only on races on shared variables and not on the general complexity or size of the source code.

We also conducted some tests for a PSO-like architecture, shown in Table 3. We say “PSO-like” for two reasons: (a) because the Linux kernel does not support any PSO architectures (so we had to provide architecture-specific definitions ourselves), and (b) because while Nidhugg does support a PSO memory model, this model is slightly stronger than the one used by real PSO CPUs. More specifically, Nidhugg does not provide a store-store fence (which would have similar semantics to the Linux kernel’s smp_wmb()), thus forcing us to model the latter as a full memory barrier, and also forbids the reordering of atomic instructions with stores (in contrast to SPARC-PSO CPUs), rendering our model stronger than required. Nevertheless, all the tests had the expected outcome, and the results are consistent with the ones for SC and TSO. However, Nidhugg requires significantly more time (up to nine times) when operating under PSO, despite the fact that the number of traces increases by a smaller factor (1.5–4). The reason for this is that, in order to emulate the PSO memory model, Nidhugg maintains one store buffer per global memory location used in the program, which imposes a constant, yet non-negligible overhead.

Presenting the cause of an older kernel bug

In Sect. 5.3.4, we mentioned that the grace-period kthread cleans up after a grace period ends. However, in older kernel versions, the RCU grace-period kthread did not exist; when a CPU entered the RCU core or invoked call_rcu(), it checked for grace-period ends by directly comparing the number of the last completed grace period in the rcu_state structure with the number of the last completed grace period in the respective rcu_data structure. In newer kernels, the note_gp_changes() function compares the number of the last completed grace period in the respective rcu_node structure with the number of the last completed grace period in the current rcu_data structure, while holding the node’s lock, thereby excluding concurrent operations on this node.

In kernel v2.6.32, commit d09b62dfa336 fixed a synchronization issue exposed by unsynchronized accesses to the ->completed counter in the rcu_state structure [34, 35], which caused the advancement of callbacks whose grace period had not yet expired. Below we will create a test case that exposes such a scenario, but this test case will also demonstrate that the problem is actually deeper: these unsynchronized accesses also lead to too-short grace periods.

Fig. 6: Snippet of the rcu_process_gp_end() function

Fig. 7: Sequence of events resulting in an RCU tree bug

To construct a test case that exposes this issue, we started by looking at the rcu_process_gp_end() function, since the issue was related to it. Figure 6 shows a relevant portion of its code. As can be seen, the access to the ->completed counter is completely unprotected. So, we injected a BUG_ON() statement in the if-body to determine whether it was possible for a thread to pick up the ->completed value into completed_snap and then use this snapshot after the ->completed variable had changed. The answer was affirmative. Our next step was to determine whether this could potentially lead to a CPU starting a new grace period without having noticed that the last grace period had ended. Again, an injection of a BUG_ON() statement, comparing the current grace period’s number with the number of the grace period whose completion was noticed by the CPU, showed that this was possible. With these clues, we constructed a simple test which proved that these unsynchronized accesses can lead to too-short grace periods. The test has a reader seeing changes happening before the beginning of a grace period and after the end of the same grace period within a single RCU read-side critical section, which, of course, is a violation of the grace-period guarantee. A sequence of events (produced by Nidhugg) which exposes this bug is shown in Fig. 7. Observe that these events form a quite involved thread interleaving.

Interestingly enough, just 2 days before the patch that fixed this bug, commit 83f5b01ffbba fixed an alleged long-grace-period race between grace-period forcing and initialization [36], which was supposedly responsible for the failures observed at the time in multi-node hierarchies. However, we constructed a test case which showed that an interleaving such as the one described in the commit log is impossible, and other variations of our test case did not expose any bug either. Ultimately, it was confirmed to us (Paul E. McKenney, personal communication; also [37]) that the analysis presented in the commit log was wrong, and that these two issues were related (also see [38]): commit d09b62dfa336 serendipitously fixed the bug that commit 83f5b01ffbba was supposed to fix.

Let us end this section with some notes regarding this bug:

  • In contrast to what the commit log states [36], this bug does not rely on interactions with the node hierarchy; it existed in both single-node and multi-level hierarchies. (A slightly different test case with the respective Kconfig options set appropriately was created for multi-level hierarchies.)

  • Nidhugg reports that this bug is not present in kernel v3.0, which means that it was indeed fixed. In v3.0, rcu_start_gp() calls __rcu_process_gp_end(), thus guaranteeing that a CPU will see a grace-period ending before a grace-period beginning, something that does not happen in v2.6.32.1. However, the bug was present in previous versions as well, e.g., v2.6.31.1.

  • Only two CPUs are required to provoke the bug, and only one of them has to invoke call_rcu().

  • Only one grace period is required to provoke the bug, meaning that it does not rely on CPUs being unaware of grace-period ends and beginnings (e.g., when a CPU is in dynticks-idle mode). However, this bug does require some actions to occur during and after the ending of a grace period, meaning that a simple grace-period guarantee test would not have exposed this bug.

  • force_quiescent_state() is not required to provoke the bug, although frequent calls to this function would expose it more easily in real-life scenarios.

  • This bug is not caused by weak memory ordering; the test fails under sequential consistency as well.

  • Nidhugg produced the violating sequence of events in only 0.56 s (compilation and transformation time included) and used 30.85 MB of memory in total.

Further discussion

Threats to validity

As mentioned in Sect. 3, stateless model checking requires that test cases be data-deterministic and finite: the former requirement implies that our approach can only detect concurrency bugs, while the latter implies that our results are valid only up to the bound up to which all loops have been unrolled. In addition, Nidhugg operates at the level of LLVM-IR; it is conceivable that the Clang compiler hides some bugs in the translation from C to LLVM-IR, or that the backend compiler introduces other bugs when compiling the LLVM-IR to native code. Naturally, Nidhugg cannot detect such bugs. Finally, we note that Nidhugg itself is not verified, and a bug in its implementation could cause it to miss some bugs in RCU’s code as well.

A second set of threats to validity stems from limitations related to the testing procedure and infrastructure. Our efforts concentrated on specific kernel builds, which led to specific RCU configurations, and our verification results hold only for these particular configurations. As a side note, we mention that some of these restrictions were partly imposed by Nidhugg and its limitations. For example, we only handle non-preemptible kernels because Nidhugg does not support data non-determinism. (Refer to the “Scheduling” paragraph of Sect. 6 for more explanation of this limitation.)

As far as the grace-period guarantee is concerned, we focused on 2-CPU systems, single-node hierarchies (which occur when NR_CPUS < 16), non-preemptible kernels, and single-grace-period test cases. Although our model allows for other configurations (i.e., it is tunable), and we did try other configurations as part of our tests (see Sects. 9 and 10.2), our primary testing infrastructure was the one described in Sect. 8. Having said that, we note that the code paths activated in a single-node hierarchy are the same no matter the exact number of CPUs (i.e., there is no conditional execution depending on the value of NR_CPUS).

As far as the publish–subscribe guarantee is concerned, we only verified a part of it. A detailed description is provided in Sect. 7.

Finally, the last set of threats to validity is due to limitations related to the modeling of the kernel’s environment. Apart from errors that might exist in our modeling, the way some primitives were modeled could have affected the outcome of the verification procedure. One such case is the modeling of scheduling-clock interrupts and the do_IRQ() function. Although we tried different modelings for this function and chose the one described in Sect. 6 for efficiency reasons (see Sect. 10.2), it could be argued that this renders our model less precise. That is true; however, the needs of each verification scenario can be met given enough computational resources, a different modeling of other components, or intervention in the kernel’s code.

Lessons learned

The verification process itself was very educational. We gained useful experience regarding the construction of the Linux kernel model, the model itself, and how to deal with the combinatorial explosion in the number of interleavings that a stateless model checking tool needs to explore.

Arguably, the most valuable lesson learned was the way a Linux kernel model can be constructed. Initially, the way an SMP system should be emulated was not obvious, and the construction of the model had to be precise. Both of these posed non-trivial challenges; with the kernel occupying more than 15 MLOC, isolating and testing only the ingredients we cared about was of extreme importance. Still, we managed to use the source code from the various kernel versions directly, and the constructed model is reusable, meaning that it can be used for further RCU testing, and perhaps for testing other components of the Linux kernel as well.

Of course, confining the state space was no easy task either. First of all, as far as the model is concerned, the most important design decision we had to make was how interrupts are modeled. Initially, we tried to emulate interrupts with per-CPU threads invoking the interrupt handler repeatedly, but unfortunately this approach rendered the state space extremely large. Apart from this, plenty of other design choices were made, and most of them are described in Sect. 6. As far as the verification of Tree RCU is concerned, multiple different configurations were tried and did not affect the outcome. We chose the one mentioned in Sect. 8.1 because the state space was considerably smaller. The reason for that, although not obvious from the beginning, is that the updater and the grace-period kthread are mutually exclusive and take advantage of each other’s context switches. In addition, we could have ignored the RCU grace-period kthread and invoked rcu_gp_init() and rcu_gp_cleanup() appropriately, in order to further reduce the state space. However, we wanted our model to be as precise as possible, so we did not resort to such approximations.

Related work

Previous work on RCU verification includes the expression of RCU’s formal semantics in terms of separation logic [39] and the verification of user-space RCU in a logic for weak memory [40]. A virtual architecture to model out-of-order memory accesses and instruction scheduling has been proposed [41], and a verification of user-space RCU has been done using the SPIN model checker [42]. Alglave et al. verified that RCU’s actual kernel code preserves data consistency of the object it is protecting [43] using CBMC [44], i.e., they verified the combination of rcu_assign_pointer() and rcu_dereference(). Subsequently, McKenney [45] verified the grace-period guarantee for Tiny RCU. Finally, mutation testing strategies have been applied to RCU’s code [46] as well.

Concurrently with our work, Liang et al. used CBMC to verify the grace-period guarantee for Tree RCU [47]. However, compared to the work presented here, their approach has some limitations.

First of all, due to CBMC’s limited support for lists, their modeling does not include callback handling, and this has some implications for the verification procedure. The most basic one is that bugs in the callback handling mechanism (e.g., a bug similar to the one we reproduced in Sect. 9) cannot be exposed. Considering the fact that RCU’s update-side primitives are based on callback handling, this limitation is a serious one. For example, primitives like call_rcu() were not included in their tests, and synchronize_rcu()’s implementation (which, in reality, is based on call_rcu()) had to be emulated. This in turn means that only the underlying grace-period mechanism was modeled, and not the callback mechanism that mediates between that mechanism and synchronize_rcu().

A second limitation is that the grace-period kthread was not included in their tests. Although the grace-period kthread did not exist in older kernel versions, excluding it from tests of newer Linux kernels alters the kernel’s operation. In addition, this thread’s exclusion means that the way a grace period starts and ends also needs to be changed, since the grace-period kthread plays a crucial role in these operations.

Finally, the approach of Liang et al. does not include the emulation of dynticks-idle mode. In our approach, the dynticks-idle mode is indeed modeled, and our results show that the basic properties of the dyntick counters do hold.

Despite the simpler modeling and these limitations, Liang et al. [47] report that CBMC needs more than 11 h and 34 GB of memory in order to claim successful verification for Tree RCU in kernel v4.3 under TSO. In contrast, as we report in this article, Nidhugg only needs 19 min and 102 MB of memory for the same task. More generally, our results are orders of magnitude better, which we attribute to the different algorithms that the two tools employ.

On the other hand, CBMC’s underlying algorithm in principle also handles data non-determinism, something that stateless model checking tools in general, and Nidhugg in particular, do not consider. Still, we do not see how data non-determinism plays any role in the verification of the grace-period guarantee of Tree RCU for the configurations we tested. Supporting evidence for this claim is the fact that the bug injections we listed in Sect. 8 are a proper superset of those identified by CBMC (see footnote 8). Furthermore, because our approach does include callback handling, we were able to reproduce an older, real kernel bug that was caused by premature callback advancements, which could potentially lead to too-short grace periods that violate the GP guarantee. As explained, this bug cannot be reproduced with CBMC, due to its limited support for lists.

Concluding remarks

In this article, we described a way to construct a test suite for the systematic concurrency testing of Linux kernel’s RCU mechanism. For this, we emulated a non-preemptible Linux kernel SMP environment and, using the stateless model checking tool Nidhugg, we managed to verify both a part of RCU’s API (publish–subscribe guarantee) which is used by different RCU flavors, and the grace-period guarantee, the most basic guarantee that RCU offers for the main implementation used in the Linux kernel, namely, Tree RCU.

More specifically, we verified the grace-period guarantee for five different Linux kernel versions, under both sequentially consistent and weak (TSO and PSO) memory models. For all our tests, we used the source code from the Linux kernel directly, with only a handful of changes, which can be and have been scripted.

To show that our emulation of the kernel’s environment is sound and to further strengthen our results, we injected RCU failures in our tests, inspired by real bugs that occurred throughout RCU’s deployment in production, and Nidhugg was able to identify them all. Moreover, we demonstrated that a patch that applied a well-defined locking design to a variable in an older kernel [35] resolved a much more complex issue that was in effect a concurrency bug. We identified and reproduced this bug, providing the exact circumstances under which it occurred. In addition, we tested whether the bug exists in later kernel versions; the answer was negative.

Our work demonstrates that stateless model checking tools like Nidhugg have matured to the point that they can be used to test real code from today’s production systems with large code bases. The low time and memory consumption of our tests, especially considering the size and the dynamic nature of the code base tested, underlines the strength of our approach. All the above, along with the fact that our model of the kernel’s environment was reused across different kernel versions, show that stateless model checking tools can be integrated into the Linux kernel’s regression testing, and that they can produce useful results.

Still, we are not yet at a point where we can claim with certainty that the complete implementation of Tree RCU is bug-free; there may be bugs in components of Tree RCU that are not included in our modeling and our tests. In addition, although the GP guarantee is the most significant correctness property of RCU, there are many other requirements that RCU must meet. Thus, our work could be extended to include more aspects of RCU and test them under different architectural memory models (e.g., POWER) or, naturally, the recently proposed Linux kernel memory model [48]. In addition, we could construct tests that include quiescent-state forcing, grace-period expediting, and CPU hotplugs. The same applies to the full dynticks mode, which was fully merged in the kernel only relatively recently. Last but not least, the scalability of our results renders the construction of test cases and techniques aiming at the thorough testing of the preemptible Tree RCU extremely interesting as well.


  1. Requirements for RCU appear at:

  2. There also used to be a version of Tiny RCU for preemptible kernels [25].

  3. Disregarding some deadlock checks that are performed.

  4. The Linux kernel’s dynticks-idle mode [26] (a.k.a. NO_HZ) is a mode in which a CPU is idle and scheduling-clock ticks are turned off, in order to promote energy saving.

  5. ksoftirq/n kthreads are special per-CPU kernel threads that run when the machine is under heavy softirq load.

  6. With the exception of the Pentium Pro family of processors.

  7. SPARC CPUs are used in TSO mode.

  8. Injections -DFORCE_FAILURE_1 and -DFORCE_FAILURE_4 are not considered by Liang et al. [47]; the latter due to not modeling the dynticks-idle mode.


  1. Callaham, J.: Google says there are now 1.4 billion active Android devices worldwide (2015).

  2. Prakash, A.: Linux now runs on all of the top 500 supercomputers (2017).

  3. Weinberger, M.: For the first time ever, Microsoft will distribute its own version of Linux (2018).

  4. McKenney, P.E., Slingwine, J.D.: Read–copy update: using execution history to solve concurrency problems. In: Parallel and Distributed Computing and Systems, pp. 509–518. Las Vegas, NV (1998).

  5. McKenney, P.E., Walpole, J.: What is RCU, fundamentally? (2007).

  6. RCU torture test operation (2017).

  7. Abdulla, P.A., Aronis, S., Atig, M.F., Jonsson, B., Leonardsson, C., Sagonas, K.: Stateless model checking for TSO and PSO. Acta Inf. 54(8), 789–818 (2017).

  8. Kokologiannakis, M., Sagonas, K.: Stateless model checking of the Linux kernel’s hierarchical read–copy–update (Tree RCU). In: Proceedings of International SPIN Symposium on Model Checking of Software, SPIN 2017. ACM, New York (2017).

  9. Sparse—a semantic parser for C.

  10. Godefroid, P.: Model checking for programming languages using VeriSoft. In: Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 147–186. ACM, New York (1997).

  11. Godefroid, P.: Software model checking: the VeriSoft approach. Form. Methods Syst. Des. 26(2), 77–101 (2005).

  12. Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P.A., Neamtiu, I.: Finding and reproducing heisenbugs in concurrent programs. In: Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, pp. 267–280. USENIX Association, Berkeley (2008).

  13. Christakis, M., Gotovos, A., Sagonas, K.: Systematic testing for detecting concurrency errors in Erlang programs. In: 6th IEEE International Conference on Software Testing, Verification and Validation (ICST 2013), pp. 154–163. IEEE Computer Society, Los Angeles (2013).

  14. Zhang, N., Kusano, M., Wang, C.: Dynamic partial order reduction for relaxed memory models. In: PLDI 2015, pp. 250–259. ACM, New York (2015).

  15. Norris, B., Demsky, B.: A practical approach for model checking C/C++11 code. ACM Trans. Program. Lang. Syst. 38(3), 10:1–10:51 (2016).

  16. Kokologiannakis, M., Lahav, O., Sagonas, K., Vafeiadis, V.: Effective stateless model checking for C/C++ concurrency. PACMPL 2(POPL), 17:1–17:32 (2018).

  17. Valmari, A.: Stubborn sets for reduced state space generation. In: Proceedings of the 10th International Conference on Applications and Theory of Petri Nets: Advances in Petri Nets 1990, pp. 491–515. Springer, London (1991).

  18. Peled, D.: All from one, one for all: On model checking using representatives. In: Proceedings of the 5th International Conference on Computer Aided Verification, LNCS, pp. 409–423. Springer, London (1993).

  19. Godefroid, P.: Partial-order methods for the verification of concurrent systems: an approach to the state-explosion problem. Ph.D. Thesis, University of Liège (1996). Also, volume 1032 of LNCS, Springer

  20. Flanagan, C., Godefroid, P.: Dynamic partial-order reduction for model checking software. In: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 110–121. ACM, New York (2005).

  21. Abdulla, P., Aronis, S., Jonsson, B., Sagonas, K.: Optimal dynamic partial order reduction. In: Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 373–384. ACM, New York (2014).

  22. Abdulla, P.A., Aronis, S., Jonsson, B., Sagonas, K.: Source sets: a foundation for optimal dynamic partial order reduction. J. ACM 64(4), 25:1–25:49 (2017).

  23. Aronis, S., Jonsson, B., Lång, M., Sagonas, K.: Optimal dynamic partial order reduction with observers. In: Tools and Algorithms for the Construction and Analysis of Systems—24th International Conference, LNCS, vol. 10806, pp. 229–248. Springer, Cham (2018).

  24. McKenney, P.E.: RCU: The Bloatwatch edition (2009).

  25. McKenney, P.E.: rcu: Remove TINY_PREEMPT_RCU (2013).

  26. NO_HZ: Reducing Scheduling-clock Ticks (2017).

  27. What is RCU?—Linux kernel documentation.

  28. McKenney, P.E.: Hierarchical RCU (2008).

  29. CPU Hotplug in the Kernel (2016).

  30. RCU Linux kernel documentation.

  31. Built-in Functions for Memory Model Aware Atomic Operations.

  32. LLVM Atomic Instructions and Concurrency Guide (2017).

  33. Beyer, D.: Rules for 4th international competition on software verification (2015).

  34. rcu: Clean up locking for ->completed and ->gpnum fields (2009).

  35. rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter (2009).

  36. rcu: Fix long-grace-period race between forcing and initialization (2009).

  37. McKenney, P.E.: Verification challenge 6: Linux-kernel tree RCU (2017).

  38. McKenney, P.E.: Hunting heisenbugs (2009).

  39. Gotsman, A., Rinetzky, N., Yang, H.: Verifying concurrent memory reclamation algorithms with Grace. In: Programming Languages and Systems, LNCS, vol. 7792, pp. 249–269. Springer, Berlin (2013).

  40. Tassarotti, J., Dreyer, D., Vafeiadis, V.: Verifying read–copy–update in a logic for weak memory. In: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 110–120. ACM, New York (2015).

  41. Desnoyers, M., McKenney, P.E., Dagenais, M.R.: Multi-core systems modeling for formal verification of parallel algorithms. SIGOPS Oper. Syst. Rev. 47(2), 51–65 (2013).

  42. Desnoyers, M., McKenney, P.E., Stern, A.S., Dagenais, M.R., Walpole, J.: User-level implementations of read–copy update. IEEE Trans. Parallel Distrib. Syst. 23(2), 375–382 (2012).

  43. Alglave, J., Kroening, D., Tautschnig, M.: Partial orders for efficient bounded model checking of concurrent software. In: Computer Aided Verification, LNCS, vol. 8044, pp. 141–157. Springer, Berlin (2013).

  44. Clarke, E.M., Kroening, D., Lerda, F.: A tool for checking ANSI-C programs. In: Jensen, K., Podelski, A. (eds.) Tools and Algorithms for the Construction and Analysis of Systems, LNCS, vol. 2988, pp. 168–176. Springer, Berlin (2004).

  45. McKenney, P.E.: Verification challenge 4: Tiny RCU (2015).

  46. Ahmed, I., Groce, A., Jensen, C., McKenney, P.E.: How verified is my code? falsification-driven verification. In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, pp. 737–748. IEEE Computer Society, Washington, DC (2015).

  47. Liang, L., McKenney, P.E., Kroening, D., Melham, T.: Verification of the tree-based hierarchical read–copy update in the Linux kernel (2016).

  48. Alglave, J., Maranget, L., McKenney, P.E., Parri, A., Stern, A.S.: Frightening small children and disconcerting grown-ups: concurrency in the Linux kernel. In: Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, pp. 405–418. ACM, New York (2018).


Open access funding provided by Max Planck Society. We are much obliged to Paul E. McKenney for all his help, advice, and suggestions throughout this effort. His profound insight into RCU was extremely helpful in numerous occasions. Also, this article would not have been possible without Nidhugg; we thank its main developer, Carl Leonardsson, for the hard work he has put into it. Finally, we thank the anonymous reviewers of our SPIN 2017 paper and of this article for their helpful comments and feedback.

Author information

Corresponding author

Correspondence to Michalis Kokologiannakis.

Ethics declarations

Data availability

The Nidhugg tool is available at The code we used for our experiments and scripts to reproduce the results we report are available at

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Kokologiannakis, M., Sagonas, K. Stateless model checking of the Linux kernel’s read–copy update (RCU). Int J Softw Tools Technol Transfer 21, 287–306 (2019).


  • Software model checking
  • Linux kernel
  • Read–copy update
  • Nidhugg