
Tuning Recovery




Every system is prone to failure, be it natural, mechanical, or electronic; this applies to the human body, automobiles, computer hardware, elevators, application servers, applications, database servers, databases, and network connectivity. Based on how critical the item is and how heavily it is used every day, these types of failures call for an alternative way to provide the required service and/or a method to keep the system up and functioning. For example, the human body can fail due to sickness, which can be simple like a fever or complex like a heart attack. The immediate need in this situation is to visit a doctor and get treated; treatment helps control the situation and gets the body functioning again. An automobile can fail, perhaps due to something as simple as a flat tire. A backup option in this case is a spare tire and some essential tools to replace the flat. In some unavoidable conditions, an alternative method of transportation has to be used, for example, a bus or taxi. Electronic devices such as computer hardware are also prone to failures; this hardware comes in many forms that together make up the enterprise configuration. Normally, protection against hardware failures is achieved by providing redundancy at all tiers of the configuration. This helps because when one component fails, another takes over and operations continue.

On the database side, the storage system that physically stores the data needs to be protected. An example is mirroring the disk, where the data is copied to another disk to provide safety and failover when a disk in the array fails. This will provide the required redundancy against disk failures.

What happens when a privileged user accidentally deletes rows from a table in a production database? What happens when this damage is only noticed a few days after the accident occurred? What happens when lightning hits the production center and the electric grid, causing a short circuit that damages the entire storage subsystem? In all these situations, an alternative method over and beyond the redundant hardware architecture is required to get to the root of the problem and resolve it, namely, a process to retrieve and recover the lost data.

The answer is that a copy of the data needs to be saved regularly to another media and stored in a remote location. Such a method of data storage will protect the enterprise from losing its valuable data. The method of copying data from a live system for storage in a remote location is called a backup process.

Backing up the database and related datafiles is not sufficient by itself; when issues arise, the DBA should be able to restore and recover the database quickly and easily. As database sizes grow larger and larger, simple backup techniques or the media used to store them may not be sufficient to meet the SLA requirements of the business. Recovery of a database should be efficient and optimized for performance to make the environment highly available. After all, if recovery were never a concern and databases were always secure from data loss, there would be no need to back up the data in the first place. The end goal, therefore, is to ensure recovery of the database.

In a RAC environment, multiple instances provide access to data, giving availability to the environment. However, servers or instances in a RAC environment are also prone to failures; and recovery of instances is critical to make changes made by users available to other instances in the cluster.

In a RAC environment, there are primarily two types of recovery scenarios: instance recovery and media recovery. When all instances in a RAC environment crash, the underlying mechanism is still instance-level recovery, but the terminology used is crash recovery.

Instance Recovery

Instance recovery recovers the database when an instance crashes in the middle of user activity. Unlike in a traditional single-instance database scenario, recovery of an instance in a RAC environment is dynamic and happens while the database is up and active. It is probably the most important aspect of recovery that applies to RAC. The idea of having multiple nodes in a clustered configuration is to provide availability, with the assumption that if one or more instances in the cluster were to fail, the remaining instances would provide business continuity. For this reason, instance recovery becomes even more critical.

One of the primary requirements of a RAC configuration is to have the redo logs of all instances participating in the cluster on the shared storage. The primary reason for such a requirement is to provide visibility of the redo logs of any instance in the cluster to all other instances. This allows for any instance in the cluster to perform an instance recovery operation during an instance failure.
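As a quick, hedged illustration (standard V$LOG and V$LOGFILE views; the member paths returned would of course be site specific), the redo log members of every thread can be listed to confirm they reside on shared storage such as an ASM diskgroup:

SELECT l.thread#,
       l.group#,
       lf.member
FROM   v$log l,
       v$logfile lf
WHERE  l.group# = lf.group#
ORDER  BY l.thread#, l.group#;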

Instance failure can happen in several ways; the most common reason for an instance failure is the failure of the node itself, which could be due to several causes, including a power surge, operator error, and so forth. An instance can also fail because a certain background process fails or dies, or when the instance encounters a kernel-level exception, causing an ORA-00600 or ORA-07445 error. Issuing a SHUTDOWN ABORT command also causes an instance failure.

Instance failures could be of different kinds:
  • The instance is totally down and the users do not have any access to the instance.

  • The instance is up; however, when connecting to it, there is a hang situation or the user gets no response.

When an instance is not available, users can continue accessing the database via one of the other surviving instances in an active-active configuration, provided the failover option has been enabled in the application.

Recovery from an instance failure is performed by another instance in the cluster configuration that is up and running, whose heartbeat mechanism detected the failure first and informed the LMON process on its node. The LMON process on each cluster node communicates with the CM on the respective node and exposes that information to the respective instances.

LMON provides the monitoring function by continually sending messages from the node on which it runs and often by writing to the shared disk. When the node fails to perform these functions, the other nodes consider that node as no longer a member of the cluster. Such a failure causes a change in a node’s membership status within the cluster.

A surviving instance then controls the recovery of the failed instance by taking over its redo thread (log files); the SMON process on that instance performs the instance recovery.

How Does Oracle Know That Recovery Is Required for a Given Data File?

The system change number (SCN) is a logical clock inside the database kernel that increments with each and every change made to the database. The SCN describes a “version” or a committed version of the database. When a database performs a checkpoint operation, an SCN (called the checkpoint SCN) is written to the data file headers. This is called the start SCN. There is also an SCN value in the control file for every data file, which is called the stop SCN. There is another data structure called the checkpoint counter in each data file header and also in the control file for each data file entry. The checkpoint counter increments every time a checkpoint happens on a data file and the start SCN value is updated. When a data file is in hot backup mode, the checkpoint information in the file header is frozen; but the checkpoint counter still gets updated.

When the database is shut down gracefully, with the SHUTDOWN NORMAL or SHUTDOWN IMMEDIATE command, Oracle performs a checkpoint and copies the start SCN value of each data file to its corresponding stop SCN value in the control file before the actual shutdown of the database.

When the database is started, Oracle performs two checks (among other consistency checks):
  • To see if the start SCN value in every data file header matches with its corresponding stop SCN value in the control file.

  • To see if the checkpoint counter values match.

If both these checks are successful, then Oracle determines that recovery is not required for that data file. These two checks are done for all data files that are online.

If the start SCN of a specific data file doesn't match its stop SCN value in the control file, then at least instance recovery is required for that data file. This can happen when the database is shut down with the SHUTDOWN ABORT statement or when the instance crashes. Oracle then performs a check on the data files by comparing the checkpoint counters. If the checkpoint counter check fails, then Oracle knows that the data file has been replaced with a backup copy (while the instance was down) and, therefore, media recovery is required.
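As a hedged sketch of how these SCNs can be observed (the columns are from the standard V$DATAFILE and V$DATAFILE_HEADER views), the checkpoint SCN recorded in the control file, the stop SCN, and the checkpoint SCN in each data file header can be compared as follows; while the database is open, the stop SCN (LAST_CHANGE#) is normally NULL:

SELECT d.file#,
       d.checkpoint_change# ctlfile_ckpt_scn,
       d.last_change#       stop_scn,
       h.checkpoint_change# file_hdr_ckpt_scn
FROM   v$datafile d,
       v$datafile_header h
WHERE  d.file# = h.file#;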

Note

Instance recovery is performed by applying the redo records in the online log files to the data files. Media recovery, however, may also require applying the archived redo log files.

The Instance Recovery Process

After one or more nodes in a cluster fail, the recovery process would consist of several steps to restore the data contained in the redo log and database buffers at the instance (cache) level and blocks at the database level:
  1. During the first phase of recovery, GES remasters the enqueues and GCS remasters its resources from the failed instance among the surviving instances.

  2. The first step in the GCS remastering process is for Oracle to assign a new incarnation number.

  3. Oracle determines how many nodes remain in the cluster. (Nodes are identified by a number starting with 0 and incremented by 1 for every additional node in the cluster.) In our example, there are three nodes remaining in the cluster.

  4. Subsequently, in an attempt to recreate the resource master of the failed instance, all GCS resource requests and write requests are temporarily suspended (the global resource directory, or GRD, is frozen).

  5. All the dead shadow processes related to the GCS are cleaned from the failed instance.

  6. After enqueues are reconfigured, one of the surviving instances can grab the instance recovery enqueue.

  7. At the same time as GCS resources are remastered, SMON determines the set of blocks that need recovery. This set is called the recovery set. As we discussed in Chapter 2, with cache fusion, an instance ships the contents of its blocks to the requesting instance without writing the blocks to disk; that is, the on-disk version of the blocks may not contain the changes made by either instance. Because of this behavior, SMON needs to merge the content of all the online redo logs of each failed instance to determine the recovery set and the order of recovery.

  8. At this stage, buffer space for recovery is allocated, and the resources that were identified in the previous reading of the redo logs are claimed as recovery resources. This is done to prevent other instances from accessing those resources.

  9. A new master node for the cluster is created (assigning a new master node is only performed if the failed node was the previous master node in the cluster). All GCS shadow processes are now traversed (redistributed among the remaining instances), GCS is removed from its frozen state, and this completes the reconfiguration process.

Note

GCS shadows to be traversed could be estimated from V$GES_ENQUEUE. For example,

SELECT COUNT(*) FROM v$ges_enqueue WHERE resource_name1 LIKE '%BL%';

would help determine the blocks to be traversed.

The following extract is from the alert log file of an Oracle Database version 11.2.0.2 recovering instance; it displays the steps that Oracle has to perform during instance recovery:

Thu Apr 10 09:00:29 2014

Reconfiguration started (old inc 4, new inc 6)

List of instances:

1 (myinst: 1)

Global Resource Directory frozen

* dead instance detected - domain 0 invalid = TRUE

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Thu Apr 10 09:00:29 2014

LMS 0: 1 GCS shadows cancelled, 1 closed, 0 Xw survived

Set master node info

Submitted all remote-enqueue requests

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

Post SMON to start 1st pass IR

Thu Apr 10 09:00:29 2014

Instance recovery: looking for dead threads

Beginning instance recovery of 1 threads

Submitted all GCS remote-cache requests

Post SMON to start 1st pass IR

Fix write in gcs resources

Reconfiguration complete

parallel recovery started with 2 processes

Started redo scan

Thu Apr 10 09:00:50 2014

Completed redo scan

read 34017 KB redo, 8416 data blocks need recovery

Thu Apr 10 09:01:25 2014

Started redo application at

Thread 2: logseq 4, block 418

Recovery of Online Redo Log: Thread 2 Group 4 Seq 4 Reading mem 0

Mem# 0: +DATA/rac/onlinelog/group_4.267.781699385

Thu Apr 10 09:01:36 2014

minact-scn: master found reconf/inst-rec before recscn scan old-inc#:6 new-inc#:6

Thu Apr 10 09:01:42 2014

Completed redo application of 25.25MB

Thu Apr 10 09:03:14 2014

LMON (ospid: 5187) waits for latch 'enqueue hash chains' for 80 secs.

Thu Apr 10 09:07:04 2014

.......

Thu Apr 10 09:13:44 2014

GES: System Load is HIGH.

GES: Current load is 31.70 and high load threshold is 10.00

Thu Apr 10 09:13:48 2014

Completed instance recovery at

Thread 2: logseq 4, block 68453, scn 1495515

3745 data blocks read, 12020 data blocks written, 34017 redo k-bytes read

Thread 2 advanced to log sequence 5 (thread recovery)

Thu Apr 10 09:14:00 2014

minact-scn: master continuing after IR

minact-scn: Master considers inst:2 dead

Thu Apr 10 09:14:20 2014

  10. During remastering of GRD resources and while the GCS resources managed on the failed instance are relocated, instance recovery is paused.

  11. Oracle performs recovery by traversing the redo logs in two passes, known as the roll forward and roll back phases of recovery. During roll forward recovery, all dirty blocks affected by committed transactions are written to the data files; during roll back recovery, all blocks updated on disk by transactions that had not been committed are rolled back.

  12. Oracle starts the database recovery process and begins cache recovery, that is, rolling forward committed transactions. This is made possible by reading the redo log files of the failed instance. Because of the shared storage subsystem, the redo log files of all instances participating in the cluster are visible to the other instances. This allows any one surviving instance (typically the one that detected the failure) to read the redo log files of the failed instance and start the recovery process.

    • Cache recovery is the first pass of reading the redo logs, performed by SMON on the active instance. The redo log files are read and applied to the active instance performing the recovery operation through parallel execution.

    • During this process, SMON merges the redo threads ordered by SCN to ensure that changes are applied in an orderly manner. It also finds the block written records (BWRs) in the redo stream and removes entries that are no longer needed for recovery because they were PIs of blocks already written to disk. SMON recovers the blocks found during this first pass and acquires the locks needed for this operation. The final product of the first-pass log read is a recovery set that contains only blocks modified by the failed instance, with no subsequent BWR to indicate that the blocks were later written. The recovering SMON process then informs the master node of each lock element in the recovery set that it will be taking ownership of the block and lock for recovery. Other instances cannot acquire these locks until the recovery operation is completed. At this point, blocks that are not part of the recovery set become accessible again, so the database is partially available.

  13. While cache recovery takes place, work occurs at a slower pace. Once this stage of the recovery operation is complete, the database is considered fully available; all data is now accessible, including the data that resided on the failed instance.

  14. After completion of the cache recovery process, Oracle starts the transaction recovery operation, that is, rolling back all uncommitted transactions.

    • Compared to cache recovery, where the recovery is of a forward nature (rolling forward transactions from the redo logs), transaction recovery handles uncommitted transactions; hence, the operation is to roll back all uncommitted transactions of the failed instance. In addition, during this pass, the redo threads for the failed instances are merged by SCN, and the redo is applied to the data files.

    • During this process of rolling back uncommitted transactions, Oracle uses a technology called fast-start fault recovery, in which it performs the transaction recovery as a deferred background activity. Oracle uses a multi-version and consistency method to provide on-demand rollback of only those rows blocked by expired transactions. This feature prevents new transactions from waiting for the rollback activity to complete.

    • Depending on how the transaction failed, the recovery is performed either by the SMON process or by the server process that initiated the transaction. The following output illustrates the various steps from the time the redo log is scanned to the activity performed by the SMON process completing the recovery.

Wed Aug 27 10:07:59 2014

Completed redo scan

read 3298 KB redo, 589 data blocks need recovery

* validated domain 0, flags = 0x0

Wed Aug 27 10:08:00 2014

Started redo application at

Thread 2: logseq 353, block 1930

Wed Aug 27 10:08:03 2014

Recovery of Online Redo Log: Thread 2 Group 3 Seq 353 Reading mem 0

Mem# 0: +SSKY_DATA/SSKYDB/ONLINELOG/group_3.266.854233685

Mem# 1: +SSKY_FRA/SSKYDB/ONLINELOG/group_3.259.854233687

Wed Aug 27 10:08:04 2014

Completed redo application of 1.93MB

Wed Aug 27 10:08:10 2014

Completed instance recovery at

Thread 2: logseq 353, block 8527, scn 36904445

498 data blocks read, 765 data blocks written, 3298 redo k-bytes read

Wed Aug 27 10:08:13 2014

Thread 2 advanced to log sequence 354 (thread recovery)

SMON[INST-TXN-RECO]:about to recover undo segment 1 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 2 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 3 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 4 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 5 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 6 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 7 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 8 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 9 status:3 inst:1

SMON[INST-TXN-RECO]:about to recover undo segment 10 status:3 inst:1

Wed Aug 27 10:08:15 2014

FAST-START PARALLEL ROLLBACK

Fast-start parallel rollback is performed by SMON, which acts as a coordinator and rolls back transactions using parallel processing across multiple server processes. The parallel execution option is useful when transactions run for a longer duration before committing. When using this feature, each node spawns a recovery coordinator and recovery process to assist with parallel rollback operations.

Setting the parameter FAST_START_PARALLEL_ROLLBACK enables the fast-start parallel rollback feature. This setting indicates the number of processes to be involved in the rollback operation. The valid values are FALSE, LOW, and HIGH. The default value is LOW.
  • LOW—DOP is set to 2 * CPU_COUNT

  • HIGH—DOP is set to 4 * CPU_COUNT
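As a minimal sketch (the value shown is illustrative, not a recommendation), the parameter can be changed dynamically across all instances:

ALTER SYSTEM SET FAST_START_PARALLEL_ROLLBACK = HIGH SCOPE = BOTH SID = '*';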

In most cases, transaction recovery can be monitored by querying the V$FAST_START_TRANSACTIONS view. However, if the recovery data is less than 10 undo blocks, Oracle fails to report data through this view, in which case querying the X$KTUXE view could be helpful:

SELECT usn,

state,

undoblockstotal                  "Total",

undoblocksdone                   "Done",

undoblockstotal - undoblocksdone "ToDo",

decode(cputime, 0, 'unknown',

SYSDATE + (((undoblockstotal - undoblocksdone) / (

undoblocksdone / cputime)) /

86400))

"Estimated time to complete"

FROM   v$fast_start_transactions;
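The following is a commonly used sketch against X$KTUXE; because X$ tables are internal, undocumented structures, this requires connecting as SYS, and the column names could change between releases. It lists the undo blocks of dead transactions still awaiting recovery:

SELECT ktuxeusn undo_segment,
       ktuxeslt slot,
       ktuxesqn seq,
       ktuxesta state,
       ktuxesiz undo_blocks_remaining
FROM   x$ktuxe
WHERE  ktuxesta = 'ACTIVE'
AND    ktuxecfl LIKE '%DEAD%';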

Note

In Oracle Database Release 12.1.0, there are improvements in the overall performance of the RAC environment. If multiple nodes in the cluster fail and only one surviving node remains, the RDBMS kernel module will shut down all LMS background processes and treat the cluster database as a single instance. The alert log entry "Decreasing number of real time LMS from 1 to 0" illustrates this.

Tuning Instance Recovery

One of the primary reasons for implementing a RAC solution is continued availability. When one server or instance fails, there are other servers and instances in the configuration that will support and continue to provide data availability to the application and users. This means that when one instance or server fails, one of the surviving instances should be able to read the changes made by the failed instance and perform any data recovery. This is the reason why it’s a requirement that redo log files are located on shared storage and accessible by all instances in the cluster.

When a server or instance fails, users connected to that instance can fail over or be reconnected transparently to one of the other instances, depending on the type of connection implementation. As discussed in the previous section, instance recovery is performed in two phases: cache recovery followed by transaction recovery. Until instance recovery is completed, the data changes made on the failed instance are neither available to users nor written to the datafiles.

What is being recovered? Data blocks changed by committed transactions (and hence recorded in the redo log files) but not yet written to the datafiles need to be written to the datafiles. This means the size and content of the redo log files have an impact on instance recovery. The first step, then, is to tune the redo log activity: how large the redo log files should be and how frequently they should switch.

Like most performance optimization situations, there is a positive and negative side to optimizing the redo log activity. Keeping the redo log files small and switching the redo log files more frequently (frequent checkpoint) would help reduce the recovery time; however, the I/O activity on the database files would increase.
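Before adjusting redo log sizes, it helps to see how much checkpointing is actually occurring. This is a hedged illustration using the standard checkpoint statistics in GV$SYSSTAT; a persistently growing gap between the "started" and "completed" counts generally indicates checkpoints are being requested faster than they can finish:

SELECT inst_id,
       name,
       value
FROM   gv$sysstat
WHERE  name IN ('background checkpoints started',
                'background checkpoints completed')
ORDER  BY inst_id, name;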

Depending on the number of datafiles in a database, a checkpoint can be a highly resource-intensive operation, because every datafile header has to be updated during the checkpoint. The idea is to balance, or trade off, performance when tuning the frequency of log switches. Ideally, it would be a good practice to have about four to five redo log switches per hour. Using the following query, the hourly redo log switch counts can be determined for the past week:

Script: MVRACPDnTap_redologswitches.sql

select

substr(to_char(first_time,'MM/DD-Day'),1,9) day,

to_char(sum(decode(to_char(first_time,'HH24'),'00',1,0)),'99') "00",

to_char(sum(decode(to_char(first_time,'HH24'),'01',1,0)),'99') "01",

to_char(sum(decode(to_char(first_time,'HH24'),'02',1,0)),'99') "02",

to_char(sum(decode(to_char(first_time,'HH24'),'03',1,0)),'99') "03",

to_char(sum(decode(to_char(first_time,'HH24'),'04',1,0)),'99') "04",

to_char(sum(decode(to_char(first_time,'HH24'),'05',1,0)),'99') "05",

to_char(sum(decode(to_char(first_time,'HH24'),'06',1,0)),'99') "06",

to_char(sum(decode(to_char(first_time,'HH24'),'07',1,0)),'99') "07",

to_char(sum(decode(to_char(first_time,'HH24'),'08',1,0)),'99') "08",

to_char(sum(decode(to_char(first_time,'HH24'),'09',1,0)),'99') "09",

to_char(sum(decode(to_char(first_time,'HH24'),'10',1,0)),'99') "10",

to_char(sum(decode(to_char(first_time,'HH24'),'11',1,0)),'99') "11",

to_char(sum(decode(to_char(first_time,'HH24'),'12',1,0)),'99') "12",

to_char(sum(decode(to_char(first_time,'HH24'),'13',1,0)),'99') "13",

to_char(sum(decode(to_char(first_time,'HH24'),'14',1,0)),'99') "14",

to_char(sum(decode(to_char(first_time,'HH24'),'15',1,0)),'99') "15",

to_char(sum(decode(to_char(first_time,'HH24'),'16',1,0)),'99') "16",

to_char(sum(decode(to_char(first_time,'HH24'),'17',1,0)),'99') "17",

to_char(sum(decode(to_char(first_time,'HH24'),'18',1,0)),'99') "18",

to_char(sum(decode(to_char(first_time,'HH24'),'19',1,0)),'99') "19",

to_char(sum(decode(to_char(first_time,'HH24'),'20',1,0)),'99') "20",

to_char(sum(decode(to_char(first_time,'HH24'),'21',1,0)),'99') "21",

to_char(sum(decode(to_char(first_time,'HH24'),'22',1,0)),'99') "22",

to_char(sum(decode(to_char(first_time,'HH24'),'23',1,0)),'99') "23",

to_char((sum(decode(to_char(first_time,'HH24'),'00',1,0))+

sum(decode(to_char(first_time,'HH24'),'01',1,0))+

sum(decode(to_char(first_time,'HH24'),'02',1,0))+

sum(decode(to_char(first_time,'HH24'),'03',1,0))+

sum(decode(to_char(first_time,'HH24'),'04',1,0))+

sum(decode(to_char(first_time,'HH24'),'05',1,0))+

sum(decode(to_char(first_time,'HH24'),'06',1,0))+

sum(decode(to_char(first_time,'HH24'),'07',1,0))+

sum(decode(to_char(first_time,'HH24'),'08',1,0))+

sum(decode(to_char(first_time,'HH24'),'09',1,0))+

sum(decode(to_char(first_time,'HH24'),'10',1,0))+

sum(decode(to_char(first_time,'HH24'),'11',1,0))+

sum(decode(to_char(first_time,'HH24'),'12',1,0))+

sum(decode(to_char(first_time,'HH24'),'13',1,0))+

sum(decode(to_char(first_time,'HH24'),'14',1,0))+

sum(decode(to_char(first_time,'HH24'),'15',1,0))+

sum(decode(to_char(first_time,'HH24'),'16',1,0))+

sum(decode(to_char(first_time,'HH24'),'17',1,0))+

sum(decode(to_char(first_time,'HH24'),'18',1,0))+

sum(decode(to_char(first_time,'HH24'),'19',1,0))+

sum(decode(to_char(first_time,'HH24'),'20',1,0))+

sum(decode(to_char(first_time,'HH24'),'21',1,0))+

sum(decode(to_char(first_time,'HH24'),'22',1,0))+

sum(decode(to_char(first_time,'HH24'),'23',1,0))),'999') total

from v$log

where trunc(first_time) > (trunc(sysdate) - 7)

group by substr(to_char(first_time,'MM/DD-Day'),1,9)

order by 1 ;

The output and the graphical view of the data (Figure 10-1) show high redo log switch rates throughout the day; the rates are extremely high between 2 AM and 10 AM every day. Further analysis of the data revealed that during this period the application has high batch load activity. Compared to the ideal switch rate of four or five per hour, the current redo log switch rates are significantly high.
Figure 10-1.

Redo log switches

For performance reasons, redo log switches should be kept to as low a rate as possible, balanced against the recovery times involved. Large redo log files with few switches per hour help reduce I/O contention and wait times. However, smaller redo log files with more frequent switches ensure data is moved from the log files to the data files, keeping the recovery work to a minimum should an instance crash. A more controlled method, such as setting the parameter FAST_START_MTTR_TARGET (discussed later in this chapter) together with its related advisory, should be preferred to improve recovery times.
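Redo logs cannot be resized in place; the usual approach is to add new groups of the desired size to each thread and drop the old groups once they are inactive and archived. The following is only a sketch: the group numbers, the +DATA diskgroup, and the 1 GB size are assumptions for illustration.

-- add larger groups to each redo thread (names and sizes are illustrative)
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 7 ('+DATA') SIZE 1G;
ALTER DATABASE ADD LOGFILE THREAD 2 GROUP 8 ('+DATA') SIZE 1G;
-- switch out of the old groups, then drop them once they are INACTIVE
ALTER SYSTEM ARCHIVE LOG CURRENT;
ALTER DATABASE DROP LOGFILE GROUP 1;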

Parallelism

Instance recovery should therefore be made efficient, which means it needs to be tuned. Instance recovery can be tuned using the parameter RECOVERY_PARALLELISM.

RECOVERY_PARALLELISM specifies the number of processes that participate in instance or crash recovery. To force serial crash recovery, this parameter should be set to 0 or 1. Along with RECOVERY_PARALLELISM, increasing the PARALLEL_EXECUTION_MESSAGE_SIZE value would help improve recovery slave session performance. In the same context, increasing PARALLEL_MIN_SERVERS from its default value of 0 would pre-spawn recovery slaves at startup, avoiding the need to spawn them when recovery is required and thus improving recovery performance.
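A minimal sketch of adjusting these parameters follows; the values are only examples, and parameters set with SCOPE = SPFILE take effect at the next instance restart:

ALTER SYSTEM SET RECOVERY_PARALLELISM = 4 SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET PARALLEL_EXECUTION_MESSAGE_SIZE = 16384 SCOPE = SPFILE SID = '*';
ALTER SYSTEM SET PARALLEL_MIN_SERVERS = 4 SCOPE = SPFILE SID = '*';

The following alert log extract from a recovering instance shows parallel recovery starting with four processes during instance recovery: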

Thu Aug 28 15:51:41 2014

Post SMON to start 1st pass IR

Thu Aug 28 15:51:41 2014

minact-scn: Inst 2 is now the master inc#:4 mmon proc-id:5372 status:0x7

minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.023bc527 gcalc-scn:0x0000.023bdc42

minact-scn: master found reconf/inst-rec before recscn scan old-inc#:4 new-inc#:4

Thu Aug 28 15:51:41 2014

Instance recovery: looking for dead threads

Thu Aug 28 15:51:43 2014

Submitted all GCS remote-cache requests

Fix write in gcs resources

Thu Aug 28 15:51:43 2014

Reconfiguration complete (total time 9.4 secs)

Thu Aug 28 15:51:45 2014

Beginning instance recovery of 1 threads

parallel recovery started with 4 processes

Thu Aug 28 15:51:46 2014

Started redo scan

Thu Aug 28 15:51:52 2014

Completed redo scan

read 2492 KB redo, 422 data blocks need recovery

* validated domain 0, flags = 0x0

Thu Aug 28 15:51:53 2014

Checkpoint

Another parameter that could help improve instance recovery time is FAST_START_MTTR_TARGET, which is disabled at its default value of zero. Setting it to a nonzero value (the target mean time to recover, in seconds) drives incremental checkpointing so that the number of blocks to be recovered during instance recovery stays bounded. The lower the value, the shorter the instance recovery time; however, this adds overhead to the performance of the database, because more frequent checkpointing increases the activity of the DBWR process and increases overall I/O. The ideal value of the parameter could be determined by using the advisory view V$MTTR_TARGET_ADVICE:

SELECT inst_id,

mttr_target_for_estimate MTFE,

dirty_limit              DL,

advice_status            Advice,

estd_cache_writes        ECW,

estd_total_ios           ETIO

FROM   gv$mttr_target_advice;

INST_ID       MTFE         DL ADVIC        ECW       ETIO

---------- ---------- ---------- ----- ---------- ----------

1        372       9609 ON          7079      21369

1         39       1000 ON         19306      33596

1        122       3151 ON          8902      23192

1        205       5295 ON          7079      21369

1        288       7440 ON          7079      21369

In the output, there is a great fluctuation in the estimated I/O operations (ETIO) as a result of setting the FAST_START_MTTR_TARGET parameter toward the maximum supported value (3,600 seconds). Reducing the value of this parameter would mean a higher frequency of incremental checkpointing, reducing instance recovery time; however, there could be a significant increase in DBWR activity. It's important to strike a balance between performance during normal activity and recovery time after an instance or database crash.

Setting a value for the FAST_START_MTTR_TARGET would mean the other recovery-related parameters (FAST_START_IO_TARGET, LOG_CHECKPOINT_INTERVAL, and LOG_CHECKPOINT_TIMEOUT) should be disabled.
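As a hedged sketch (the 300-second target is purely illustrative; the advisory output above should guide the actual value), the parameter can be set dynamically across all instances:

ALTER SYSTEM SET FAST_START_MTTR_TARGET = 300 SCOPE = BOTH SID = '*';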

Note

In Oracle Database 12c (12.1.0), the default value for PARALLEL_MIN_SERVERS is computed as CPU_COUNT * PARALLEL_THREADS_PER_CPU * 2, and the default value of PARALLEL_EXECUTION_MESSAGE_SIZE is set to 16,384.

Redo Log Sizing Advisory

Earlier, we discussed the various factors that influence recovery times. Among these, one item that stands apart from the rest is the redo log file size. DBWR and LGWR activity depend on the redo log file sizes. Whereas larger redo log files provide better performance, undersized redo log files increase checkpoint activity, thus increasing workload contention and CPU utilization. There should be an ideal value that maintains performance while at the same time reducing the time required to perform instance/crash recovery.

Using the data provided in the OPTIMAL_LOGFILE_SIZE column of the V$INSTANCE_RECOVERY view, the redo log file sizes can be tuned. However, the column is not populated by default; the advisory requires the parameter FAST_START_MTTR_TARGET to be set. Let's now walk through checking the current sizes and the advisor's suggestion.
  1. Determine the current size of the redo log files using the following query:

SELECT group#,

thread#,

sequence#,

bytes,

blocksize,

members,

archived,

status

FROM   v$log;

GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE    MEMBERS ARC STATUS

---------- ---------- ---------- ---------- ---------- ---------- --- ----------

1          1        393   52428800        512          2 NO  CURRENT

2          1        392   52428800        512          2 NO  INACTIVE

3          1        392   52428800        512          2 NO  INACTIVE

4          2        355   52428800        512          2 NO  CURRENT

5          2        354   52428800        512          2 NO  INACTIVE

6          2        354   52428800        512          2 NO  INACTIVE

From the output generated by executing the query, it can be seen that the current redo log file size is 50 MB.
  2. As discussed earlier, to determine the optimal redo log file size, the FAST_START_MTTR_TARGET parameter should be set. Check the current value of this parameter:

SQL> show parameter fast_start

NAME                                 TYPE        VALUE

------------------------------------ ----------- -------

fast_start_io_target                 integer     0

fast_start_mttr_target               integer     70

fast_start_parallel_rollback         string      LOW

From the preceding output, the current FAST_START_MTTR_TARGET value is set much higher than the default value.
  3. Check whether the advisory has populated the OPTIMAL_LOGFILE_SIZE column. Executing the following query against the V$INSTANCE_RECOVERY view shows the current recommendation for the redo log file size:

SELECT actual_redo_blks     ARB,

target_redo_blks     TRB,

target_mttr          TMTTR,

estimated_mttr       EMTTR,

optimal_logfile_size OLS,

ckpt_block_writes    CBW

FROM   v$instance_recovery;

ARB        TRB      TMTTR      EMTTR        OLS        CBW

---------- ---------- ---------- ---------- ---------- ----------

2089       7309         15          2        367       1040

Based on the current activity of the database, the effective target MTTR is 15 seconds, and the recommended redo log file size is 367 MB. Making these changes would help reduce the actual number of redo blocks to be recovered.

These measurements and the tuning of the redo log file size are only guidelines to help improve and optimize instance/crash recovery times. Instance recovery times can vary depending on whether the instance crash was due to a node failure or an instance failure (without node failure). In the case of a node failure, additional activity, for example, reconfiguring the group membership of the cluster, has to be completed before instance recovery begins. Once the clusterware completes the cluster-level reconfiguration, the LMON process takes over to complete the reconfiguration at the database level. Oracle records this in the LMON trace file of the instance that performs recovery. (The following output is from an Oracle Database version 12.1.0 instance recovery and contains more detail than the output of prior Oracle versions.)

*** 2014-04-02 22:23:22.077

kjxgmrcfg: Reconfiguration started, type 1

CGS/IMR TIMEOUTS:

CSS recovery timeout = 31 sec (Total CSS waittime = 65)

IMR Reconfig timeout = 75 sec

CGS rcfg timeout = 85 sec

kjxgmcs: Setting state to 4 0.

*** 2014-04-02 22:23:22.239

Name Service frozen

kjxgmcs: Setting state to 4 1.

kjxgrdecidever: No old version members in the cluster

kjxgrssvote: reconfig bitmap chksum 0x19a68269 cnt 1 master 1 ret 0

kjxgrpropmsg: SSMEMI: inst 1 - no disk vote

kjxgrpropmsg: SSVOTE: Master indicates no Disk Voting

2014-04-02 22:23:22.253464 : kjxgrDiskVote: nonblocking method is chosen

kjxgrDiskVote: Only one inst in the cluster - no disk vote

2014-04-02 22:23:22.468535 : kjxgrDiskVote: Obtained RR update lock for sequence 5, RR seq 4

*** 2014-04-02 22:23:22.589

2014-04-02 22:23:22.589726 : kjxgrDiskVote: derive membership from CSS (no disk votes)

2014-04-02 22:23:22.589788 : proposed membership: 1

2014-04-02 22:23:22.833840 : kjxgrDiskVote: new membership is updated by inst 1, seq 6

2014-04-02 22:23:22.833913 : kjxgrDiskVote: bitmap: 1

CGS/IMR TIMEOUTS:

CSS recovery timeout = 31 sec (Total CSS waittime = 65)

IMR Reconfig timeout = 75 sec

CGS rcfg timeout = 85 sec

kjxgmmeminfo: can not invalidate inst 3

.....

.....

.....

2014-04-02 22:23:22.939618 :

Reconfiguration started (old inc 4, new inc 6)

TIMEOUTS:

Local health check timeout: 70 sec

Rcfg process freeze timeout: 70 sec

Remote health check timeout: 140 sec

Defer Queue timeout: 163 secs

CGS rcfg timeout: 85 sec

Synchronization timeout: 248 sec

DLM rcfg timeout: 744 sec

List of instances:

1 (myinst: 1)

OMF 1

2014-04-02 22:23:22.943798 : * Begin lmon rcfg step KJGA_RCFG_FREEZE

* kjfc_pub_bigns: previous MM mode, no change unless there is a rora requester (MM->SM)

.....

.....

* dead instance detected - domain 0 invalid = TRUE

.....

2014-04-02 22:23:23.026708 :

2014-04-02 22:23:23.026750 :  Post SMON to start 1st pass IR

2014-04-02 22:23:23.043695 :  Submitted all GCS cache requests

2014-04-02 22:23:23.043950 : * Begin lmon rcfg step KJGA_RCFG_FIXWRITES

.....

.....

.....

2014-04-02 22:23:23.066174 :

Reconfiguration complete (total time 0.1 secs)

* DOMAIN MAPPINGS after Reconfiguration :

*   DOMAIN 0 (valid 0): 1

* End of domain mappings

.....

.....

.....

*** 2014-04-02 22:24:23.130

kjfmPriCheckAllowed: this is the only instance of this DB

kjfmPriCheckAllowed: instance must lower LMS priorities

kjfmPriCheck: # of real time LMS decreasing from 1 to 0

kjfmPriCheck: cannot make further changes to LMS priorities - single-inst mode

.....

.....

.....

*** 2014-04-02 22:27:21.439

global enqueue service detaching from CM:pid=22953

2014-04-02 22:27:21.479573 : kjxgrdisable: IMR recording device closed, terminating IMR

Crash Recovery

Crash recovery occurs when all instances in a RAC environment fail, or when the only instance of a single-instance database fails. Operations are identical to instance recovery; however, recovery is performed by the first instance that is started following the crash.

Thread Recovery

Thread recovery is used in both the crash recovery and instance recovery situations. The aim of a thread recovery is to restore the data block images that were in the cache of the failed instance but had not been checkpointed back to disk.

One redo thread is processed at a time, beginning from the last checkpoint and continuing to the end of the thread. All redo is applied from each thread before the next thread is started.

Online Block Level Recovery

Online block recovery is unique to the RAC implementation. Online block recovery occurs when a data buffer becomes corrupt in an instance’s cache. Block recovery will occur if either a foreground process dies while applying changes or if an error is generated during redo application. If the block recovery is to be performed because of the foreground process dying, then PMON initiates online block recovery. However, if this is not the case, then the foreground process attempts to make an online recovery of the block.

Under normal circumstances, this involves finding the block’s predecessor and applying redo records to this predecessor from the online logs of the local instance. However, under the cache fusion architecture, copies of blocks are available in the cache of other instances, and therefore, the predecessor is the most recent past image (PI) for that buffer that exists in the cache of another instance. If, under certain circumstances, there is no PI for the corrupted buffer, the block image from the disk data is used as the predecessor image.
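As a hedged illustration, the past images currently held across the cluster can be observed by querying GV$BH for buffers with the 'pi' status (a status value that applies only to RAC):

SELECT inst_id,
       COUNT(*) past_images
FROM   gv$bh
WHERE  status = 'pi'
GROUP  BY inst_id;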

Media Recovery

Any database is prone to failures; and during such failures, there could be loss of data due to data corruption, human error, or an unforeseen disaster. In the first two situations, the database is normally restored either completely (for example, when a disk goes bad) or partially (point in time, when a specific object needs to be restored). In the third situation, an unforeseen disaster, a new database will need to be configured and the data restored to it (if the external media is available), or a disaster recovery strategy will need to be implemented. Such a strategy requires tools such as Data Guard or Oracle Streams that allow users to connect to the disaster recovery location when the primary database is down.

Performance tuning strategies that apply to moving data off disk to backup media (backup) also apply to reading data from backup media and applying it back to disk (restore). Tuning backup requires looking at various layers of the hardware stack. Some of the key tuning principles include the following:
  1. Determine throughput: It's important to determine the maximum input disk, output media, and network throughput. The throughput of the various devices should be determined to understand how fast I/O can be done. For example, to test the maximum throughput of the disk I/O subsystem, tools such as ORION can help. Network throughput will depend on the type of configuration; for example, in a RAC environment, multiple instances could be used to perform the backup in parallel, which means the interconnect would be used to some extent. If the backup sets are being written to external media, it may be a good idea to have a dedicated backup network and, when tuning the network, to check the network buffer sizes, media management client/server buffer sizes, client/socket timeout parameter settings, and so forth.

  2. Configure disk for optimal performance: Using ASM for storage of archive logs and backup sets would help improve overall performance and would also improve recovery time.

  3. Optimize throughput to the backup destination: To get the maximum potential of the bandwidth between the backup process and the I/O subsystem (disk or tape), utilize asynchronous I/O.

    • For disk backup: On systems where native asynchronous I/O is not supported, set the DBWR_IO_SLAVES parameter to 4 per session.

    • For tape backup: Set the BACKUP_TAPE_IO_SLAVES parameter to TRUE, which allocates one I/O slave per channel process.

  4. Optimize backup methods: Utilizing tools such as RMAN to back up and restore data would help maintain a balanced backup strategy. When using RMAN backup, optimize the various RMAN parameters (see the sketch following this list):

    • Number of channels

    • Parallelism

    • Number of files per backup set, defined by the FILESPERSET parameter

    • Maximum size of a backup set

  5. Tuning RMAN: Apart from the configuration parameters discussed in the previous step, there are a few other parameters that may help tune the backup and restore operations:

    • _BACKUP_FILE_BUFCNT: the number of input buffers allocated per channel. It defaults to 16 and normally matches the number of disks. When the number of disks is large and a higher number of channels needs to be allocated, this value can be increased to improve overall performance; the goal is to achieve a balance between memory usage and I/O.

    • _BACKUP_FILE_BUFSZ: defaults to 1048576 and normally matches the stripe size of the I/O subsystem. When larger stripe sizes are used for the ASM diskgroups, the value of this parameter needs to be adjusted to match the stripe size.
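The following RMAN sketch shows the kind of settings referred to in the preceding list; the channel count, backup set size, and FILESPERSET value are illustrative assumptions, not recommendations:

RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 4 BACKUP TYPE TO BACKUPSET;
RMAN> CONFIGURE MAXSETSIZE TO 32G;
RMAN> BACKUP AS COMPRESSED BACKUPSET DATABASE FILESPERSET 4;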

Caution

Underscore parameters are undocumented and should be set or modified only with prior approval from Oracle support. Prior to Oracle Database 11g Release 2, _BACKUP_FILE_BUFCNT was called _BACKUP_KSFQ_BUFCNT, and _BACKUP_FILE_BUFSZ was called _BACKUP_KSFQ_BUFSZ.1

Fast Recovery Area

The fast recovery area (called flash recovery area in earlier versions of Oracle) is an Oracle-managed directory, file system, or ASM diskgroup that provides a centralized disk location for backup and recovery files. All the files you need to completely recover a database from a media failure are part of the fast recovery area.

Some of the recovery-related files stored in the fast recovery area include the following:
  • Current control file

  • Online redo logs

  • Archived redo logs

  • Flashback logs

  • Control file autobackups

  • Datafile and control file copies

  • Backup pieces

Implementing the FRA (fast recovery area) feature involves configuring two important parameters: DB_RECOVERY_FILE_DEST_SIZE and DB_RECOVERY_FILE_DEST.
  • DB_RECOVERY_FILE_DEST_SIZE specifies the maximum storage in bytes of data to be used by the FRA for this database. Note, however, that this value does not include certain kinds of disk overhead: Block 0 or the OS block header of each Oracle file is not included in this size. Allow an extra 10% for this data when computing the actual disk usage required for the FRA.

  • DB_RECOVERY_FILE_DEST_SIZE does not indicate the real size occupied on disk when the underlying file system is mirrored, compressed, or in some other way affected by overhead not known to Oracle. For example, if you configure the fast recovery area on a normal redundancy (two-way mirrored) ASM disk group, each file of X bytes occupies 2X bytes on the ASM disk group. In such a case, DB_RECOVERY_FILE_DEST_SIZE must be set to no more than 1/2 the size of the disks for the ASM diskgroup. Likewise, when using a high redundancy (three-way mirrored) ASM diskgroup, DB_RECOVERY_FILE_DEST_SIZE must be no more than 1/3 the size of the disks in the disk group, and so on.

As with any feature that requires storage space, the first step in the process is to size the storage area that will contain the FRA data. Sizing should be based on the kinds of files, and the retention level, to be kept in this location. The recovery area should be able to contain a copy of all datafiles in the database plus the incremental backups per the backup strategy. Oracle has provided a formula to calculate the space required for the FRA location:

space required = size of a copy of the database
                 + size of an incremental backup
                 + size of (n + 1) days of archived redo logs
                 + size of (y + 1) days of foreign archived redo logs (for logical standby)
                 + size of the control file
                 + size of an online redo log member × number of log groups
                 + size of flashback logs (based on the DB_FLASHBACK_RETENTION_TARGET value)

where n is the interval in days between incremental updates, and y is the delay in applying the foreign archived redo logs on a logical standby database.
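As a purely illustrative calculation (every figure here is an assumption), a 500 GB database with a 50 GB incremental backup, 20 GB of archived redo per day with n = 2, no logical standby, a 1 GB control file, six 1 GB online redo log members, and roughly 30 GB of flashback logs would need about 500 + 50 + (3 × 20) + 1 + 6 + 30 = 647 GB, before adding the roughly 10 percent overhead mentioned earlier for DB_RECOVERY_FILE_DEST_SIZE.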

Once the size is determined, the physical configuration of the FRA could be initiated. Assuming that the database is already configured for archiving, and the database is currently in ARCHIVELOG mode, the first step to this process is to connect to the target database using RMAN:
  1. Specify the size of the FRA using the following command:

ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = <> SCOPE = SPFILE;

  2. Specify the location of the FRA using the following command:

ALTER SYSTEM SET DB_RECOVERY_FILE_DEST ='+SSKY_FRA'  SCOPE = SPFILE;

The preceding steps will configure the FRA, and the area will be managed by RMAN during its routine backup operations based on predefined retention policies. It’s important that the destination file location specified by the parameter is on shared storage. Configuring this area to reside on ASM storage provides good performance benefits.
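If archived redo logs are also to be written to the FRA (a site-specific decision, shown here only as a sketch), the archive destination can be pointed at it:

ALTER SYSTEM SET LOG_ARCHIVE_DEST_1 = 'LOCATION=USE_DB_RECOVERY_FILE_DEST' SCOPE = BOTH SID = '*';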
  3. The FRA is automatically managed for space utilization based on the retention policies defined in RMAN. The space utilization can be verified using the following scripts:

14:04:01 SQL> SELECT * FROM V$RECOVERY_FILE_DEST;

NAME                             SPACE_LIMIT SPACE_USED SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID

-------------------------------- ----------- ---------- ----------------- --------------- ----------

+SSKY_FRA                         5368709120  235929600                 0             117          0

Elapsed: 00:00:02.23

14:04:13 SQL> SELECT * FROM V$FLASH_RECOVERY_AREA_USAGE;

FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE NUMBER_OF_FILES     CON_ID

----------------------- ------------------ ------------------------- --------------- ----------

CONTROL FILE                           .37                         0               1          0

REDO LOG                              3.98                         0               4          0

ARCHIVED LOG                          4.05                      2.01              28          0

BACKUP PIECE                          3.94                      3.86               8          0

IMAGE COPY                           15.94                     10.43              76          0

FLASHBACK LOG                            0                         0               0          0

FOREIGN ARCHIVED LOG                     0                         0               0          0

AUXILIARY DATAFILE COPY                  0                         0               0          0

8 rows selected.

  4. From time to time, errors could be generated for various reasons. A common reason is when the FRA is not purged and is reaching its maximum limit. This can be checked using the following script:

SELECT object_type,

message_type,

message_level,

reason,

suggested_action

FROM   dba_outstanding_alerts;

Note

If the FRA is not large enough to hold the flashback logs, then the database will overwrite flashback logs from the earliest SCNs. Consequently, the flashback database window can be shorter than the value specified by the flashback retention target.

Configuring the FRA is a good practice for overall recovery operations using RMAN. It provides one holistic location for the backup and recovery process and makes restore and instance/crash/media recovery operations easier.

Conclusion

As discussed in the early chapters of the book, the primary reasons for moving to a RAC environment are availability and then scalability. For business continuity after an instance, server, or database crash, it's important that the data is available to users. This can be achieved by optimizing data recovery. In the case of crash or instance recovery, after the first pass (cache recovery) is completed, users can start accessing the data. The second phase of recovery (transaction recovery) can take longer and completes in the background. Depending on the business and SLA requirements, using some of the parameters discussed could help reduce the time required to complete recovery; however, the optimization should be measured against the overall performance of the environment.

Although backup is important, equally important are the restore operation and the recovery of the database from failures. However, not much attention is usually given to optimizing the restore operation. There are only a few parameters or areas of the system that need to be tuned specifically to improve the efficiency of the backup and restore operations. Most of the tuning done in other areas of the hardware infrastructure and the database should also help improve backup and restore performance, for example, tuning the I/O subsystem, the network, the database, and so forth.

Footnotes

  1. Chien, Timothy and Greg Green, "Recovery Manager (RMAN) Configuration and Performance Tuning Best Practices." Oracle OpenWorld, 2010.

Copyright information

© Murali Vallath 2014
