
Creating a Road Map

Robert Pearl


You will begin your journey on the road to a healthy SQL Server infrastructure with a checklist. A checklist is a simple way to organize the things you’ll need to do, both long and short term. It provides an agenda to keep you on track and is a tangible item that you can refer to and affirmatively say, “I am done with these items; only three more to go.” It is also something you can present to management and colleagues to show that you have documented your servers. Moreover, the checklist gives you clarity and purpose, allows you to focus on what needs to be done, keeps you on schedule, and acts as a motivator when you’re not sure what to do next. Follow the checklist and follow the road map that I am laying out for you here to achieve a certifiably healthy SQL Server environment.



The checklist serves three major purposes, which will be expanded upon in this chapter. It helps you know what to collect, it helps you know what to compare, and it helps you know what to document. By starting out with a checklist, you can be methodical about thoroughly evaluating your SQL Server instances.

Before you proceed to build your checklist, I will talk about the terms you will need to know. This terminology will help you sound intelligent when talking about how you know your servers are healthy and what you are doing to make them so. Then I will discuss the need for an inventory and show you how to take one. Afterward, you will return to building the healthy SQL checklist.

Statistics and Performance

Enter the world of statistics 101. No, you don’t need to be a statistician (statistics are boring!); you should just be familiar with basic terms so you can discuss certain topics with your peers, project managers, and higher-ups in relation to how your SQL Server instances are performing and so you can build reports based on performance trends.

Understanding the Terms

One nonstatistical term you will be hearing a lot throughout this book and your database administrator (DBA) travels with respect to performance is bottleneck. Literally, a bottleneck is the narrowest part of a bottle. In SQL Server usage, a bottleneck is something that slows down the performance of a server, usually to a point noticeable by the end user. A bottleneck implies an impasse: a condition in which the entire SQL Server system, or some component of it, is limited by a resource. You might say there is a “resource bottleneck” on your server.

I’ll be using the following terms extensively throughout the book:
  • Baseline: The baseline is the norm. It is the measure of performance of your SQL Server instances when they are operating normally without any issues and is your typical state of performance. It is a starting point to use for future data comparisons; you can use this calculation to measure average performance over time.

  • Delta: Delta’s most common meaning is that of a difference or change in something. To calculate the delta as the difference, you must have two points to work with. You need a starting point and an ending point (or two points in between if you are looking for a specific difference). To figure out the change, you subtract your beginning point from your ending point.

  • Deviation: This just means how far from the normal baseline the performance statistics are or how varied they are from the average baseline numbers.

  • Trending: This means beginning to deviate from normal operation. Over time you will see your baselines move in a particular direction, which will indicate the prevailing tendency of the performance of your SQL Server.

  • Measures: This is an instrument or a means to measure or capture data on how a SQL Server is performing. One example of a measure is a performance counter or object. You use counters and objects to measure your performance.

  • Metric: A metric is the measurement of a particular characteristic of a SQL Server’s performance or efficiency.

  • Metadata: Basically this is informational data about, or properties of, your data. The information that you will collect about the performance of your SQL Server is referred to as metadata.

  • Historical data: Historical data is data saved, stored, and persisted so that it survives a restart of the SQL Server services, the flushing of in-memory processes, or a manual clearing of statistics. It exists in tables on the disk storage in the database and enables you to view and compare performance statistics over time, reaching into historical points in time.

  • Peak: Peak is the time during your busiest server activity, usually during production/business hours or during batch processing.

  • Nonpeak: Conversely, nonpeak time is the most inactive period, usually off-peak or during nonbusiness hours. Activity is at its quietest during nonpeak hours.

Applying Real-World Principles of Statistics to Performance

Let’s look at establishing a baseline as an example of applying a principle of statistics to performance management of your database. To increase the performance of your database applications, you monitor your database for a few days on your server during regular business hours. You’ll also find yourself monitoring daily after you install and deploy your SQL Server, when you start to add load to it. If your peak load continues to operate at the same level, without any ostensible performance issues, then you are operating at a normal baseline. This will be your average norm. As long as you are not experiencing any serious bottlenecks or downtime, your SQL Server is in a good, average range. This is your mean (or average) performance range. Take performance measurements at regular intervals over time, even when no problems occur, to establish a server performance baseline. In addition, you can use your baseline to discern frequent patterns and understand how the server is utilizing resources. To determine whether your SQL Server system is performing optimally, you can then compare each new set of measurements with those taken earlier.
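As a minimal sketch of the idea (the table name dbo.PerfBaseline and the choice of counter here are illustrative, not from any prescribed checklist), you could persist periodic samples and then compute a delta between the two most recent readings:

-- Illustrative baseline table; run the INSERT on a schedule (for example, a SQL Agent job)
CREATE TABLE dbo.PerfBaseline (
    sample_time  DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
    counter_name NVARCHAR(128) NOT NULL,
    cntr_value   BIGINT        NOT NULL
);

INSERT INTO dbo.PerfBaseline (counter_name, cntr_value)
SELECT RTRIM(counter_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%';

-- The delta: ending point minus starting point, here the two most recent samples
WITH s AS (
    SELECT cntr_value,
           ROW_NUMBER() OVER (ORDER BY sample_time DESC) AS rn
    FROM dbo.PerfBaseline
    WHERE counter_name = 'Page life expectancy'
)
SELECT (SELECT cntr_value FROM s WHERE rn = 1)
     - (SELECT cntr_value FROM s WHERE rn = 2) AS delta;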

The following areas affect the performance of SQL Server:
  • System resources (hardware)

  • Network architecture

  • The operating system

  • Database applications

  • Client applications

At a minimum, use baseline measurements to determine the following:
  • Peak and off-peak hours of operation

  • Production query or batch command response times

  • Database backup and restore completion times

Once you’ve established a server performance baseline, you can compare the baseline statistics to the current server performance. Numbers far above or far below your baseline may indicate areas in need of tuning or reconfiguration and are candidates for further investigation. For example, if the amount of time to execute a set of queries increases, examine the queries to determine whether they can be rewritten, whether column statistics can be updated, or whether new indexes must be added. Real-time performance and trend analysis of resource utilization is also possible.

You may even need to create multiple baselines on a single SQL Server, depending on the situation. Even though you can expect your performance to be steady without any changes, you can expect your SQL Server to perform differently under different circumstances. Your servers may be humming along from your starting baseline, and then at night your backup operations kick in, or you run DBCC integrity checks. Both of these are I/O-intensive operations and will surely send the I/O stats higher than the usual baseline.

Therefore, you may want to create another baseline for different periods, when you expect performance behavior to change. The maintenance window where you take backups, rebuild indexes, run database integrity checks, and so on, is grounds for another baseline. So, you’ll have your typical performance during normal business hours but have alternate performance, which in fact would be typical and consistent with overnight maintenance or batch processing. Another period where you can expect performance to be taxed is in companies that have monthly, quarterly, and year-end processing, when users are running reports at these times. You can expect an increase in load and therefore can have another baseline.

Once you establish a baseline for each server, you continuously collect and sample data and observe whether there are any changes since the baseline was first created. If all the server measurements are performing well within the mean and operating at the expected baselines, you can consider SQL Server to be healthy. If there are any differences, you will have to determine what in the environment may have changed and why these changes are occurring and then adjust your baselines accordingly.

Any change going forward in time from the original baseline is a deviation from the norm. The incremental changes are what are known as the deltas. Deltas could indicate a number of changes in the environment and could be as simple as a service pack upgrade, code change, new database created, and so forth. It could be related to performance changes over time, say because of index fragmentation; a depletion of resources; or something deployed to the server.

The changes that occur may skew your baseline. The changes that you make can resolve the performance issue, and then you need to reset the baseline. When you observe these deviations, you can identify any health issues and seek to remediate them in accordance with best practices. You can further compare collected data to established best practices and identify these potential changes to your SQL Server instances.

Once you’ve had a chance to analyze the data and examine the differences, you can determine what changes, if any, should be implemented. Sometimes, established or industry best practices may not be a best practice in your environment. What is best practice for some may not necessarily be best practice for others. That is why to truly establish your own best practices, you set a baseline, trend the data over time, measure the deltas, and determine which best practices should be implemented.

Creating and maintaining a baseline is similar to the life-cycle development process, which consists of the analysis and requirements phase, the design and development phase, then testing and QA, and finally deployment.


Of course, before you can build and use your checklist, you will want to create another list, called the inventory list. If you are a new database administrator (DBA) on the job, an accidental DBA inheriting new SQL Server instances, or a database consultant at a new client site, you want to first answer the question, “Where are all the SQL Server instances?” You need to find out basic information about the currently existing SQL Server instances in the environment. You can call this elementary and important exercise discovery. In other words, before you can do anything else, you need to do a discovery of the SQL Server landscape and answer these questions to create an inventory:
  • Where are the servers? Are they in the same domain?

  • What are the server names/instance names/ports/IP addresses?

  • Can you connect? Do you connect via SQL Server Management Studio (SSMS) or Remote Desktop Protocol (RDP) using the command mstsc?

  • What version, build, service pack, and edition are the servers?

  • Are they set up with Windows Authentication or SQL Authentication (sysadmin/sa passwords)?

  • Are they clustered? Are they virtual or physical?


Before you do anything else, you need to do a discovery of the SQL Server landscape by taking inventory.

Ideally, you also can get this inventory from existing documentation, spreadsheets, and asset management repositories. If none of this exists, then you will have to discover the servers on your own, with the help of some scripts. You can use a handy little command-line utility called sqlcmd, available and updated since SQL Server 2005 as part of the Microsoft SQL Server bundle.

You can use sqlcmd to execute Transact-SQL (T-SQL) queries and procedures directly from the command line. For the full syntax of the command, refer to the sqlcmd utility reference on TechNet.

You can get a list of all the SQL Server instances broadcasting in a Windows network by opening a DOS command prompt window and running sqlcmd -L. Figure 2-1 shows a DOS window with sqlcmd -L running.
Figure 2-1.

Sample output of sqlcmd -L

There are a number of ways to collect basic information about SQL Server, such as its version, edition, and other top-level properties. What is the build and current service pack level? What is the instance name vs. the actual server name? Is it clustered or stand-alone? Is it virtual or physical? What is the security mode of the server? Another thing I’m often asked is what version of .NET is currently installed on SQL Server? In the following text, I will provide a simple and quick way to get all this information from one script.

Although you can get some of this information from the older, time-tested T-SQL command select @@version, which returns system and build information for the current installation of SQL Server, its output is one long nvarchar string. The @@version reference is available on MSDN.

You would in fact need to parse the string to derive some of its useful information in a readable format that could be output to a column and thus stored in a table. You can also use, as is suggested, the SERVERPROPERTY (T-SQL) function to retrieve the individual property values; a complete reference is available on MSDN.

The SERVERPROPERTY function has been around for several versions now and has added some useful parameters in SQL Server 2012 through 2014. One obvious example, with the release of SQL Server 2012, is the AlwaysOn parameter IsHadrEnabled. This parameter tells you whether AlwaysOn availability groups are enabled on a server instance. Again, this is available only in SQL Server 2012 and newer.

So, when I need a quick way to retrieve this basic data, I run a script that I put together that uses both SERVERPROPERTY and @@version. You can use a batch script to run it against multiple SQL Server instances in the environment (see the sqlcmd loop after the listing). Here is the GetSQLServerEngineProperties script:



DECLARE @ver NVARCHAR(128)
DECLARE @majorVersion NVARCHAR(4)

-- Derive the major version (for example, '10.5') from the product version string
SET @ver = CAST(SERVERPROPERTY('ProductVersion') AS NVARCHAR(128))
SET @ver = SUBSTRING(@ver, 1, CHARINDEX('.', @ver) + 1)
SET @majorVersion = CAST(@ver AS NVARCHAR(4))

SELECT SERVERPROPERTY('ServerName') AS [ServerName],
    SERVERPROPERTY('InstanceName') AS [Instance],
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS [ComputerNamePhysicalNetBIOS],
    SERVERPROPERTY('ProductVersion') AS [ProductVersion],
    CASE @majorVersion
        WHEN '8.0'  THEN 'SQL Server 2000'
        WHEN '9.0'  THEN 'SQL Server 2005'
        WHEN '10.0' THEN 'SQL Server 2008'
        WHEN '10.5' THEN 'SQL Server 2008 R2'
        WHEN '11.0' THEN 'SQL Server 2012'
        WHEN '12.0' THEN 'SQL Server 2014'
    END AS [SQLServerVersion],
    SERVERPROPERTY('ProductLevel') AS [ProductLevel],
    SERVERPROPERTY('Edition') AS [Edition],
    CASE SERVERPROPERTY('IsClustered')
        WHEN 0 THEN 'NO'
        WHEN 1 THEN 'YES'
    END AS [IsClustered],
    CASE SERVERPROPERTY('IsIntegratedSecurityOnly')
        WHEN 0 THEN 'SQL Server and Windows Authentication'
        WHEN 1 THEN 'Windows Authentication only'
    END AS [SecurityMode],
    SERVERPROPERTY('IsHadrEnabled') AS [IsHadrEnabled], -- NULL prior to SQL Server 2012
    @@VERSION AS [FullVersionString]

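To run the script against many instances, one simple approach (the file names servers.txt and GetSQLServerEngineProperties.sql here are illustrative) is to drive sqlcmd from a loop in a Windows batch file:

:: Runs the properties script against each instance listed in servers.txt,
:: using Windows Authentication (-E). Instance names containing a backslash
:: may need sanitizing before being used in the output file name.
for /f "usebackq delims=" %%s in ("servers.txt") do (
    sqlcmd -S %%s -E -i GetSQLServerEngineProperties.sql -o "%%s_properties.txt"
)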
You can also use a central management server (CMS), on versions SQL Server 2008 and newer, to run T-SQL scripts and queries against multiple instances of SQL Server simultaneously by creating server groups that contain the connection information for one or more instances of SQL Server. You designate one instance to manage all the others and serve as the CMS. You can create a CMS in SSMS by selecting View ➤ Registered Servers ➤ Central Management Servers. Right-click and select Register Central Management Server to register an instance of the CMS, as shown in Figure 2-2. To learn more about creating a CMS, see the MSDN documentation.
Figure 2-2.

Registering the CMS

The script uses what I see as the most significant properties for the purposes I mentioned. It comes in handy when doing inventory and Microsoft licensing true-ups, which are similar to audits, to reconcile the number of purchased SQL Server licenses vs. the number of running and deployed SQL Server engines. After you collect the data, you can store it, parse it, query it, and run reports on it. You also can add some of the other properties and modify the script to your own desires.

You can take a look at the SQL Server properties via T-SQL. The script I use runs on versions SQL Server 2005 through SQL Server 2014. Figure 2-3 shows the output for one SQL Server instance.
Figure 2-3.

Server properties for one server

For the SQL Server DBA and from a healthy SQL perspective, when it comes to implementing changes in production, you must take a measured approach. Any changes, whether it is new code, a new service pack or update, or a modification in configuration settings, must be tested on a separate SQL Server that is dedicated to thorough and scripted testing.

Once you determine which changes should be made and go through the testing cycle, the changes should be documented and approved by some change-control process. Change control can be as simple as multiple signatures on a piece of paper from all the stakeholders, agreeing that these are the changes that will be made. More sophisticated systems include enterprise change control systems that serve as a ticketing, documenting, and approval mechanism to track, streamline, and automate the process as much as possible.

In addition to having a change-control process in place, especially for code changes, there should be a rigorously tested deployment or rollout script. The rollout scripts should always be accompanied by rollback scripts. These rollback scripts, which will roll back any changes should the deployment fail for any reason, will quickly and seamlessly let you fall back to where the server was before the changes. This is almost on par with a Windows OS system restore to a recovery point. Such a script contains all the changes that will be made to the SQL Server instances and ensures a smooth rollout.

The sign-off gives the DBA the go-ahead to deploy these changes. Of course, I argue that the DBA should be one of the stakeholders and involved in the change-control process. The DBA should be at those meetings, or at least be aware of the scheduled changes to take place. The following are the typical testing life-cycle phases that lead to implementation:
  • Unit testing: You unit test each piece of code.

  • Integration testing: When putting several units together that interact, you need to conduct integration testing to make sure that integrating these units has not introduced any errors.

  • Regression testing: After integrating (and maybe fixing), you should run your unit tests again. This is regression testing to ensure that further changes have not broken any units that you’ve already tested. The unit testing you already did has produced the unit tests that can be run again and again for regression testing.

  • Acceptance tests: When a user/customer/business receives the functionality, they (or your test department) will conduct acceptance tests to ensure that the functionality meets their requirements.

Once you approve the changes, go through the normal testing cycles, and send the changes to production deployment, your baseline will be out-of-date. Since SQL Server instances are never static, you need to repeat the data collection process by updating your baselines.

Essentially, the process of creating and modifying baselines, comparing them to evolving data trends, and implementing changes is what makes up the SQL Server health check flow and part of the performance tuning life cycle (PTL). The PTL essentially organizes database performance tuning in a way that allows you to methodically capture and compare metrics and identify performance issues.

The PTL is similar to the SQL Server health check and follows a similar pattern as shown in Figure 2-4. Each step is part of the integral Healthy SQL methodology by which you determine what to collect, establish a baseline of the collected data, calculate the changes or delta of changes over time, and identify any potential health issues. Then you establish in-house best practices that keep data close to the baseline or norm. Then you determine and document the changes to be implemented in production. Once any changes are made, you repeat the cycle from the beginning. In other words, any time there is a change, you need to start collecting new performance data and establish a new baseline or norm.
Figure 2-4.

The SQL Server health check and performance tuning life cycle
To summarize, the following are the steps taken in a SQL Server health check:
  1. Collect (via the checklist)

  2. Establish the baseline of the collected data

  3. Calculate the deltas/changes over time

  4. Identify any health issues

  5. Establish best practices

  6. Determine changes/implement in production

  7. Repeat the cycle (collect new data and update the baseline)

You can also see the similarities of the healthy SQL health check compared to the PTL cycle in Figure 2-4.

The PTL consists of defining metrics to collect, creating your baseline, comparing deltas and identifying bottlenecks or any health issues, testing changes/resolutions, implementing, and modifying your baseline. You regularly repeat this exercise because SQL Server performance is never static! It constantly changes over time.

The Checklist

Let’s get back to the business of checklists. A checklist will aid you in gathering detailed information about your SQL Server instances and help document them, collecting information such as hardware, instance, and operating system configuration. You can use other checklists to help you set up and collect performance metadata and statistics.

You must also be aware that there is no checklist that is one-size-fits-all. There are several checklists that collect different information for different purposes, all within the SQL Server realm. Each organization or company might have its own customized checklist to fit its internal needs. It is an evolving list that is constantly updated and changed. Once a checklist is in place, it can be used by other IT professionals and is a great place to start for new members of the DBA team. To wit, the checklist as documentation will be extremely valuable to anyone who will be managing or monitoring the databases.

You might have a pre-installation checklist, an installation checklist, a configuration settings checklist, a cluster checklist, a SQL services checklist, an audit checklist, a backup and maintenance checklist, a currently installed features checklist, DBA daily task checklists, and so forth.

So, where does the checklist originate? There are several resources online and checklists put together by various database professionals, including myself, available on the Web. Most of the information in a solid checklist comes primarily from experience and observation. When you are continuously performing repetitive tasks, a DBA checklist will help you more efficiently manage tasks, better organize your day, and ultimately make you a better database professional. You also want to build a checklist that others can use and ultimately follow.

So, while I will include a template for you to create your own healthy SQL checklist, I will focus on key areas about health checks and the importance of having a DBA checklist. The checklist that you are going to use and create is one that helps you with your objective of achieving a healthy SQL Server environment.

The key to a checklist is that it creates your initial documentation of your SQL Server instances. Now there are several ways to document your SQL Server, and many scripts, tools, and third-party software programs can effectively automate this process. You will learn about these tools later in the book.

In addition, a checklist serves as a foundation for creating a baseline of data that you can use for future comparison. A baseline is a major component of healthy SQL and one that will be covered extensively in this book. To accurately measure performance, a baseline of data is essential for future comparison. You create and use a checklist to keep track of the items you want to capture for the health check. The checklist helps you know what information to collect.

Before you collect this information, you should first know exactly what this collection of information is. So, what is metadata? Metadata is simply data about your data. Traditionally, there is structural metadata, which gives you information about the structure of a database schema, such as tables, columns, data types, and so on, rather than the data itself. There is also system metadata; the functions SERVERPROPERTY and @@version (covered earlier in the chapter), which return data about your various SQL Server properties, are prime examples.

When I begin to talk about dynamic management views (DMVs) and functions in more depth, you will be heavily relying on what I call performance metadata. When you query these system views and functions, you are accessing a goldmine of performance data. The workload on the server impacts the system resources when doing such operations as updating, inserting, deleting, and retrieving data. Throughout the next chapters, you will be collecting, storing, and analyzing all of this performance metadata and using it to determine whether there are any performance issues that should be addressed. You will then create reports to visualize this data and present it to management. The report will help you identify areas for improvement and help you determine what steps to take.

It is important to understand metadata as an essential term, because every piece of information about the state of health of your SQL Server instances and the performance and configuration data that you will collect and use as the main artery of healthy SQL is metadata.

What to Collect

To do a comprehensive performance analysis and review of the company’s SQL Server infrastructure, its existing SQL Server instances, and the databases residing there, you will want to collect several metrics in specific categories.

You want to collect a pool of data that will give you an overview of the server and database configurations, the key performance indicator (KPI) performance counters and measures, the SQL Server metadata, and other various information that will be included in a health check. This information will be used to highlight where the SQL Server instance can be improved.

Because there are so many things that can affect performance, security, and the overall health of your SQL Server instances, a good place to start is getting an existing snapshot of the SQL Server environment. What I am going to share with you in this book is my somewhat-famous, 15-point health check, which is what I offer my clients (see Figure 2-5). When I first come into a client’s network, it is what I use to find out the current state of affairs of their SQL Server instances. I can come into an environment and pretty much know the versions, configuration, and resources without asking any questions. This is based on the scripts I use to capture the information set out by the checklists I use.
Figure 2-5.

My 15-point health check shown as a star schema

It is this precise list that I have used successfully with all my clients. This initial metadata capture is an important start in identifying any issues that will potentially degrade performance and compromise security. I run some various scripts I have compiled and created over the years to take a snapshot of the SQL Server. This snapshot reflects a point-in-time overview on the state of the health of the SQL Server. A report is generated that you can use to identify areas for improvement, present to the client, and make best-practice recommendations.

Here are the top-level categories:
  • Health or state of the SQL Server information

  • Configuration

  • Performance

  • Security

Optimal performance means sufficient resources are allocated for SQL Server instances. You must consider these resources in your performance review and overall system health. The prime factors relating to adequate resources are the central processing unit (CPU), memory, and disk I/O. Other configuration factors, such as proper tempdb configuration, also help with improved performance. Drilling down into query performance, locking, blocking, and deadlocks are common causes of slow performance. Virtualization and its rapid adoption, especially for SQL Server instances, have introduced additional considerations and caveats. While CPU, memory, and disk I/O are still critical to SQL Server performance, virtualization can mask issues in these areas.

CPU, Memory, I/O, Locking, and Blocking

There are key areas to focus on when seeking to identify bottlenecks. If you can capture this data, create a baseline, and trend it over time, you can discern when things go awry with performance. In many cases, you will leverage the available dynamic management objects (DMOs) in SQL Server, which do not persist this data; they only aggregate it since the SQL Server instance was last started or the statistics were cleared. Therefore, for each category of measure, you create data points in time and collect historical data by persisting this information in archive tables.

So, categorically from a high-level view, the things you should capture for your performance health check review are system configuration options, performance counters (Windows OS and SQL Server), DMOs, and wait statistics. Each of these provides you with a piece of the puzzle and window into the state and health of your SQL Server instances. You will learn about each of these in further detail, and I even dedicate upcoming chapters to them.

You will heavily rely on a specific collection of system views, mostly DMOs that are available to you from more than 200 such dynamic views and functions. You will use the DMOs to baseline CPU, random access memory (RAM), and I/O usage.

When you think about the resources that SQL Server uses, you can analyze the utilization of them at the server level, at the database level, and even at the query level. Capturing the resource bottleneck at the server level will tell you where SQL Server is being pegged. For example, say by analyzing your wait statistics, you see that CPU is a top wait, so you subsequently run Performance Monitor on CPU processor time and observe that it is consistently greater than 90 percent for a significant amount of time. You can even pinpoint which CPU processor is actually pegged. Then, you can find out, by percentage, which database on SQL Server is accounting for the highest use of CPU. Once you know the database in question, you can further narrow your top expensive query analysis to pinpoint the exact query consuming the most CPU and then begin to fine-tune that query accordingly. You can even correlate the offending query to the CPU processor or scheduler reporting high CPU time. I will provide several methods and scripts to correlate this data throughout the book.
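As a hedged sketch of that last drill-down step (this is a common community pattern against the plan cache, not one of this chapter's scripts), you can rank cached statements by average CPU:

-- Top 10 cached statements by average worker (CPU) time, in microseconds
SELECT TOP (10)
       qs.total_worker_time / qs.execution_count AS avg_cpu_microsec,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset END
             - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_cpu_microsec DESC;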

Collecting Information from the System Catalog

The system catalog is a set of views showing metadata that describes the objects in an instance of SQL Server. This metadata is fairly static; it changes occasionally but not often. Nonetheless, it is something that should be monitored; that is, you want to capture any changes to the SQL Server configuration values and even set up an alert. The catalog views replaced many of the system stored procedures and system tables. The catalog views in SQL Server are organized into more than two dozen categories. For a complete list of these categories, see the catalog views reference on MSDN.

With respect to the SQL Server system catalog information, you will capture only some basic metadata pertinent to your health checks. sys.configurations, part of the serverwide configuration catalog views, provides the configuration options set at the server level. This information is commonly displayed via the legacy sp_configure system stored procedure. You can also view configuration settings via sys.sysconfigures. Among the most relevant settings you will find here are the max server memory and min server memory configuration options. These are often the ones that are not properly configured and should be changed accordingly from the default values.
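For example, a quick check of those two settings (the names below are exactly as they appear in sys.configurations):

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('max server memory (MB)', 'min server memory (MB)');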

SELECT * FROM sys.dm_os_performance_counters returns all the existing SQL Server performance counters, including the object_name, counter_name, instance_name, the counter value, and the counter type. How this data can be calculated will be explained in the next chapter.
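For instance, to pull just the buffer manager counters, note that object_name carries an instance prefix ('SQLServer:' for a default instance, 'MSSQL$InstanceName:' for a named one), hence the LIKE filter:

SELECT RTRIM(object_name)  AS object_name,
       RTRIM(counter_name) AS counter_name,
       cntr_value,
       cntr_type
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%';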

You can find more information about your databases, such as names, IDs, status, database options, recovery models, sizes, physical locations, and more in the Database and Files Catalog Views category. You can inspect the data from sys.database_files, sys.databases, and sys.master_files to find the database settings and parameters that are of interest; the complete reference to these catalog views is on MSDN.
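A minimal sketch joining two of these views to list each database's state, recovery model, and file layout:

SELECT d.name              AS database_name,
       d.state_desc,
       d.recovery_model_desc,
       mf.name             AS logical_file_name,
       mf.physical_name,
       mf.size * 8 / 1024  AS size_mb   -- size is stored in 8-KB pages
FROM sys.databases AS d
JOIN sys.master_files AS mf
    ON d.database_id = mf.database_id
ORDER BY d.name;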

Information in the final report will contain the detailed drill-down data along with recommendations and best practices, as assessed by the DBA on the project. The main areas studied, some of which will require further drill-down and analysis, in this initial “collection and analysis” period can be categorized as follows:
  • State of Health

  • Configuration and Server Information

  • Performance, CPU, and Memory

  • Security

State of Health can be broken down as follows:
  • Disk space full/used/percentage free

  • Database log file usage

  • Database status check (online, in-recovery, suspect, read-only, and so on)

  • Check all service statuses (such as clustering, replication, service broker, log shipping, database mirroring) to verify that they are running correctly

  • Uptime status (how long the server has been online and running)

  • Error log checking for severity 17 and greater

  • Error log check for 701 insufficient memory errors and I/O requests taking longer than 15 seconds

  • Error log check for disk/subsystem errors 823, 824, and 825

  • Last database backup and DBCC CHECKDB times

  • List of scheduled jobs (backups, maintenance, and so on)

Configuration and Server Information includes:
  • OS build, settings, IsClustered

  • Database sizes, parameters, settings

  • SQL Server settings, configurations

  • Memory settings/AWE/target versus total memory/VAS/MemToLeave

  • CPU processor/parallelism settings

  • Virtual page size

  • Active traces/default

The Performance, CPU, and Memory category encompasses a great many items:
  • Memory usage/stats

  • Buffer cache hit ratio

  • Page life expectancy

  • I/O activity/databases using the most I/O

  • CPU usage/processors

  • Causes of the server waits

  • Index information

  • Count of missing indexes, by database

  • Most important missing indexes

  • Unused indexes

  • Most costly indexes (high maintenance)

  • Most used indexes

  • Most fragmented indexes

  • Duplicate indexes

  • Most costly queries, by average I/O

  • Most costly queries, by average CPU

  • Most executed queries

  • Queries suffering most from blocking

  • Queries with the lowest plan reuse

  • Top stored procedure usage by execution count

  • Top stored procedure usage by I/O pressure

  • Top stored procedure usage by CPU pressure

  • Top stored procedure usage by memory pressure

And finally Security is broken down as follows:
  • Security mode

  • Super admin (sa) access check

  • Weak password/blank/reverse passwords

  • Database/DBO/user rights (SOX/HIPAA compliance)

  • List of logins

  • Information about remote servers (linked servers)

Table 2-1 details the overall general categories and subcategories of the 15-point health check that I employ at various clients.
Table 2-1.

15-Point Health Check and Subcategories

  • General Server Info
  • Server Instance Properties
  • SQL Server Configuration Parameters
  • SQL Memory Usage Stats
  • Cache Statistics
  • Error Captures (in last 24 hours)
  • Logical Disk Volumes
  • Historical CPU Usage
  • Database Parameters
  • Database Size Summary and Distribution Statistics
  • Database Recovery Model
  • Database Log Utilization
  • Database Backup Information
  • SQL Job Information
  • Long-running Jobs and Average Runtime
  • Maintenance Plan Jobs
  • I/O Activity
  • I/O Activity Ranked by Database
  • I/O Activity Ranked by I/O Stalls
  • Top Server Waits
  • Historical Blocking Information
  • Index Usage and Statistics
  • Unused Indexes
  • Most-Used Indexes
  • Logical Index Fragmentation
  • Most-Costly Indexes
  • Top Query Performance
  • Most-Executed Queries
  • Most-Costly Queries by Memory Usage
  • Most-Costly Queries by I/O Volumes
  • Most-Costly Queries by CPU
  • Security Checks
  • Windows Accounts with Admin Privileges
  • Database and User Rights, Including DBO
  • Remote Servers

Virtually Speaking…

Many companies have made virtualization a major part of their deployment and disaster recovery strategy. Virtualization technologies have come to the point where production SQL Server instances are commonly being provisioned as virtual machines. There is still some debate about whether virtualization can support SQL Server efficiently, although recent developments have narrowed this gap. There are many benefits to virtualization, including administrative management, resource management, high availability, recovery, streamlined provisioning of SQL Server instances, and savings on rack space, hardware, power, and cooling.

However, it doesn’t make the DBA’s job any easier. When there is a SQL Server performance issue, the DBA is the first person who is approached. Unfortunately, virtualization hides some performance issues from view. This is a case where you need to work with your virtual machine (VM) administrator because in most instances the VM host is not visible or accessible to the DBA. Therefore, the DBA must be aware of how virtualization works, as well as the specific performance metrics to pay attention to. Many of the SQL Server best-practice recommendations for physical servers don’t necessarily work well for virtual servers.

When I first encountered virtualization, it wasn’t quite up to par yet, so it was used to rapidly provision and deploy test and development servers. You could test new features, service packs, hotfixes, code, and so on. As part of my disaster recovery strategy, the limits of virtualization were then stretched by using a process to move servers from a physical box to a virtual machine, otherwise known as physical to virtual (P2V). My company’s physical production servers resided in the NYC data center, and the virtual infrastructure was across the river in New Jersey at the colo site. The hardware was backed by blades running the latest VMware platform of the day. In the event of a DR scenario, the company would temporarily operate at a slightly degraded performance level. This made virtualization a perfect component for my DR plan.

A significant problem early on with SQL Server performance and virtualization was that the additional layer (the hypervisor) between the OS/SQL Server and the hardware (memory, CPU, and disk) in and of itself was the cause of latency unacceptable to high-transaction online transaction processing (OLTP) systems. Recent developments have significantly improved the ability for SQL Server to scale and perform optimally on a virtual platform and have reduced the overhead associated with the translation layer. Second Level Address Translation (SLAT) is one technology that supports virtual-to-physical memory address translation, which can significantly reduce CPU time and conserve memory resources for the VM.

Each processor vendor has its own name for this technology: Intel’s Extended Page Tables (EPT) and AMD’s Rapid Virtualization Indexing (RVI). The performance penalty is reduced by allowing the host to directly handle the mapping between the virtual memory used in the VMs and the physical memory in the virtualization host.

Although virtualization technologies have improved by leaps and bounds, it does not mean that every SQL Server machine should be virtualized. In some cases, such as a SQL Server cluster, physical servers may be a better option, even though it is possible to cluster across virtualized environments. Again, there is some debate about virtualizing your SQL clusters among SQL Server pros, but as the technology advances, the use cases do become more attractive. Microsoft SQL Server 2008 R2 and newer versions support advanced capabilities for high availability, such as guest and host clustering.

The reasons against such an implementation have more to do with who will manage the cluster failover than with the technology itself, and this is not to minimize the DBA’s role here. The advances in virtualization will overcome the existing technical complexities.


Do not cluster your virtual SQL Server nodes on one VM host because this is a single point of failure (SPOF).

If you do decide to set up SQL Server failover clustering in a virtual environment, make sure it is clustered across different physical VM hosts. If you’re a DBA, it may seem that I am stating the obvious; nonetheless, I recently came across a client with multiple SQL Server instances that were clustered on the same VM host. This is an obvious single point of failure and defeats the purpose of a high-availability cluster, so be aware in your DBA travels.

The virtualization debate, however, is beyond the scope of this book. What is relevant here is that the DBA managing a SQL Server environment where several instances, if not all, are virtualized needs to adhere to certain best practices and performance measures.

Let’s come back to what you need to know as a DBA to get healthy SQL performance. If you have a SQL Server instance that is virtual, you have an additional layer of abstraction (in other words, the hypervisor) between you and your data. The host’s allocations of resources are often the cause of unseen performance issues that are not apparent by observing the VM guest. The trouble is that you need to know which layer is causing you performance issues. Is it the host? Or is it the guest? As such, you need to rely on the VM performance counters to get a complete picture of what is happening.

Memory Ballooning

Memory ballooning is a memory management technique that allows a physical host to take advantage of unused memory on its guest virtual machines. This is bad for SQL Server!

If the VM host attempts to retake the memory, there is an option available to reserve the guest memory assigned to individual virtual machines, called “Reserve all guest memory (All locked).” By enabling this feature, located under the Virtual Hardware settings for the virtual machine (vSphere 5.5), the selected virtual machine is not impacted, as shown in Figure 2-6. In the event that this option is not enabled, there may be a case for enabling the Lock Pages In Memory option in SQL Server, which is under DBA control.
Figure 2-6.

VM configuration for guests: “Reserve all guest memory (All locked)”

If your production virtual SQL Server instances are located on shared hosts, it is recommended that this feature be enabled on your virtual machines. Since the DBA typically cannot see how the host is configured, the DBA must consult with the VM administrator to ensure it is enabled. The last thing you want is for any process to take away memory from one of your SQL Server instances. Although this is the intended function of the memory balloon driver, you don’t want it to happen, because it will have a negative impact on SQL Server performance.

Over-allocation of Memory/CPU

It’s OK to want to over-allocate your memory and CPU resources. What you don’t want is to have them over-committed because that’s where performance issues manifest themselves.

CPU Ready is a metric that measures the amount of time a VM is ready to run but waiting for access to the pCPU (the physical CPU). In other words, CPU Ready indicates how long a vCPU (virtual CPU) has to wait for an available core when it has work to perform. So, even when overall CPU utilization is not reported as high, if the CPU Ready metric is high, your performance problem is most likely related to the CPU. It is one of the essential VM-specific counters to watch.

Nonzero Balloon (KB) values indicate that the host cannot meet its memory requirements and are an early warning sign of memory pressure on the host. The balloon driver is installed via VMware Tools onto Windows and Linux guests, and its job is to force the operating systems of lightly used guests to page unused memory back to ESX so that ESX can satisfy the demand from hungrier guests.

In the next chapter, I will discuss and call attention to some key performance counters to monitor, as they relate to virtualized SQL Server instances.

Best Practices: Says Who?

Most often, when a new DBA comes on the job or is interested in performance tuning a slow-running SQL Server, they seek out information on best practices. So, what exactly is a best practice, who is the authority, and who’s to say that their best practices are your best practices? Where do these “best practices” come from?

This is especially true when you’re looking for specific thresholds with performance counters. When setting these up, it is easy to misconstrue which numbers are good and bad for SQL Server. The truth of the matter is that these numbers don’t always tell the whole story. They need context, and they need correlation. I’m not going to list commonly accepted, and in many cases folklore, best-practice numbers. Instead, I am going to talk about how to establish your own best practices based on the performance of your SQL Server instances.

There are many common misconceptions when it comes to setting a number in stone. Often the benchmark numbers are based on specific tests, in specific environments, on specific hardware. Instead, this book is going to talk about best practices in general and how they are all related to your healthy SQL environment. Obviously, you want to adhere to best practices in order to achieve this and create best practices of your own for your company’s needs. Therefore, the best thing to remember about best practices is that they are not always best.


The best thing to remember about best practices is that they are not always best.

Some Not-So Best Practices

Beware of bad practices often disguised as best practices. The information superhighway, as the Internet used to be called, contains a lot of misinformation, roadblocks, and confusing signs. I want to dispel these notions here to emphasize that not all best practices should be blindly followed. As I discussed earlier in the chapter, only when you monitor your own SQL Server instances and see the numbers stray from the baseline should you consider that a potential performance issue is developing.

Typical examples that many folks incorrectly accept as best practices concern the Page Life Expectancy (PLE) performance counter, the CXPACKET wait type, and tempdb configuration. PLE deals with how long you expect a data page to remain in the buffer cache in memory, on average. The number is represented in seconds, so the special number everyone talks about is 300, which is 5 minutes.

Think about what that threshold implies: at a PLE of 300, as new pages are read into memory, the pages being replaced (in essence, the entire SQL Server buffer cache) are flushed every five minutes. Especially nowadays, with servers carrying large amounts of RAM, that’s a lot of pages being flushed, and it’s pretty risky to wait until your PLE drops anywhere near that number.
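Rather than anchoring on 300, sample PLE on your own server and trend it against your baseline; a minimal sketch of checking the current value:

-- On NUMA hardware you will see one row per buffer node
-- in addition to the overall Buffer Manager value
SELECT RTRIM(object_name) AS object_name,
       cntr_value         AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';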


Some commonly accepted best practices are myths.

Another misunderstanding concerns one of the most commonly occurring wait types, CXPACKET. This particular wait type often comes to the top of the wait list when querying the sys.dm_os_wait_stats DMV. (Wait types and this DMV will be discussed in more depth in  Chapter 3.) This wait type is related to parallelism and parallel query processing. Many uninformed folks who see this show up as a top wait on their SQL Server automatically think that this is a problem. What follows from that line of thought are two wrong assumptions. The first is that a top CXPACKET wait in fact indicates a parallelism performance problem. The second is the oft-repeated recommendation to turn parallelism off.

In reality, the nature of an OLTP system is to manage query processes in which each process is always waiting on some resource at some point in time in order to run. The key question is, how long is it waiting? The fact that you see CXPACKET appear as your top wait is not necessarily a problem, unless the wait times are excessive.

Whether this wait is a problem or not depends on the type of system. In an OLTP environment, excessive CXPACKET waits can affect the throughput of other OLTP traffic. In a data warehouse environment, CXPACKET waits are expected for multiple-processor environments.

Figure 2-7 shows sample output of the sys.dm_os_wait_stats view where CXPacket is among the top waits.
Figure 2-7.

The top five waits from the sys.dm_os_wait_stats DMV
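A query along the following lines produces the kind of ranking shown in Figure 2-7; for brevity, this sketch does not filter out the benign system wait types that a production script would exclude:

SELECT TOP (5)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;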

The longer the query is waiting for a particular resource, the slower performance is perceived by the end user. While a query, thread, or process is waiting (the differences in these terms will be defined in  Chapter 3), it sometimes holds up other queries, threads, and processes waiting to execute. This holdup creates the performance bottleneck. What to do next is the foundation for  Chapter 3, which covers the methodology of waits and queues. In that chapter, you will learn one of the most effective performance troubleshooting techniques.

One final example of misconstrued best practice is the configuration of the system database tempdb. The name itself is a misnomer: without this database, SQL Server will not start up. This system database is basically a work-and-sort space for database operations. Temporary tables, table variables, temporary stored procedures, and cursors created by the user, as well as internal objects created by SQL Server, all live here. tempdb is also used for row versioning, snapshot isolation, and other snapshot operations, such as when executing certain DBCC commands. All of these objects are dropped once the session is disconnected.

The function of tempdb is critical to the overall performance of SQL Server, and the size, placement, and number of tempdb files directly affect performance. That is where urban legend kicks in as database professionals seek guidelines on sizing, disk placement, and the number of tempdb data files to create. An outdated recommendation was to create one additional tempdb data file for each processor core on the system.

With an ever-increasing number of physical processors, multicores, and hyperthreading, this recommendation will not give you any significant performance gains and becomes a point of diminishing returns. It may even degrade performance.
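Whatever file count you settle on, first check how tempdb is actually laid out today; a quick sketch:

SELECT name, physical_name,
       size * 8 / 1024 AS size_mb   -- size is stored in 8-KB pages
FROM tempdb.sys.database_files;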

It Depends

With all the confusion out there as to what are best practices, the best answer to this is “it depends.” It all depends on your environment and your server’s specific workload. There are several factors that come into play. The purest form of best practices comes from your own testing, trending, baselining, and benchmarking. The numbers themselves matter little if they are near or trending at your typical baseline.

As technical as performance tuning will get, it is truly more art than science and experience over training. You don’t need to completely throw out what are commonly accepted as industry best practices. You can use these as a guideline for your own testing. Some of them may be more applicable than others. What this section aims to do is arm you with knowledge on how to establish your own best practices, rather than spoonfeed you theoretical numbers that may not apply. Once you get through this book, you ideally will have created your own best practices to follow.

So, one way to establish your best practices is to compare the data you have collected during your health check. This is the starting point for determining your ideal numbers. Your baseline data becomes the point of reference for determining whether your server is “healthy.” All the measures and statistics that are captured at or near your baseline in essence become your best-practice standards. These standards should in turn be documented as a guide of best practices for your company to follow and assist with future SQL Server builds and deployments.

Perhaps if you are not yet comfortable or ready to create your own best-practice documentation, you can start looking at authoritative and reliable resources. (See some of my common best-practice recommendations at the end of this chapter.) Industry best practices and recommendations from Microsoft are a good place to start. Its white papers, knowledge base articles, TechNet, and MSDN are solid references, and I’ll tell you why. Though some of their information may not always be 100 percent accurate, the documentation goes through a fairly rigorous technical-editing process before being released to the public.

Because of Microsoft’s high industry reputation and obligation to the customer, its information and documentation are vetted and validated. Even further, you can expect the group of Microsoft SQL Server MVPs, whose expertise with the product line is often unmatched, to always be reviewing, validating, and correcting any technical errors or omissions. Indeed, the MVP community maintains a symbiotic relationship between Microsoft and the customers.

As a final point, with respect to establishing your best practices, you will soon see how it all gels together. Your inventory and baseline performance data becomes your documentation, which allows you to establish internal best practices, which in turn become your standards guide. Finally, you will be able to publish an operations manual that will become your run book. This term is discussed in the next section.

Your documentation and run book will be the key to a certifiably happy and healthy SQL Server infrastructure and something you can immediately produce in the event of an audit. Your managers will be happy, your colleagues will be happy, your auditors will be satisfied (but always grumpy), and you will be a happy database professional and exceptional DBA, worthy of praise, accolades, and, most important, a raise!

Run Book

The road map now brings me to what will become one of the most essential pieces of documentation the organization will ever have with respect to your IT SQL Server infrastructure, collated into one book: the run book. When I refer to the run book here, it can be, but is not necessarily, an actual printed book. It is the collection of your documents, whether hard copy sitting on your desk, in the data center, or in a safe, or an electronic repository located on a network share, in a document management system, on SharePoint, or even in the cloud. The run book can be any format and contain Word documents, Excel spreadsheets, Visio diagrams, PowerPoint slides, and so on.

Far too few companies have a run book, and yours will be fortunate enough to have one on hand once you complete this book. The run book in essence will be the organization’s SQL Server instruction and operations manual, and such a book will provide the keys to the kingdom.

Now that I have built up the significance of the run book, let me define its purpose. The run book can have multiple uses. Wikipedia defines an IT run book as “a routine compilation of procedures and operations that the system administrator or operator carries out.” Indeed, the run book is used as a reference by the interrelated IT teams, DBAs, help-desk staff, and network operations centers (NOCs), internally or externally. It stands alone as something that can be used as a training tool and to get junior staff and DBAs up to speed. The run book can also be used by project and database consultants that come in for a short-term assignment and need to quickly get up to speed on the logistics and technical specifications of the SQL Server environment.

So, what if you’re a new DBA on the job? Wouldn’t it be great to be able to refer to an all-encompassing run book that contains inventory, checklists, diagrams, performance data, and procedural and operational guides? The first thing I ask for at a new job or project is existing documentation. More often than not, there is no existing documentation, which is unfortunate. When some forms of documentation are available, usually they are not very good or useful. If you are lucky enough that a previous DBA or IT professional was mindful to put together some useful and valid documentation, consider yourself ahead of the game.

Nonetheless, this awesome responsibility will more than likely fall on your shoulders, and this book aims to make you aware of the significance that detailed documentation can make. It can make or break an organization’s ability to troubleshoot issues, ensure business continuity, and recover quickly from a disaster, as well as pass an audit.

In addition, the run book will aid in performance tuning, as well as allow you or help-desk staff to triage issues through step-by-step instructions, decision trees, and contingencies. A good example is the DBA who is on call, especially where there are escalation procedures in place. If a technically capable help-desk admin is manning the operations center and is the first call, a run book can allow that person to troubleshoot and resolve simple issues.

For example, if a simple script can be run to restart a failed job, truncate a transaction log, or rerun a backup, why escalate the issue up the chain? A run book can empower the junior technicians to grow their experience, technical know-how, and careers. It makes the difference between a call center where tickets are logged and delegated and a first-class, quality technical support center. This has positive benefits for all involved, including the organization as a whole.

Having thorough documentation is essential to building an efficient SQL Server infrastructure and is key to healthy SQL. You don’t have to be the best writer (you’re not writing a best-selling novel here); you just need to write things down in an organized way. And while it may be a daunting task, it is an undertaking that is well worth the effort. From a career perspective, it will mean the difference between an ordinary DBA and an exceptional DBA.


Career-wise, having a run book in your organization will mean the difference between an ordinary DBA and an exceptional DBA.

Keep in mind that the data is out there for the taking, and you must make sure you collect the right data and the right amount of data to allow you to make the proper analysis. Having too much data or not enough data will lead you to bad decisions. Analyzing the data will lead you to make decisions that will affect your SQL Server instances and impact performance. Therefore, you must be careful not to misinterpret the data and its meaning. Moreover, if you have all this data at your disposal and choose to ignore it, that in and of itself is a bad decision.

Now that you know about the run book, the information you need to collect, and the need to create a baseline, I will give you the knowledge, the tools, and the methods to collect this data throughout the rest of this book.

To sum up, having a run book will assist you with the following:
  • Training

  • Procedural operations and contingencies

  • Performance tuning

  • Troubleshooting

  • Standardization

  • Disaster recovery

  • Audit

Road Map Essentials

I’ve talked about all the essential components involved in creating a healthy SQL Server environment road map. Here’s a review of the items necessary for conducting a healthy SQL health check:
  • Checklist

  • Inventory

  • Statistical terms: baseline, delta, deviation, and so on

  • Metadata

  • Collecting data

  • Comparing data

  • Performance life cycle

  • Testing life cycle

  • Best practices

  • Documentation

  • Run book

The road map for healthy SQL is now complete. Here I have given you the foundation of what you need to know to move toward a healthy SQL Server Infrastructure. You’re ready to move forward to understanding and using the methodologies you will employ.

Rob’s Bonus Best Practice Considerations

Here are ten quick best-practice considerations (there are tons of them throughout this book, so this is just a preview) of things to consider when provisioning and deploying a new SQL Server. I don’t make any specific recommendation because, as always, “it depends,” but these areas should be considered for optimal performance at the server level when first deploying and configuring a new SQL Server instance. A brief configuration sketch for two of these settings follows the list.
  • The Disk Layout for SQL Server should ideally separate Data, Log, tempdb, Binaries, and SQL Backups to separate physical disks.

  • You should consider enabling LockPagesInMemory for both physical and clustered SQL Servers.

  • Enable Instant File Initialization by assigning the Perform Volume Maintenance Tasks Windows policy to the SQL Server service account.

  • Always set the SQL Server configuration options Maximum Server Memory and Minimum Server Memory.

  • Exclude all data, log, backup, and SQL Server files from antivirus scans.

  • The rule of thumb for the number of Tempdb Files to create for a SQL Server is the number of Logical Cores/4.

  • Use the startup trace flag -T1118 to reduce allocation contention in the SQL Server tempdb database.

  • Enable the SQL Server configuration setting Optimize For AdHoc Workloads for better query performance.

  • Increase the SQL Server configuration setting Cost Threshold for Parallelism to specify the threshold at which SQL Server creates and runs parallel plans for queries.

  • Consider using native backup compression for your SQL Server backups.
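As a sketch of how two of the instance-level settings above are applied (the threshold value is a placeholder to be validated against your own baseline, not a recommendation):

-- Illustrative only; test and baseline before applying to production
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;   -- 1 = enabled
EXEC sp_configure 'cost threshold for parallelism', 25; -- placeholder value
RECONFIGURE;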
