Wait Stats – Monitoring and Using


sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.

If, however, you knew what the values were 30 minutes ago and what they are now then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30 minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline so you can say if what you’re seeing now is just normal for the server or if you’re seeing a true incident. Remember, the only way to know if numbers on a server are a problem are not is to know what’s normal for that specific server.

A half-hour increment is a bit much for me, and I would prefer to know it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially on when looking into an incident on a Monday where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.

To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.

After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.

Here are the tables; I comment out stuff if I would cry if I accidently ran it at the wrong time.

/*
IF OBJECT_ID('WaitStats') IS NOT NULL BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats (
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast (
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO

Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.

Now to populate the tables. This script will need to be ran on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.

IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL BEGIN
    DROP TABLE #WaitStatsNew 
END

SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count 
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms 
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    , max_wait_time_ms = n.max_wait_time_ms --It's a max, not cumulative
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

TRUNCATE TABLE WaitStatsLast 

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew 

DROP TABLE #WaitStatsNew

Finally, querying it. Give it a descent time frame, I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.

DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 
--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger  
INSERT INTO #IgnoredWaits 
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK' 
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart 
    AND DateEnd <= @dEnd 
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post.  However, there is an Accidental DBA book out there that explains this amazingly well.  Here are links to the FREE PDF Version and the Physical Book.