sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.
If, however, you knew what the values were 30 minutes ago and what they are now, then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30-minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline so you can say if what you’re seeing now is just normal for the server or if you’re seeing a true incident. Remember, the only way to know if numbers on a server are a problem or not is to know what’s normal for that specific server.
A half-hour increment is a bit much for me, and I would prefer to know it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially when looking into an incident on a Monday, where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.
To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.
After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.
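A minimal sketch of what that trim step could look like, assuming the WaitStats table from this post and a 13-month retention window. The batch size of 10000 and the @Cutoff variable are illustrative choices of mine, not part of the original setup, so adjust them to fit your server.

```sql
-- Hypothetical cleanup step: delete rows older than 13 months.
-- Deleting in batches keeps each transaction (and its locks) small.
DECLARE @Cutoff datetime

SELECT @Cutoff = DATEADD(Month, -13, GETDATE())

WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM WaitStats
    WHERE DateStart < @Cutoff

    -- Stop once there is nothing left to remove
    IF @@ROWCOUNT = 0 BREAK
END
```

Because the clustered index leads with DateStart, the oldest rows sit together at the front of the table, which should keep this delete fairly cheap.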
Here are the tables; I comment out stuff if I would cry if I accidentally ran it at the wrong time.
/*
IF OBJECT_ID('WaitStats') IS NOT NULL
BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL
BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats
(
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast
(
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO
Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.
Now to populate the tables. This script will need to be run on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.
IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL
BEGIN
    DROP TABLE #WaitStatsNew
END

--Snapshot the current cumulative values along with the capture time
SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

--Save the difference between this snapshot and the last one
--If the last snapshot is from before the most recent restart, it is ignored
INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    --It's a max, not cumulative
    , max_wait_time_ms = n.max_wait_time_ms
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type
        AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

--Replace the saved snapshot with the new one for the next run
TRUNCATE TABLE WaitStatsLast

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew

DROP TABLE #WaitStatsNew
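If you go the proc route, scheduling it could look something like the sketch below. It assumes the script above has been wrapped in a procedure called dbo.CaptureWaitStats living in a database called DBA; both names, along with the job and schedule names, are hypothetical placeholders. The msdb procedures themselves are the standard SQL Server Agent ones.

```sql
-- Assumption: dbo.CaptureWaitStats (in a database named DBA) wraps the
-- populate script. Both names are hypothetical; change them to match yours.
USE msdb
GO

EXEC dbo.sp_add_job
    @job_name = N'Capture Wait Stats'

EXEC dbo.sp_add_jobstep
    @job_name = N'Capture Wait Stats'
    , @step_name = N'Run capture proc'
    , @subsystem = N'TSQL'
    , @database_name = N'DBA'
    , @command = N'EXEC dbo.CaptureWaitStats'

-- Daily schedule that repeats every 5 minutes
EXEC dbo.sp_add_schedule
    @schedule_name = N'Every 5 Minutes'
    , @freq_type = 4            -- daily
    , @freq_interval = 1        -- every 1 day
    , @freq_subday_type = 4     -- subday unit is minutes
    , @freq_subday_interval = 5 -- every 5 minutes

EXEC dbo.sp_attach_schedule
    @job_name = N'Capture Wait Stats'
    , @schedule_name = N'Every 5 Minutes'

-- Target the local server, otherwise the job never runs
EXEC dbo.sp_add_jobserver
    @job_name = N'Capture Wait Stats'
GO
```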
Finally, querying it. Give it a decent time frame; I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.
DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL
BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY)

--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger
INSERT INTO #IgnoredWaits
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK'
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart
    AND DateEnd <= @dEnd
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC
The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post. However, there is an Accidental DBA book out there that explains this amazingly well. Here are links to the FREE PDF Version and the Physical Book.
fyi, i got duplicate MISCELLANEOUS waittype, so the insert fails because of the index.
fyi, SQL Bug: https://connect.microsoft.com/SQLServer/feedback/details/490200/sys-dm-os-wait-stats-has-duplicate-miscellaneous-entry-and-isnt-affected-by-group-by-or-distinct
Thank you. I’ll probably just take the unique constraint off since it’s either that or add my own uniquifier.
It appears that this bug has been fixed in 2010, probably with 2008 SP2 and 2008 R2 SP1. However, I did remove the unique from both clustered indexes involved. As I don’t have a server running at the affected patch level, can you verify that this resolves the issue?
Also, thank you for letting me know about this. My goal is that anyone who cares about wait stats and doesn’t have decent monitoring software can have something they can use without worrying about issues like this.
Wouldn’t you want to calculate average wait time per ms per wait type rather than total wait time?
All the information is there. You know when the interval started and ended so it’s simple math to get the wait per ms.
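For anyone who wants that math spelled out, here is a rough sketch against the WaitStats table from the post. The wait_ms_per_sec and avg_ms_per_wait column names are made up for this example. It works in seconds rather than milliseconds, because DATEDIFF in milliseconds overflows on long intervals such as the first row after a restart.

```sql
DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

SELECT DateStart
    , DateEnd
    , wait_type
    , wait_time_ms
    -- Milliseconds waited per second of elapsed time in the interval;
    -- NULLIF returns NULL instead of failing on a zero-length interval
    , wait_ms_per_sec = wait_time_ms * 1.0 / NULLIF(DATEDIFF(Second, DateStart, DateEnd), 0)
    -- Average milliseconds per individual wait in the interval
    , avg_ms_per_wait = wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0)
FROM WaitStats
WHERE DateStart >= @dStart
    AND DateEnd <= @dEnd
ORDER BY DateStart
    , wait_type
```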