Recently Recompiled Resource Hogs

It’s not too uncommon for a query to get a new execution plan that performs a lot worse than it could, and sometimes it’s bad enough to drag the whole server down to a halt.  When it’s something obvious such as a query going from 2 seconds duration to 30 seconds you’ll pick it up on a standard monitoring trace.  That’s not always the case, and the query that hits your server twice a second going from 100ms to 900ms can fly under the radar except for users saying the server isn’t as responsive as it was yesterday.

Before you go all out and throw up an intrusive trace that can slow down your server capturing an unfiltered sample of RPC:Completed and SQL:BatchCompleted to run aggregates on, see what the DMVs can tell you.  If you look in sys.dm_exec_query_stats you can get the total amount of resources used since the last recompilation, which is going to give you pretty much everything you need as long as you’re looking for a query that can stay in the cache.  The only trick is to look at everything in the same context.  Something that did 10,000,000 reads since it was compiled yesterday is nothing compared to something that did 1,000,000 reads since it was compiled 10 minutes ago.  Here’s my answer to this:

SELECT top 10 --WorkerTime_Sec = cast(round(total_worker_time / 1000 / 1000.0, 0) as Int)
    WorkerTimeSec_PerSec =  cast(cast(round(total_worker_time / 1000 / 1000.0, 0) as Int) / (Cast(DateDiff(second, Creation_Time, GetDate()) as DEC(20,2)) + .00000001) as DEC(20,2))
    --, LogicalReads_K = total_logical_reads / 1000
    , LogicalReads_K_PerSec = cast(total_logical_reads / 1000 / (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    --, ElapsedTime_Sec = cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)
    , ElapsedTimeSec_PerSec = cast(cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)/ (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    , Seconds_Since_Recompile = DateDiff(second, Creation_Time, GetDate())
    , Creation_Time
    , qs.execution_count
    , avg_reads = total_logical_reads / qs.execution_count
    , avg_elapsed_ms = total_elapsed_time / 1000 / qs.execution_count
    , avg_worker_ms = total_worker_time / 1000 / qs.execution_count
    , Offset = statement_start_offset
    , ObjectName = OBJECT_NAME(qt.objectid, qt.dbid)
    , qt.text
    , StatementText = SUBSTRING (qt.text, qs.statement_start_offset/2 + 1,
        abs(CASE WHEN qs.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2 
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1)
    --, qt.*
    --, qs.*
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE Creation_Time > DateAdd(Hour, -24, GetDate())
    AND (creation_time < DATEADD(SECOND, -60, GETDATE()) or execution_count > 5) --Has to be a minute old or executed 6 times before I consider the data stable
    --AND (total_worker_time > 20000 or execution_count > 15000)
ORDER BY 1 DESC 

I comment out what I don’t typically use, same as always with me.  Those pieces don’t usually change.  The only thing that does typically change is the date filter on the creation time; if someone says there weren’t any problems this time yesterday then it stays at the default value, if they say it worked fine before lunch then it drops down to 4 hours.

There are four things to keep in mind with this DMV.  Everything is in microseconds which I don’t use, so I “fixed” the problem in my query.  It will only hold data while the query is in cache, which means it’s never tracked here if it’s never in cache.  Each row is for a single plan for a single statement, although  you could easily aggregate it if you wanted to.  Also, it’s possible for the worker time or duration per second to be over 1 due to multiple simultaneous runs or, with worker time, parallelism can cause this.  I typically try not to restate what you should be reading on MSDN or BOL, but these aren’t things everyone expects.

For me to run this, I typically have to already have a complaint that a server is running slower than expected. Check out my Server Running Slow post to see everything else that runs along with this.

White papers, DMVs and Monitoring Concepts

According to Jason Strate (SQL Server MVP) there are 5 white papers ever SQL Server DBA should read. Considering my blogging is focused on how to monitor and tune previously installed servers, we’ll focus on the first two of these white papers: Troubleshooting Performance Problems in SQL Server 2008 and SQL Server 2005 Waits and Queues. Ignore the age of these papers, after all you’re running mostly the same code to do mostly the same tasks in SQL Server 6.5 and SQL Server 2012…well, except for the lack of DMVs in 6.5.

I don’t agree with them 100%, but not to the point that I don’t agree 100% that everyone should read them. The sources of the information, the uses of the DMVs, all of it is what you’ll need to know to progress as a DBA. Ask any DBA III or higher and they’ll be able to tell you most of what’s said in these papers, even if they went the hard route and made it to where they are without reading them. Where they fall short is by giving static values that aren’t worded as just starting points, and the trending they do is the trending you can do as a Microsoft Support Engineer where nothing is saved or long-term.

My differences range from minor complaints such as me feeling that a blanket statement saying Page Life Expectancy under 300 seconds is when you should start to worry. First, it’s a blanket statement, and I’m a firm believer in that the word always means you’re almost always wrong. Second, servers have more memory now and my SAN team would kill me if I told them I would have to read 500 GB of data every 300 seconds on some of my bigger servers. I do agree that there are times you should worry about PLE, but you have to keep in mind two things; how much memory are you cycling through in that time and what’s normal for this specific server.  I just hope that regular drops down to 3,000 are closer to the point that grabs your attention.

Lets build off of that last big there…What’s normal for this specific server? The queries they’re giving you are snapshots of what info the server has for you right now. Some of that is what’s happening now, some is what has happened since the last restart of SQL services, and other pieces fall somewhere in the middle. You can’t tell me that I’ve done X amount of work since the server was rebooted two weeks ago and expect me to tell you if that’s an issue. You also can’t tell me that “This proc has used more CPU than any other by far since that reboot two weeks ago” and expect me to tell you if that’s a critical issue or an intense, yet normal and expected, off-hours maintenance task. You can, however, tell me that on a typical Monday between 2:00 PM and 5:00 PM we do X, but this Monday we’re doing 100 times that while the business has no reason to say this Monday should be any different. Now we’re on to something…we know there’s an issue, and we just found our starting point to solving it. Chances are if you’re reading this then either you or your immediate manager would be able to create a database to keep this trending information if it doesn’t exist already, and you’re making it harder on yourself if you don’t.

These two white papers are almost 200 pages long together, so I’m not going to pretend to cover them in one post. Actually, I’m hoping my next 50 posts might cover most of them. The points I’m going for are that these white papers are telling you the perfect stats to watch, and these stats are going to be more meaningful when you monitor them. I’ve started this in some of my posts such as looking at Wait Stats and Blocking, but there’s a long way to go. More than these two DMVs need to be watched like this, but they are good examples of wait stats capturing cumulative data and blocking capturing point-in-time data from DMVs.

If you see a DMV in these papers, you’ll see it in my posts. If it’s not there yet, it will be. Some like dm_db_exec_sessions will never have a dedicated post unless you count Running Processes which links it to several DMVs. Others like dm_os_performance_counters might not fit comfortably in a single post. The point here is that all the DMVs, especially the ones mentioned here, are worth looking into more.

Read the papers. Always go beyond the question of “what should I do”, get to the “why should I do it”, and strive for “how can I make this mean more and be more useful”. Never take my scripts or anyone else’s as-is. Tinker with them, really learn the DMVs, and it will all start to come together.

Wait Stats – Monitoring and Using

sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.

If, however, you knew what the values were 30 minutes ago and what they are now then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30 minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline so you can say if what you’re seeing now is just normal for the server or if you’re seeing a true incident. Remember, the only way to know if numbers on a server are a problem are not is to know what’s normal for that specific server.

A half-hour increment is a bit much for me, and I would prefer to know it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially on when looking into an incident on a Monday where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.

To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.

After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.

Here are the tables; I comment out stuff if I would cry if I accidently ran it at the wrong time.

/*
IF OBJECT_ID('WaitStats') IS NOT NULL BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats (
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast (
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO

Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.

Now to populate the tables. This script will need to be ran on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.

IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL BEGIN
    DROP TABLE #WaitStatsNew 
END

SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count 
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms 
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    , max_wait_time_ms = n.max_wait_time_ms --It's a max, not cumulative
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

TRUNCATE TABLE WaitStatsLast 

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew 

DROP TABLE #WaitStatsNew

Finally, querying it. Give it a descent time frame, I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.

DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 
--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger  
INSERT INTO #IgnoredWaits 
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK' 
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart 
    AND DateEnd <= @dEnd 
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post.  However, there is an Accidental DBA book out there that explains this amazingly well.  Here are links to the FREE PDF Version and the Physical Book.