White papers, DMVs and Monitoring Concepts

According to Jason Strate (SQL Server MVP) there are 5 white papers ever SQL Server DBA should read. Considering my blogging is focused on how to monitor and tune previously installed servers, we’ll focus on the first two of these white papers: Troubleshooting Performance Problems in SQL Server 2008 and SQL Server 2005 Waits and Queues. Ignore the age of these papers, after all you’re running mostly the same code to do mostly the same tasks in SQL Server 6.5 and SQL Server 2012…well, except for the lack of DMVs in 6.5.

I don’t agree with them 100%, but not to the point that I don’t agree 100% that everyone should read them. The sources of the information, the uses of the DMVs, all of it is what you’ll need to know to progress as a DBA. Ask any DBA III or higher and they’ll be able to tell you most of what’s said in these papers, even if they went the hard route and made it to where they are without reading them. Where they fall short is by giving static values that aren’t worded as just starting points, and the trending they do is the trending you can do as a Microsoft Support Engineer where nothing is saved or long-term.

My differences range from minor complaints such as me feeling that a blanket statement saying Page Life Expectancy under 300 seconds is when you should start to worry. First, it’s a blanket statement, and I’m a firm believer in that the word always means you’re almost always wrong. Second, servers have more memory now and my SAN team would kill me if I told them I would have to read 500 GB of data every 300 seconds on some of my bigger servers. I do agree that there are times you should worry about PLE, but you have to keep in mind two things; how much memory are you cycling through in that time and what’s normal for this specific server.  I just hope that regular drops down to 3,000 are closer to the point that grabs your attention.

Lets build off of that last big there…What’s normal for this specific server? The queries they’re giving you are snapshots of what info the server has for you right now. Some of that is what’s happening now, some is what has happened since the last restart of SQL services, and other pieces fall somewhere in the middle. You can’t tell me that I’ve done X amount of work since the server was rebooted two weeks ago and expect me to tell you if that’s an issue. You also can’t tell me that “This proc has used more CPU than any other by far since that reboot two weeks ago” and expect me to tell you if that’s a critical issue or an intense, yet normal and expected, off-hours maintenance task. You can, however, tell me that on a typical Monday between 2:00 PM and 5:00 PM we do X, but this Monday we’re doing 100 times that while the business has no reason to say this Monday should be any different. Now we’re on to something…we know there’s an issue, and we just found our starting point to solving it. Chances are if you’re reading this then either you or your immediate manager would be able to create a database to keep this trending information if it doesn’t exist already, and you’re making it harder on yourself if you don’t.

These two white papers are almost 200 pages long together, so I’m not going to pretend to cover them in one post. Actually, I’m hoping my next 50 posts might cover most of them. The points I’m going for are that these white papers are telling you the perfect stats to watch, and these stats are going to be more meaningful when you monitor them. I’ve started this in some of my posts such as looking at Wait Stats and Blocking, but there’s a long way to go. More than these two DMVs need to be watched like this, but they are good examples of wait stats capturing cumulative data and blocking capturing point-in-time data from DMVs.

If you see a DMV in these papers, you’ll see it in my posts. If it’s not there yet, it will be. Some like dm_db_exec_sessions will never have a dedicated post unless you count Running Processes which links it to several DMVs. Others like dm_os_performance_counters might not fit comfortably in a single post. The point here is that all the DMVs, especially the ones mentioned here, are worth looking into more.

Read the papers. Always go beyond the question of “what should I do”, get to the “why should I do it”, and strive for “how can I make this mean more and be more useful”. Never take my scripts or anyone else’s as-is. Tinker with them, really learn the DMVs, and it will all start to come together.

Blocking – Capturing and Monitoring

If a query is taking longer to run than normal, there’s a good chance it’s being blocked by something else. This is especially true when you’re doing something rediculously simple and SQL Server just sits there thinking. Symptoms of blocking problems include a trace of SQL:BatchCompleted and RPC:Completed with durations over 10 seconds comes back with results using less than a second of CPU time and very few reads and writes. Also, if you’re watching your wait stats, then you’ll start to see more waits that start with LCK_ than normal.  Not the mention the overly obvious blocking found in your Running Processes

This query will show you the blocking currently occurring on your server.

SELECT tl.resource_type
    , database_name = DB_NAME(tl.resource_database_id)
    , assoc_entity_id = tl.resource_associated_entity_id
    , lock_req = tl.request_mode
    , waiter_sid = tl.request_session_id
    , wait_duration = wt.wait_duration_ms
    , wt.wait_type
    , waiter_batch = wait_st.text
    , waiter_stmt = substring(wait_st.text,er.statement_start_offset/2 + 1,
                abs(case when er.statement_end_offset = -1
                then len(convert(nvarchar(max), wait_st.text)) * 2
                else er.statement_end_offset end - er.statement_start_offset)/2 + 1)
    , waiter_host = es.host_name
    , waiter_user = es.login_name
    , blocker_sid = wt.blocking_session_id
    , blocker_stmt = block_st.text 
    , blocker_host = block_es.host_name
    , blocker_user = block_es.login_name
FROM sys.dm_tran_locks tl (nolock)
    INNER JOIN sys.dm_os_waiting_tasks wt (nolock) ON tl.lock_owner_address = wt.resource_address
    INNER JOIN sys.dm_os_tasks ot (nolock) ON tl.request_session_id = ot.session_id AND tl.request_request_id = ot.request_id AND tl.request_exec_context_id = ot.exec_context_id
    INNER JOIN sys.dm_exec_requests er (nolock) ON tl.request_session_id = er.session_id AND tl.request_request_id = er.request_id
    INNER JOIN sys.dm_exec_sessions es (nolock) ON tl.request_session_id = es.session_id
    LEFT JOIN sys.dm_exec_requests block_er (nolock) ON wt.blocking_session_id = block_er.session_id
    LEFT JOIN sys.dm_exec_sessions block_es (nolock) ON wt.blocking_session_id = block_es.session_id 
    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) wait_st
    OUTER APPLY sys.dm_exec_sql_text(block_er.sql_handle) block_st

However, if you have time to watch your servers all the time and keep running that query then you’re overstaffed in a world that leans towards being understaffed. For way too many reasons to list here, throw the results of this into a table every minute. If someone says a query ran long and you see the server was waiting on locks, look here. If you want to be proactive and look for ways you can improve server performance, look here.

CREATE TABLE Blocking (
    BlockingID BigInt Identity(1,1) NOT NULL
    , resource_type NVarChar(60)
    , database_name SysName
    , assoc_entity_id BigInt
    , lock_req NVarChar(60)
    , wait_spid Int
    , wait_duration_ms Int
    , wait_type NVarChar(60)
    , wait_batch NVarChar(max)
    , wait_stmt NVarChar(max)
    , wait_host SysName
    , wait_user SysName
    , block_spid Int
    , block_stmt NVarChar(max)
    , block_host SysName
    , block_user SysName
    , DateAdded datetime NOT NULL DEFAULT (GetDate())
)
GO

CREATE UNIQUE CLUSTERED INDEX IX_Blocking_DateAdded_BlockingID_U_C ON Blocking
(
    DateAdded
    , BlockingID
) WITH (Fillfactor = 95)
GO

So, now you know what your blocking problems are. So, how do you fix them? That’s a deeper dive than I’ll be doing now, but here’s they key words to look into:

  • NOLOCK hint
  • Query tuning
  • Process timing
  • Eliminate cursors

Every situation is going to be unique, and I can’t pretend like I can give you all the answers. Finding that there is an issue can be difficult enough, and those are the answers I can help with.