Monitoring and Baselines Presentation

Here’s my full presentation for SQL Saturday #250 in Pittsburgh this past Saturday along with some notes on what I got out of it.

My Lessons

I learned a lot doing this, and I hope people learned a lot from it. For the people on the other side of the podium from me, the main lessons were in what I put in the abstract; they learned mostly about SQL Server. On my side of the podium though, the lessons were far from the same: public speaking, creating presentations, prepping for unknown audiences, and seeing the gratitude of the SQL Server community in person. Anything that was on the abstract I learned slightly better, but that didn’t compare to what wasn’t on it.

Public speaking was always a fear of mine, and I tend to be someone who avoids my fears. However, throwing myself out there at the public as a whole was never my thing either, and I’ve been doing half decent at it, at least in my own mind. That being said, I decided to go all out and push myself a little further. After all, what were the chances of them picking me to present with all those professional teachers, consultants, and MVPs out there throwing up their abstracts? Best case I could say I tried, worst case I was going to throw myself out of my comfort zone and hope for the best.

They picked me, which I didn’t let myself expect. I told everyone who knew I put in my abstract that I had a 50/50 chance at best, which was more optimistic than I really felt, but I don’t like people knowing when I’m being dismissive of myself. It turns out I was wrong a lot, and in every case I was wrong I was glad to be wrong.

Being Forced to Get Better

I’m not one to learn things for the sake of learning them, which is why I sucked in school. To really learn things I need a real-life use case, somewhere I’m going to apply it. I never practiced speaking before because I didn’t speak in front of large groups. I didn’t know how to put together a presentation properly because that’s not my thing…well, wasn’t is probably the better word now. Just like a couple months ago I wasn’t quite sure how to blog, but I’ve made it past that point.

Like everything in my life, blogging went from unknown to addiction quickly. I’m not sure that I can say the same thing about speaking, but I can see the possibility. SQL Saturday only comes to Pittsburgh once a year, and the local user group, which I’ll be joining soon, only meets once a month with a single speaker. However, I can’t say that I didn’t look up when the SQL Saturdays in Cleveland and Washington DC were. Also, I took in my own thoughts on my presentation along with the feedback I received and immediately started thinking about how I could make that presentation better.

What’s below is the original, not touched up at all in the last 4 days. In part that’s to show where I went wrong and how I’m going to fix it, and in part it’s because Monday was my first day at a new job. By the way, starting a new job right after presenting is the best and worst thing you could ever do to yourself. You don’t have time to be too nervous about anything because you’re too overworked, but you’re also too overworked. Eh, you can’t win them all.

Well, on to the presentation. What you have here is my script that I talked through and taught myself before the presentation. However, if you were there, you’ll notice this isn’t exactly what I said. I didn’t read it, I presented it. There were no demos (the biggest complaint from the crowd) to keep me from stumping myself, and the PowerPoint was just a whimsical picture at the start of each paragraph to keep me on track, because I knew I’d be nervous and lose my way without a paragraph-by-paragraph guide helping me along.

I must have done at least decent because my reviews came back with two 3’s and the rest 4’s and 5’s out of 5. You’re just asking me to get off topic and start talking about why I hate ratings that are odd numbers, but no, I’m not getting off topic, I’m getting on to my presentation!!!

The Presentation

Download slide deck here.

I’ve been working on databases for over a decade, and most of that without having good standards or monitoring in place. My job was rough and I was much less effective. Now I know what I’m doing and I want to share that knowledge. In this presentation I’m going to tell you what I watch and why I watch it. A presentation is a poor format for walking through how to do all of this; having it in writing is much more beneficial to you. That is why I’m pretty much skipping that portion here and diverting you to my blog at SimpleSQLServer.com. Not so much to promote a blog that I lose money on, but to give you the resources in the best format.

As a DBA you will hit problems with the performance of your databases. It doesn’t matter if it’s one process or across the board, or if it’s just today or it has always been that way. No matter what it is, the more you know about your servers the easier it’s going to be to fix them.

It’s not easy. There’s no single spot to watch, no single solution to all of your problems, and there are no “normal” values for these counters that you can apply across every server you manage (Note: Thank you, Mike John, you stressed this point a lot). I watch several things on all the important servers, and most of these on every server. The important part is that you watch them continuously, even when you aren’t expecting to use the data. Some parts are cumulative, and without history you can’t tell what happened during an incident versus what came from overnight maintenance. Other parts are snapshots, and there’s no looking back.

(Note: Now I’m on to the stuff Brian Castle taught me. He’s the best you could hope for in a manager, and last I checked he was still hiring at EDMC in Robinson Township near Pittsburgh, PA)

To me, monitoring and baselining are the same thing. I know others will do a specific baselining process on a specific day and keep that. I feel you lose too much doing that, and I watch enough to say that I have a continuous baseline going back 13 months on most instances. I’m not all that worked up about lucky numbers or anything, it’s just nice to be able to say “that annual process we ran last week ran like this last year”. If you’re superstitious or have OCD, 400 days makes me just as happy.

  • Traces – Snapshot
  • Wait Stats – Cumulative – Resets on restart
  • Blocking – Snapshot
  • Query Stats – Cumulative – Resets on recompile
  • Index Stats – Cumulative – Resets on restart
  • OS Perf Counters – Varies
  • Database Sizes – Snapshot
  • Table Sizes – Snapshot

Traces – What long running processes have completed on the server?

You have to be careful, this one can kill your server. I have, on more than one occasion (sorry, Brian), caused a large server to reboot in the middle of the day because I filled the drives writing 40 GB of traces in 5 minutes. That being said, there is a safer way to trace, just don’t do an unfiltered trace catching everything.

I personally feel that absolutely every server should have a trace capturing RPC:Completed and SQL:BatchCompleted events that ran over X seconds in duration. What is X? Well, that depends on the server. I’ve seen the best number for X be as low as 100 ms and as high as 10 seconds. Start high and work your way down. You can add the Statement Completed versions of each of these, or put them in a separate trace, if you want more detail.

You can get all of this except for SQL:BatchCompleted with the text data from Extended Events, and that is a very good alternative. I haven’t made that jump for a couple reasons, although I would recommend it. First, this level of tracing hasn’t hurt me. Second, I still support several important servers that are on SQL 2005, and I want to be as consistent as possible so I can provide the most consistent support possible.

What you’re trying to solve here is answering what ran long, getting some hints as to why it ran long, and being able to say how often it has been running that way in recent history. A long duration doing very little work was waiting on something else, and other monitoring will help solve that problem. A lot of CPU, reads, or writes shows that you may need to look into tuning or statistics. Just keep in mind that reads can be reads of work tables, and writes can be writes to tempdb.

If you do it right you should have 4 or more days worth of history, and by doing it right I mean having 5 to 10 rollover files at a reasonable size that you can send off if needed (100 MB at most, they zip well) and capturing over the best duration threshold for your environment.
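To make that concrete, here’s a minimal sketch of a server-side trace set up that way. The 5 second threshold, the 10 rollover files of 100 MB each, and the E:\Traces\LongRunning output path are my assumptions to adjust for your environment; the duration filter value is in microseconds on SQL 2005 and up.

DECLARE @TraceID int
    , @MaxFileSize bigint
    , @Duration bigint
    , @On bit

SELECT @MaxFileSize = 100     --MB per rollover file
    , @Duration = 5000000     --5 seconds, in microseconds on SQL 2005+
    , @On = 1

--Option 2 = TRACE_FILE_ROLLOVER; keep 10 files of 100 MB each
EXEC sp_trace_create @TraceID OUTPUT, 2, N'E:\Traces\LongRunning', @MaxFileSize, NULL, 10

--Event 10 = RPC:Completed, Event 12 = SQL:BatchCompleted
--Columns: 1 TextData, 11 LoginName, 12 SPID, 13 Duration, 14 StartTime, 16 Reads, 17 Writes, 18 CPU, 35 DatabaseName
EXEC sp_trace_setevent @TraceID, 10, 1, @On
EXEC sp_trace_setevent @TraceID, 10, 11, @On
EXEC sp_trace_setevent @TraceID, 10, 12, @On
EXEC sp_trace_setevent @TraceID, 10, 13, @On
EXEC sp_trace_setevent @TraceID, 10, 14, @On
EXEC sp_trace_setevent @TraceID, 10, 16, @On
EXEC sp_trace_setevent @TraceID, 10, 17, @On
EXEC sp_trace_setevent @TraceID, 10, 18, @On
EXEC sp_trace_setevent @TraceID, 10, 35, @On
EXEC sp_trace_setevent @TraceID, 12, 1, @On
EXEC sp_trace_setevent @TraceID, 12, 11, @On
EXEC sp_trace_setevent @TraceID, 12, 12, @On
EXEC sp_trace_setevent @TraceID, 12, 13, @On
EXEC sp_trace_setevent @TraceID, 12, 14, @On
EXEC sp_trace_setevent @TraceID, 12, 16, @On
EXEC sp_trace_setevent @TraceID, 12, 17, @On
EXEC sp_trace_setevent @TraceID, 12, 18, @On
EXEC sp_trace_setevent @TraceID, 12, 35, @On

--Only keep events at or above the duration threshold (comparison operator 4 = greater than or equal)
EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @Duration

--Start the trace; sp_trace_setstatus with 0 stops it and 2 deletes the definition
EXEC sp_trace_setstatus @TraceID, 1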

Wait Stats – What’s slowing you down?

The traces tell you what work was done, and the wait statistics tell you what went on when the query couldn’t actively do its work. This is cumulative over the server, so you can’t get details down to an individual process; however, you can see how much time is wasted and where it’s wasted.

There are almost 500 distinct wait types in 2008 R2, but you don’t need to worry about all of them. The important thing is to know what your biggest waits are, and you can look them up if you don’t recognize them. In the last slide I have links to a free eBook for the Accidental DBA written by Jonathan Kehayias and Ted Krueger that does an amazing job documenting what you can ignore and the meaning of the big ones you’re most likely to see.

If you’re looking for overall server health then you want to look at waits over a long period of time, specifically the times of day you’d like to see better performance. For incidents, you want to look at what your wait stats are now and compare them to your running baseline. For me, the typical baseline I use is the same timeframe yesterday and 7 days ago. These are actually relevant days that have everything except the changes you put in over the last couple of days, and it’s a true baseline for this specific server. If this server never waits on PAGEIOLATCH, but it’s in your top 3 today, then you instantly have a direction to head in your troubleshooting.

Blocking – Who’s in your way?

Locking occurs when one query is using data, and blocking occurs when another query needs to do something that is incompatible with that lock. If you don’t keep an eye on it then you’re looking back at a trace and seeing that a query took 1 second of CPU, did 1,000 reads, and no writes, yet it figured out a way to take 5 minutes in duration. The users complain the app is slow or unusable, and you’re giving them the answer that you’ll try to figure it out next time it happens and THEN try to find out the root cause.

That’s a bad idea. There are three types of users – those who don’t know you because things are running smooth, those who like you because you can say “I see exactly what happened and I can start working on avoiding reoccurrences”, and those users you keep telling that you’re not sure what happened and you’ll try to get a better idea the next time it happens. I wish I could tell you how to have more users not know your name, but watching blocking is a very easy way to get the users that like you to outnumber the users that hate you.

So, what can you do to tell them you know what just happened? My answer is to capture everything that’s blocking on the server every minute. Sure, there’s going to be a lot you capture that was actually only blocking for 10 ms, and there are going to be things that blocked for 59 seconds but were timed perfectly so you never saw them happen. The trick is to know that there is no perfect solution, and capture what you can. The DMVs are there to provide all the information, and being able to capture that on the fly, along with having it in a proactive monitoring setup, will make you look amazing to the users. If something blocked for 5 minutes, you have no excuse not to say “This query blocked this query, and this was the head blocker”. Sure, you may have an excuse to say “It’s a vendor database and I passed it on to their support”, but at least you can tell the users something, which is always better than nothing.

Query Stats

The DMV dm_exec_query_stats is used by many DBAs, even if all we realized was that we were running one of the “Top 10 Most Expensive Query” scripts off the internet. The problem is that this is using it wrong. This DMV holds everything from when the query first went into cache and loses that information when the query goes out of the cache, even just for being recompiled. So the big process you ran overnight could still be in cache, leading you down a path that’s actually low priority for you. However, that relatively large process that runs every 5 minutes that just recompiled a couple minutes ago due to stats auto updating, it’s not even in there. This is too volatile to say that you’re getting good numbers out of it.

The answer isn’t to find somewhere else to get this information, it’s to capture this information more often and more intelligently. Capture it once and you have a snapshot of what it looked like, capture it again an hour later and you know what work has been done in that hour. Repeat the process over and over again, constantly adding another hour’s worth of data and saving off the latest snapshot in the process and you have some real information to go off of. This isn’t so much of what happened to be in cache at the time, it’s a pretty good summary of what ran and when it ran. Now you can query in detail, specifically saying “I want to know what queries are doing the most physical reads between 9:00 AM and 5:00 PM on weekdays so I can have the biggest impact when I tune this database.”
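Here’s a minimal sketch of that capture, assuming a QueryStatsSnap table you create yourself (the name is mine, nothing standard). Diffing two snapshots, keyed on the handles, statement offsets, and creation_time, follows the exact same pattern as the wait stats scripts later in this post, so I’ll leave that part out here.

CREATE TABLE QueryStatsSnap (
    DateAdded datetime NOT NULL DEFAULT (GetDate())
    , sql_handle varbinary(64)
    , plan_handle varbinary(64)
    , statement_start_offset int
    , statement_end_offset int
    , creation_time datetime
    , execution_count bigint
    , total_worker_time bigint
    , total_physical_reads bigint
    , total_logical_reads bigint
    , total_logical_writes bigint
    , total_elapsed_time bigint
)
GO

--Run on a schedule (hourly is plenty for most servers); if creation_time changed between
--snapshots the plan recompiled, so treat that row as a fresh starting point when diffing
INSERT INTO QueryStatsSnap (sql_handle, plan_handle, statement_start_offset, statement_end_offset, creation_time,
    execution_count, total_worker_time, total_physical_reads, total_logical_reads, total_logical_writes, total_elapsed_time)
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , creation_time
    , execution_count
    , total_worker_time
    , total_physical_reads
    , total_logical_reads
    , total_logical_writes
    , total_elapsed_time
FROM sys.dm_exec_query_stats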

Index Stats

The DMV dm_db_index_usage_stats isn’t too different from dm_exec_query_stats except that it’s cumulative since the time SQL services were started. Looking at a snapshot still leaves you half blind, unable to see what happened before the last time you rebooted the server and unable to tell when that index was used. Personally, seeing when it was used is less important to me than how much it was used long-term. Tracking this can do amazing things for you if you’re really fine-tuning a database.

This DMV is really under-used in my experience. As time goes on, the data in a database changes and you add more indexes to make it run faster, but you don’t know if or when the index it used to use is still being used by anything else, so it stays there. Over time that means you have more and more unused indexes, and there’s no solid proof that they aren’t being used unless you do the work to collect that proof. This, in addition to duplicated indexes, adds to the workload for inserts and updates, adds to the workload of index maintenance, adds to the database size, and puts additional strain on your cache and thus your PLE.
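As a starting point, here’s a sketch of comparing reads to writes per nonclustered index in the current database. Keep in mind the usage counts only go back to the last restart of SQL services, which is exactly why this is worth snapshotting into a table over time.

SELECT SchemaName = s.name
    , TableName = o.name
    , IndexName = i.name
    , user_reads = ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0)
    , user_writes = ISNULL(us.user_updates, 0)
FROM sys.indexes i
    INNER JOIN sys.objects o ON i.object_id = o.object_id
    INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
    LEFT JOIN sys.dm_db_index_usage_stats us ON us.object_id = i.object_id
        AND us.index_id = i.index_id
        AND us.database_id = DB_ID()
WHERE o.is_ms_shipped = 0
    AND i.type_desc = 'NONCLUSTERED'
ORDER BY user_writes DESC, user_reads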

OS Performance Counters

This is one of the most misunderstood DMVs while also being one of the most useful. dm_os_performance_counters gives you critical counters showing how SQL Server is interacting with the hardware, but it does so with several different types of counters that have to be measured differently. There are a couple great posts out on the internet on understanding how each one needs to be measured. For right now, we’re focusing on why you want to watch it and what it will do for you. And wouldn’t you know that I saved the hardest one to describe for last, because each value is different and there will be controversy over which ones you should be watching and which ones are just a waste of your time to monitor.

PLE is one there’s no controversy about; you should watch it. This is the average age of the pages in cache, and a good measurement of when you’re doing too many physical reads. Peeking in on this from time to time is good so you can make sure it’s high enough. However, watching it continuously will let you know when it’s dropping and help you dive into why it’s dropping. If you’re also running a trace you can see what did a lot of reads in that timeframe, and if you’re capturing Query Stats then you can find out which one of your large read queries is doing all the physical reads that destroy your PLE.

Target Server Memory and Total Server Memory are also great to watch, even though they rarely change after the server is back up to speed following a restart of SQL services. Target memory is how much SQL Server would like to have, and Total memory is how much it actually does have. If target memory drops then there’s OS pressure you need to worry about. Also, if Total memory is still increasing then the server is still filling the cache, and PLE can pretty much be ignored. Of course it’s low, everything in there is new.

Page Lookups, Page Reads, Page Writes and Lazy Writes give you a better idea of how SQL Server is interacting with your cache and disks. If you want a real picture of what SQL Server is doing behind the scenes, this is much more useful than glancing at cache hit ratio.

Deadlocks and Memory Grants Pending are two things you’d like to always see at zero. That may be out of the question with deadlocks, but if you’re getting above zero on memory grants then you need to find out when that happened and everything that was running at that time.

SQL Compilations and Recompilations are hidden CPU hogs. They’re never going to be at zero because very few servers have everything stay in cache forever, ad-hoc code runs, and several other reasons. However, when one of these counters jumps up, you may have a hidden cause. For instance, if a piece of code that runs every minute or more was written in a way that it can’t be stored in cache, these numbers will be noticeably higher. When this happens, your CPU will be noticeably more stressed with no other indicators as to why, and that query won’t even show up in your Query Stats since those rely on showing the stats of everything that’s currently in cache.
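Pulling all of these is a single query against the DMV. Here’s a quick sketch; the point-in-time counters such as PLE and the memory counters are usable as-is, while the per-second counters such as Page reads/sec and the compilation counters are cumulative since SQL services started and need two snapshots diffed over time, just like wait stats.

--cntr_type tells you how the counter has to be read: raw point-in-time values versus cumulative per-second values
SELECT pc.object_name
    , pc.counter_name
    , pc.instance_name
    , pc.cntr_value
    , pc.cntr_type
FROM sys.dm_os_performance_counters pc
WHERE pc.counter_name IN ('Page life expectancy', 'Target Server Memory (KB)', 'Total Server Memory (KB)'
    , 'Page lookups/sec', 'Page reads/sec', 'Page writes/sec', 'Lazy writes/sec'
    , 'Memory Grants Pending', 'SQL Compilations/sec', 'SQL Re-Compilations/sec'
    , 'Number of Deadlocks/sec')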

Database Sizes

This seems simple, and it is. Most people know how to find the sizes of their databases, but do you track it? If not, you should. Some questions you should be able to answer readily are how fast your databases are growing, when you will run out of space, and whether the steady decline in PLE is justified by the growth of the database. Getting into more detail, you can touch on when the database grows, such as an accounting app jumping in size every April, or whether the data is growing steadily or exponentially. These are important things to know when it comes time to budget for new servers and disk space.
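A sketch of what that capture can look like, assuming a DatabaseSizes table of your own design that gets populated by a daily job:

CREATE TABLE DatabaseSizes (
    DateAdded datetime NOT NULL DEFAULT (GetDate())
    , database_name SysName
    , file_name SysName
    , type_desc NVarChar(60)
    , size_mb Decimal(12,2)
)
GO

--size is in 8 KB pages; sys.master_files covers every database from one spot
INSERT INTO DatabaseSizes (database_name, file_name, type_desc, size_mb)
SELECT database_name = DB_NAME(mf.database_id)
    , file_name = mf.name
    , mf.type_desc
    , size_mb = mf.size * 8 / 1024.
FROM sys.master_files mf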

On the other side of things, you also have information to push back on application teams about how much space they’re using and whether that’s really necessary. A lot of times you’ll find that they’re surprised by their growth, to the point that you have to show them the numbers for them to believe it. It’s not uncommon for the response to be a cleanup project that keeps the databases smaller and running faster, which is a big goal accomplished.

Table Sizes

This is just building off of watching your database sizes in more detail. When a database starts filling up, and it’s filling up quick, it’s good to be able to tell an application team which table is doing it. In my current environment it makes sense to watch every table that is over .5% of the database size AND over 100 MB. This means that on some databases we don’t watch anything and on other databases we watch about the 20 biggest tables. The point is that we’re not trying to watch everything, just anything big enough to make a difference to us.
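Here’s a sketch of grabbing those sizes for one database, assuming a TableSizes table of your own; the .5% and 100 MB cutoffs are easy to bolt on afterwards as a WHERE clause in your reporting.

CREATE TABLE TableSizes (
    DateAdded datetime NOT NULL DEFAULT (GetDate())
    , database_name SysName
    , schema_name SysName
    , table_name SysName
    , used_mb Decimal(12,2)
    , row_cnt BigInt
)
GO

--Run in each database you care about; used_page_count is in 8 KB pages
INSERT INTO TableSizes (database_name, schema_name, table_name, used_mb, row_cnt)
SELECT database_name = DB_NAME()
    , schema_name = s.name
    , table_name = t.name
    , used_mb = SUM(ps.used_page_count) * 8 / 1024.
    , row_cnt = SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END)
FROM sys.dm_db_partition_stats ps
    INNER JOIN sys.tables t ON ps.object_id = t.object_id
    INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
GROUP BY s.name, t.name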

If a database is filling a lot faster than normal then a lot of the time there’s a process that isn’t running like it should. The biggest table in the database may be a rather static value, but the third biggest table in the database wasn’t even on your radar two weeks ago. The app teams love it when you can tell them that not only is the database growing out of control, but we also see the growth in table X which has been growing at 200 MB per day starting on the first Saturday of last month. By the way, wasn’t last Saturday the day you changed a couple procs around or implemented an upgrade?

Now you just graduated from “Hey, there’s a problem” to also include “and here’s a huge lead to finding the root cause.” It’s rare to use this information, but it’s lightweight to capture for something that gives you a heavyweight appearance when you reference this knowledge to other teams.

My Critiques

I’d love to hear what you think about this presentation. Before we get to that, here are the complaints from the harshest person in the room when I was presenting…

I didn’t do enough to say what each thing was. Although I marked the presentation as intermediate, you don’t know who’s going to be there. In this case, I knew a couple people in the crowd, and they ranged from data analyst to database manager. It should have been presented more as “If you’re not ready for intermediate then you have everything you need here, but it will take some effort to keep up”.

My slides were lacking. I stick with the idea that there should be few words on the screen, but I took it too far. I admitted above that the slides were there more to keep me on track than to help the audience, which I probably needed for my first public speaking venue. However, next time I’ll be less nervous and more prepared to gear them toward the audience. Instead of random pictures, I should have more graphs and data that I can talk through to give the audience visuals.

Nothing was given to the audience, and there should have been something. To be fair, none of the presentations that I know of handed anything to the audience. However, I set my own bar, and I feel that people would like a sheet of paper with the outline on it with links to online sources. My blog, of course, because it shows how to grab everything. Also, an outside link for each item. This gives a physical reminder to turn this knowledge into action.

There was no demo, and that was the audience’s biggest complaint. While I’m still not sure that I would do a live demo on my second go around, I’m going to hold myself to the remark above about my slides having graphs and data. That would also drag me away from the podium to talk through the slides, making it a more dynamic and interesting presentation; that would really help the audience take things in.

Your Turn

Now I need you to do two things. Putting this information to work in your environment is what will help you the most. What will help me the most is you commenting on this post both before and after you implement any of this so you can help me become a better presenter.

Thank you for taking the time to read this!

SQL Server Running Slow

A user calls to say the app or server is slow today.  Here’s a quick summary of how I get started.  It depends on monitoring where you can compare baselines to recent activity, and I include links to all of the code I use.

The initial question is too vague.  You can’t blame this individual user because this is how 95% of issues make it to you.   The initial follow-up will be the following questions I ask the user along with the corresponding question your boss is asking you:

  • Can anyone else duplicate your issue?
    • Is this the user’s machine or a server issue?  Not a 100% conclusive answer, but it’s a start.
  • Is everything running slow or just one process or screen?
    • Is the whole server running slow or just one stored procedure?
  • Were any changes made to the system recently?  Depending on the user, this may be a question for someone else.
    • Did we cause it by making a change or did this just come up?

However, this is enough for me to do a very quick overall health check while I wait for those answers.  It seems like a lot at first glance, but it’s practically automated with the scripts I have on hand.  This is only opening 5 pre-written scripts, running them, and peeking at the results.

Quick Health:

  • Look at historical wait stats for the last two hours and the same time window both 1 and 7 days ago.  Documented in the Wait Stats post.
  • Look at memory and CPU contention for the same times as wait stats.  You’re specifically looking for high CPU usage, PLE drops, and changes to target memory. Documented in the OS Performance Counters post.
  • Look for blocking in the last hour, even though it would typically show up in wait stats.  You can compare this to previous days if needed.  Documented in the Blocking post.
  • Look at the large queries that completed in the last two hours along with the queries currently running to see if anything pops out.  This is just a quick glance looking for glaring issues, and you can look for details after getting answers to the follow-up questions.  Unfortunately, this information isn’t as cut-and-dry as the rest, and it’s going to rely on you being a little familiar with what normal is on this server.  Completed queries are documented in the Tracing Creation and Reading posts, and currently running queries are documented in the Running Processes post.
  • Look at the queries recompiled recently to see if a new execution plan is taking excessive resources.  Documented in the Recently Recompiled Resource Hogs post.

So, where do you go from here?  Yeah, you get the typical “it depends” answer from me.  The information you just looked at will normally come up with something that doesn’t look right and needs to be looked into further.  However, there are so many directions this can go from here that a single document can’t cover it.  The purpose of this post is simply to answer the question “Where do I start looking with the most common question a user throws at me”, nothing more.  Also, I wanted a way to tie all of the random posts together so you could see them working together as the full solution.

White papers, DMVs and Monitoring Concepts

According to Jason Strate (SQL Server MVP) there are 5 white papers every SQL Server DBA should read. Considering my blogging is focused on how to monitor and tune previously installed servers, we’ll focus on the first two of these white papers: Troubleshooting Performance Problems in SQL Server 2008 and SQL Server 2005 Waits and Queues. Ignore the age of these papers; after all, you’re running mostly the same code to do mostly the same tasks in SQL Server 6.5 and SQL Server 2012…well, except for the lack of DMVs in 6.5.

I don’t agree with them 100%, but not to the point that I don’t agree 100% that everyone should read them. The sources of the information, the uses of the DMVs, all of it is what you’ll need to know to progress as a DBA. Ask any DBA III or higher and they’ll be able to tell you most of what’s said in these papers, even if they went the hard route and made it to where they are without reading them. Where they fall short is by giving static values that aren’t worded as just starting points, and the trending they do is the trending you can do as a Microsoft Support Engineer where nothing is saved or long-term.

My differences start with minor complaints, such as the blanket statement that you should start to worry when Page Life Expectancy is under 300 seconds. First, it’s a blanket statement, and I’m a firm believer that the word always means you’re almost always wrong. Second, servers have more memory now, and my SAN team would kill me if I told them I would have to read 500 GB of data every 300 seconds on some of my bigger servers. I do agree that there are times you should worry about PLE, but you have to keep two things in mind: how much memory you’re cycling through in that time and what’s normal for this specific server.  I just hope that regular drops down to 3,000 are closer to the point that grabs your attention.

Let’s build off of that last bit there…what’s normal for this specific server? The queries they’re giving you are snapshots of what info the server has for you right now. Some of that is what’s happening now, some is what has happened since the last restart of SQL services, and other pieces fall somewhere in the middle. You can’t tell me that I’ve done X amount of work since the server was rebooted two weeks ago and expect me to tell you if that’s an issue. You also can’t tell me that “This proc has used more CPU than any other by far since that reboot two weeks ago” and expect me to tell you if that’s a critical issue or an intense, yet normal and expected, off-hours maintenance task. You can, however, tell me that on a typical Monday between 2:00 PM and 5:00 PM we do X, but this Monday we’re doing 100 times that while the business has no reason to say this Monday should be any different. Now we’re on to something…we know there’s an issue, and we just found our starting point to solving it. Chances are if you’re reading this then either you or your immediate manager could create a database to keep this trending information if it doesn’t exist already, and you’re making it harder on yourself if you don’t.

These two white papers are almost 200 pages long together, so I’m not going to pretend to cover them in one post. Actually, I’m hoping my next 50 posts might cover most of them. The points I’m going for are that these white papers are telling you the perfect stats to watch, and these stats are going to be more meaningful when you monitor them. I’ve started this in some of my posts such as looking at Wait Stats and Blocking, but there’s a long way to go. More than these two DMVs need to be watched like this, but they are good examples of wait stats capturing cumulative data and blocking capturing point-in-time data from DMVs.

If you see a DMV in these papers, you’ll see it in my posts. If it’s not there yet, it will be. Some, like dm_exec_sessions, will never have a dedicated post unless you count Running Processes, which links it to several DMVs. Others, like dm_os_performance_counters, might not fit comfortably in a single post. The point here is that all the DMVs, especially the ones mentioned here, are worth looking into more.

Read the papers. Always go beyond the question of “what should I do”, get to the “why should I do it”, and strive for “how can I make this mean more and be more useful”. Never take my scripts or anyone else’s as-is. Tinker with them, really learn the DMVs, and it will all start to come together.

Blocking – Capturing and Monitoring

If a query is taking longer to run than normal, there’s a good chance it’s being blocked by something else. This is especially true when you’re doing something ridiculously simple and SQL Server just sits there thinking. Symptoms of blocking problems include a trace of SQL:BatchCompleted and RPC:Completed with durations over 10 seconds coming back with results using less than a second of CPU time and very few reads and writes. Also, if you’re watching your wait stats, then you’ll start to see more waits that start with LCK_ than normal.  Not to mention the overly obvious blocking found in your Running Processes.

This query will show you the blocking currently occurring on your server.

SELECT tl.resource_type
    , database_name = DB_NAME(tl.resource_database_id)
    , assoc_entity_id = tl.resource_associated_entity_id
    , lock_req = tl.request_mode
    , waiter_sid = tl.request_session_id
    , wait_duration = wt.wait_duration_ms
    , wt.wait_type
    , waiter_batch = wait_st.text
    , waiter_stmt = substring(wait_st.text,er.statement_start_offset/2 + 1,
                abs(case when er.statement_end_offset = -1
                then len(convert(nvarchar(max), wait_st.text)) * 2
                else er.statement_end_offset end - er.statement_start_offset)/2 + 1)
    , waiter_host = es.host_name
    , waiter_user = es.login_name
    , blocker_sid = wt.blocking_session_id
    , blocker_stmt = block_st.text 
    , blocker_host = block_es.host_name
    , blocker_user = block_es.login_name
FROM sys.dm_tran_locks tl (nolock)
    INNER JOIN sys.dm_os_waiting_tasks wt (nolock) ON tl.lock_owner_address = wt.resource_address
    INNER JOIN sys.dm_os_tasks ot (nolock) ON tl.request_session_id = ot.session_id AND tl.request_request_id = ot.request_id AND tl.request_exec_context_id = ot.exec_context_id
    INNER JOIN sys.dm_exec_requests er (nolock) ON tl.request_session_id = er.session_id AND tl.request_request_id = er.request_id
    INNER JOIN sys.dm_exec_sessions es (nolock) ON tl.request_session_id = es.session_id
    LEFT JOIN sys.dm_exec_requests block_er (nolock) ON wt.blocking_session_id = block_er.session_id
    LEFT JOIN sys.dm_exec_sessions block_es (nolock) ON wt.blocking_session_id = block_es.session_id 
    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) wait_st
    OUTER APPLY sys.dm_exec_sql_text(block_er.sql_handle) block_st

However, if you have time to watch your servers all the time and keep running that query then you’re overstaffed in a world that leans towards being understaffed. For way too many reasons to list here, throw the results of this into a table every minute. If someone says a query ran long and you see the server was waiting on locks, look here. If you want to be proactive and look for ways you can improve server performance, look here.

CREATE TABLE Blocking (
    BlockingID BigInt Identity(1,1) NOT NULL
    , resource_type NVarChar(60)
    , database_name SysName
    , assoc_entity_id BigInt
    , lock_req NVarChar(60)
    , wait_spid Int
    , wait_duration_ms Int
    , wait_type NVarChar(60)
    , wait_batch NVarChar(max)
    , wait_stmt NVarChar(max)
    , wait_host SysName
    , wait_user SysName
    , block_spid Int
    , block_stmt NVarChar(max)
    , block_host SysName
    , block_user SysName
    , DateAdded datetime NOT NULL DEFAULT (GetDate())
)
GO

CREATE UNIQUE CLUSTERED INDEX IX_Blocking_DateAdded_BlockingID_U_C ON Blocking
(
    DateAdded
    , BlockingID
) WITH (Fillfactor = 95)
GO
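If you want this automated, here’s a sketch of the SQL Agent job. It assumes you’ve wrapped the SELECT above in a stored procedure, dbo.CaptureBlocking (my name for it, nothing standard), that does the INSERT INTO Blocking, and that it all lives in a database called DBA.

USE msdb
GO

EXEC dbo.sp_add_job @job_name = N'DBA - Capture Blocking'

--Assumes dbo.CaptureBlocking wraps the blocking query above in an INSERT INTO dbo.Blocking
EXEC dbo.sp_add_jobstep @job_name = N'DBA - Capture Blocking'
    , @step_name = N'Capture blocking snapshot'
    , @subsystem = N'TSQL'
    , @database_name = N'DBA'
    , @command = N'EXEC dbo.CaptureBlocking'

--freq_type 4 = daily, freq_subday_type 4 = minutes; this runs every minute, all day
EXEC dbo.sp_add_jobschedule @job_name = N'DBA - Capture Blocking'
    , @name = N'Every minute'
    , @freq_type = 4
    , @freq_interval = 1
    , @freq_subday_type = 4
    , @freq_subday_interval = 1

EXEC dbo.sp_add_jobserver @job_name = N'DBA - Capture Blocking', @server_name = N'(local)'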

So, now you know what your blocking problems are. So, how do you fix them? That’s a deeper dive than I’ll be doing now, but here are the key words to look into:

  • NOLOCK hint
  • Query tuning
  • Process timing
  • Eliminate cursors

Every situation is going to be unique, and I can’t pretend like I can give you all the answers. Finding that there is an issue can be difficult enough, and those are the answers I can help with.

Wait Stats – Monitoring and Using

sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.

If, however, you knew what the values were 30 minutes ago and what they are now, then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30 minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline so you can say whether what you’re seeing is just normal for the server or a true incident. Remember, the only way to know if numbers on a server are a problem or not is to know what’s normal for that specific server.

A half-hour increment is a bit much for me, and I would prefer to capture it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially when looking into an incident on a Monday where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.

To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.

After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.

Here are the tables; I comment out stuff if I would cry if I accidentally ran it at the wrong time.

/*
IF OBJECT_ID('WaitStats') IS NOT NULL BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats (
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast (
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO

Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.

Now to populate the tables. This script will need to be run on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.

IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL BEGIN
    DROP TABLE #WaitStatsNew 
END

SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count 
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms 
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    , max_wait_time_ms = n.max_wait_time_ms --It's a max, not cumulative
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

TRUNCATE TABLE WaitStatsLast 

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew 

DROP TABLE #WaitStatsNew
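Since I mentioned keeping about 13 months of history, the trim can be tacked onto the end of that same job:

--Keep the running baseline at roughly 13 months
DELETE FROM WaitStats
WHERE DateEnd < DATEADD(Month, -13, GETDATE())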

Finally, querying it. Give it a decent time frame; I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.

DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 
--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger  
INSERT INTO #IgnoredWaits 
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK' 
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart 
    AND DateEnd <= @dEnd 
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post.  However, there is an Accidental DBA book out there that explains this amazingly well.  Here are links to the FREE PDF Version and the Physical Book.