Monitoring and Baselines Presentation


Here’s my full presentation for SQL Saturday #250 in Pittsburgh this past Saturday along with some notes on what I got out of it.

My Lessons

I learned a lot doing this, and I hope people learned a lot from it. For the people on the other side of the podium from me, the main lessons were in what I put in the abstract; they learned mostly about SQL Server. On my side of the podium though, the lessons were far from the same. Public speaking, creating presentation, prepping for unknown audiences, and seeing the gratitude of the SQL Server community in person. Anything that was on the abstract I learned slightly better, but that didn’t compare to what wasn’t on it.

Public speaking was always a fear of mine, and I tend to be someone who avoids my fears. However, throwing myself out there at the public as a whole was never my thing either, and I’ve been doing half decent, at least in my own mind. That being said, I decided to go all out and push myself a little further. After all, what are the chances of them picking me to present with all those professional teachers, consultants, and MVPs out there throwing up their abstracts. Best case I could say I tried, worst case I was going to throw myself out of my comfort zone and hope for the best.

They picked me, which I didn’t let myself expect. Everyone that knew I put in my abstract was also told by me that I only had a 50/50 chance at best, which was more optimistic than I really was, but I don’t like people knowing when I’m being dismissive of myself. It turns I was wrong a lot, and in every case I was wrong I was glad I was wrong.

Being Forced to Get Better

I’m not one to learn things for the sake of learning them, which is why I sucked in school. To really learn things I need a real-life use case, somewhere I’m going to apply it. I didn’t practice speaking ever before because I didn’t speak in front of large groups. I didn’t know how to put together a presentation properly because that’s not my thing…well, wasn’t is probably a better word now. Just like a couple months ago I wasn’t quite sure how to blog, but I’ve made it past that point.

Like everything in my life, blogging went from unknown to addiction quick. I’m not sure that I can say the same thing about speaking, but I can see the possibility. SQL Saturday only comes to Pittsburgh once a year and the local user group, which I’ll be joining soon, only meets once a month and has a single speaker. However, I can’t say that I didn’t look up when the SQL Saturdays in Cleveland and Washington DC were. Also, I took my own thoughts on my presentation and the feedback I received in and immediately starting thinking about how I could make that presentation better.

What’s below is the original, not touched up at all in the last 4 days. In part to show where I went wrong and how I’m going to fix it, and in part because Monday was my first day at a new job. By the way, starting a new job when between jobs is the best and worst thing you could ever do to yourself. You don’t have time to be too nervous about anything because you’re too overworked, but you’re also too overworked. Eh, you can’t win them all.

Well, on to the presentation. What you have here is my script that I talked through and taught myself before the presentation. However, if you were there, you’ll notice this isn’t exactly what I said. I didn’t read it, I presented it. There were no demos (the biggest complaint from the crowd) to avoid me from stumping myself, and the PowerPoint was just a whimsical picture for the start of each paragraph to keep me on track because I knew I’d be nervous and lose my way without a paragraph-by-paragraph guide helping me along.

I must have done at least descent because my reviews came back with two 3’s and the rest 4’s and 5’s out of 5. You’re just asking me to get off topic and start talking about why I hate ratings that are odd numbers, but no, I’m not getting off topic, I’m getting on to my presentation!!!

The Presentation

Download slide deck here.

I’ve been working on databases for over a decade, and most of that without having good standards or monitoring in place. My job was rough and I was much less effective. Now I know what I’m doing and I want to share that knowledge. In this presentation I am going to tell you what I watch and why I watch it. A presentation is a poor format to go through how to do this; having it in writing is much more beneficial to you. That is why I’m pretty much skipping that portion here and diverting you to my blog at SimpleSQLServer.com. Not so much to promote a blog that I lose money on, but instead to give you the resources in the best format.

As a DBA you will hit problems with the performance of your databases. It doesn’t matter if it’s one process or across the board, or if it’s just today or it has always been that way. No matter what it is, the more you know about your servers the easier it’s going to be to fix them.

It’s not easy, there’s no single spot to watch, no single solution to all of your problems, and there’s no “normal” values for these counters you can apply across every server you manage (Note: Thank you, Mike John, you stressed this point a lot). I watch several things on all the important servers, and most of these on every server. The important part is that you watch them continuously, even when you aren’t expecting to use the data. Some parts are cumulative and you can’t tell what was during an incident or what was from overnight maintenance. Other parts are snapshots and there’s no looking back.

(Note: Now I’m on to the stuff Brian Castle taught me. He’s the best you could hope for in a manager, and last I checked he was still hiring at EDMC in Robinson Township near Pittsburgh, PA)

To me, monitoring and baselining is the same thing. I know others will do a specific baselining process on a specific day and keep that. I feel you lose too much doing that and I watch enough to say that I have a continuous baseline going back for 13 months in most instances. I’m not all that worked up about lucky numbers or anything, it’s just nice to be able to say “that annual process we ran last week ran like this last year”. If you’re superstitious or have OCD, 400 days makes me just as happy.

Traces – Snapshot
Wait Stats – Cumulative – Resets on restart
Blocking – Snapshot
Query Stats – Cumulative – Resets on recompile
Index Stats – Cumulative – Resets on restart
OS Perf Counters – Varies
Database Sizes – Snapshot
Table Sizes – Snapshot

Traces – What long running processes have completed on the server?

You have to be careful, this one can kill your server. I have, on more than one occasion (sorry, Brian), caused a large server to reboot in the middle of the day because I filled the drives writing 40 GB of traces in 5 minutes. That being said, there is a safer way to trace, just don’t do an unfiltered trace catching everything.

I personally feel that absolutely every server should have a trace capturing RPC:Completed and SQL Batch:Completed that ran over X seconds duration. What is X? Well, that depends on the server. I’ve seen the best number for X be as low as 100 ms and as high as 10 seconds. Start high, and work your way down. You can add or do a separate trace for the Statement Completed for each of these if you want more detail.

You can get all of this except for SQL Batch:Completed with the text data from Extended Events, and that is a very good alternative. I haven’t made that jump for several reasons, although I would recommend it. First, this level of tracing hasn’t hurt me. Second, I still support several important servers that are still on SQL 2005 and I want to be as consistent as possible to provide as consistent as possible support.

What you’re trying to solve here is answering what ran long, have some hints as to why it ran long, and be able to say how often it has been running that way in recent history. A long duration doing very little work was waiting on something else, and other monitoring will help solve that problem. A lot of CPU, reads, or writes shows that you may need to look into tuning or statistics. Just keep in mind that reads can be reads of work tables, and writes can be writes to tempdb.

If you do it right you should have 4 or more days worth of history, and by doing it right I mean having 5 to 10 rollover files sized a reasonable size you can send off if needed (100 MB at most, they zip well) and capturing over the best duration threshold for your environment.

Wait Stats – What’s slowing you down?

The traces tell you what work was done, and the waist statistics tell you what went on when the query couldn’t actively do its work. This is cumulative over the server, so you can’t get details down to an individual process, however, you can see how much time is wasted and where it’s wasted at.

There are almost 500 distinct wait types in 2008 R2, but you don’t need to worry about that. The important thing is to know what are your biggest waits are, and you can look them up if you don’t recognize them. In the last slide I have links to a free eBook written by Jonathan Kehayias and Tom Kruger for the Accidental DBA that does an amazing job documenting what you can ignore and the meaning of the big ones you’re most likely to see.

If you’re looking for overall server health then you would want to look at waits over a long period of time, specifically the times of day you’d like to see better performance. For incidents, you want to look at what your wait stats are now and compare them to your running baseline. For me, the typical baseline I use is the same timeframe yesterday and 7 days ago. These are actual relevant days that have all but the changes you put in over the last couple of days, and it’s a true baseline for this specific server. If this server never waits on PageIOLatch, but it’s in your top 3 today then you instantly have a direction your heading in troubleshooting.

Blocking – Who’s in your way?

Locking occurs when one query is using data, and blocking occurs when another query needs to do something that is incompatible with that lock. If you don’t keep an eye on it then you’re looking back at a trace and seeing that a query took 1 second of CPU, did 1,000 reads, and no writes, yet it figured out a way to take 5 minutes in duration. The users complain the app is slow or unusable, and you’re giving them the answer that you’ll try to figure it out next time it happens and THEN try to find out the root cause.

That’s a bad idea. There are three types of users – those who don’t know you because things are running smooth, those that like you because you can say “I see exactly what happened and I can start working on avoiding reoccurrences”, and those users that you keep telling that you’re not sure what happened and you’ll try to get a better idea the next time it happens. I wish I could tell you that I could have more users not know your name, but watching blocking is a very easy way to get the users that like you to outnumber the users that hate you.

So, what can you do to tell them you know what just happened? My answer is to capture everything that’s blocking on the server every minute. Sure, there’s going to be a lot you capture that is actually only blocking for 10ms, and there are going to be things that were blocking for 59 seconds that were timed perfectly so you never saw it happen. The trick is to know that there is no perfect solution, and capture what you can. The DMVs are there to provide all the information, and being able to capture that on the fly along with in a proactive monitoring setup will make you look amazing to the users. If something blocked for 5 minutes, you have no excuse not to say “This query blocked this query, and this was the head blocker”. Sure you may have an excuse to say “It’s a vendor database and I passed it on to their support”, but at least you can tell the users something, which is always better than nothing.

Query Stats

The DMV dm_exec_query_stats is used by many DBAs, even if all we realized was that we were running one of the “Top 10 Most Expensive Query” scripts off the internet. The problem is that this is using it wrong. This DMV holds everything from when the query first went into cache and loses that information when the query goes out of the cache, even just for being recompiled. So the big process you ran overnight could still be in cache, leading you down a path that’s actually low priority for you. However, that relatively large process that runs every 5 minutes that just recompiled a couple minutes ago due to stats auto updating, it’s not even in there. This is too volatile to say that you’re getting good numbers out of it.

The answer isn’t to find somewhere else to get this information, it’s to capture this information more often and more intelligently. Capture it once and you have a snapshot of what it looked like, capture it again an hour later and you know what work has been done in that hour. Repeat the process over and over again, constantly adding another hour’s worth of data and saving off the latest snapshot in the process and you have some real information to go off of. This isn’t so much of what happened to be in cache at the time, it’s a pretty good summary of what ran and when it ran. Now you can query in detail, specifically saying “I want to know what queries are doing the most physical reads between 9:00 AM and 5:00 PM on weekdays so I can have the biggest impact when I tune this database.”

Index Stats

The DMV dm_exec_index_usage_stats isn’t too different from dm_exec_query_stats except that it’s cumulative since the time SQL services were started. Looking at a snapshot still leaves you half blind, unable to see what happened before the last time you rebooted the server and unable to tell when that index was used. Personally, seeing when it was used is less important to me than how much it was used long-term. Tracking this can do amazing things for you if you’re really fine-tuning a database.

This DMV is really under-used in my experience. As time goes on with a database the data changes and you add more indexes to make it run faster, but you don’t know if or when the index it used to use was being used by anything else so it stays there. Over time that means you have more and more unused indexes, and there’s no solid proof that they aren’t being used without you doing the work to collect that proof. This, in addition to duplicated indexes, adds to the workload for inserts and updates, adds to the workload of index maintenance, adds to the database size, and put additional strain on your cache and thus the PLE.

OS Performance Counters

This is one of the most misunderstood DMVs while also being one of the most useful. It’s giving you critical counters that give you an idea of how SQL Server is interacting with the hardware, but it’s doing it with several different types of counters that have to be measured differently. There are a couple great posts out on the internet to understanding how each one needs to be measured. As for right now, we’re focusing on why you want to watch it and what it will do for you. And wouldn’t you know that I saved the hardest one to describe for last, because each value is different and there will be controversy on which ones you should be watching and which ones are just a waste of your time monitoring.

PLE is one that this is no controversy about if you should watch it. This is the average age of the pages in cache, and a good measurement of when you’re doing too many physical reads. Peeking in on this from time to time is good, you can make sure it’s high enough. However, watching this will let you know when it’s dropping and help you dive into why it’s dropping. If you’re also running a trace you can see what did a lot of reads in that timeframe, and if you’re capturing Query Stats then you can find out which one of your large read queries is doing all the physical reads that destroy your PLE.

Server Target and Total memory are also great to watch even though they rarely change after the server is back up to speed following a restart of SQL services. The target memory is how much SQL Server would like to have, and Total memory is how much it actually does have. If target memory drops then there’s OS pressure you need to worry about. Also, as Total memory is increasing then it means the server is still filling the cache and PLE can pretty much be ignored. Of course it’s low, everything in there is new.

Page Lookups, Page Reads, Page Writes and Lazy Writes give you a better idea of how SQL Server is interacting with your cache and disks. If you want a real picture of what SQL Server is doing behind the scenes, this is much more useful than glancing at cache hit ratio.

Deadlocks and Memory Grants Pending are two things you’d like to always see at zero. That may be out of the question with deadlocks, but if you’re getting above zero on memory grants than you need to find out when that happened and everything that was running at that time.

SQL Compilations and Recompilations are hidden CPU hogs. They’re never going to be at zero because very few servers have stuff stay in cache forever, ad-hoc code runs, and several other reasons. However, when one of these counters jump up, you may have a hidden cause. For instance, if a piece of code that runs every minute or more was written in a way that it can’t be stored in cache, these numbers will be noticeably higher. When this happens, your CPU will be noticeably more stressed with no other indicators as to why, and this query won’t even show up in your Query Stats as it relies on showing the stats of everything that’s currently in cache.

Database Sizes

This seems simple, and it is. Most people know how to find the sizes of their databases, but do you track it? If not, you should. Some questions you should be able to answer readily are how fast your databases are growing, when will you run out of space, and is the steady decline in PLE justified by the growth of the database. Getting into more details, you can touch on when does the database grow such as an accounting app jumping in size every April, or answering if the data is growing steadily or exponentially. These are important things to know when it comes time to budget for new servers and disk space.

On the other side of things, you also have information to push back on application teams on how much space they’re using and if that’s really necessary. A lot of times you’ll find that they’re surprised by their growth to the point that you have to show them the numbers for them to believe it. It’s not uncommon for the response from this to be a cleanup project that helps keep the databases smaller and running faster, which is a big goal being accomplished.

Table Sizes

This is just building off of watching your database sizes in more detail. When a database starts filling up, and it’s filling up quick, it’s good to tell an application team what table is doing it. In my current environment it makes sense to watch every table that is over .5% of the database size AND over 100 MB. This means that some databases we don’t watch anything and other databases we watch about the 20 biggest tables. The biggest point is that we’re not trying to watch everything while watching anything big enough to make a difference to us.

If a database is filling a lot faster than normal then a lot of the time there’s a process that isn’t running like it should. The biggest table in the database may be a rather static value, but the third biggest table in the database wasn’t even on your radar two weeks ago. The app teams love it when you can tell them that not only is the database growing out of control, but we also see the growth in table X which has been growing at 200 MB per day starting on the first Saturday of last month. By the way, wasn’t last Saturday the day you changed a couple procs around or implemented an upgrade?

Now you just graduated from “Hey, there’s a problem” to also include “and here’s a huge lead to finding the root cause.” It’s rare to use this information, but it’s lightweight to capture for something that gives you a heavyweight appearance when you reference this knowledge to other teams.

My Critiques

I’d love to hear what you think about this presentation. Before we get to that, here are the complaints from the most harsh person in the room when I was presenting…

I didn’t do enough to say what each thing was. Although I marked the presentation as intermediate, you don’t know who’s going to be there. In this case, I knew a couple people in the crowd, and they ranged from data analyst to database manager. It should have been presented in more of “If you’re not ready for intermediate then you have everything you need, but it will take some effort to keep up”.

My slides were lacking. I stick with the idea that there should be few words on the screen, but I took it too far. I admitted above that the slides were there more to keep me on track then it was to help the audience, which I probably needed for my first public speaking venue. However, next time I’ll be less nervous and more prepared to make it geared for the audience better. Instead of random pictures, I should have more of graphs and data that I can talk through to give the audience visuals.

Nothing was given to the audience, and there should have been something. To be fair, none of the presentations that I know of handed anything to the audience. However, I set my own bar, and I feel that people would like a sheet of paper with the outline on it with links to online sources. My blog, of course, because it shows how to grab everything. Also, an outside link for each item. This gives a physical reminder to turn this knowledge into action.

There was no demo, and that was the audience’s biggest complaint. While I’m still not sure that I would do a live demo on my second go around, I’m going to hold myself on the remark above about my slides having graphs and data. That would also drag me away from the podium to talk through the slides, making it a more dynamic and interesting presentation; that would really help the audience take things in.

Your Turn

Now I need you to do two things. Put this information to work in your environment is what will help you the most. What will help me the most is you commenting on this post both before and after you implement any of this so you can help me become a better presenter.

Thank you for taking the time to read this!

Monitoring Database and Table Sizes


Trending database and table sizes helps give you an idea of what to expect, and, sometimes, points out problems and their root causes. I even go so far as to monitor the disk space, file growth rates, and have a report going out telling me that “according to your current growth trends, you’ll run out of space on this day”. That last part will have to be another post since getting disk space can be kinda tricky depending on how your disks are set up; you need to go outside of SQL Server to grab the size of mountpoints if you use them. Don’t worry about mountpoints right now, if you don’t know the term that usually means that you can swap out the word “drive” and have everything I say be accurate for your environment.

First, this is lightweight stuff, there’s really no reason to capture it more than once a day, and it will only add up to several MB for a year’s worth of data. You’re not going to kill yourself doing this, you’re not going to stare at the data every day taking up your time, but you’re going to look like a heavyweight when you put this into action and reference this data to others.

Second, I do have other similar posts doing snapshots of this same info for Table and File sizes. It’s basically the same thing just different in how you’re using it and if you’ll have a historical view of it.

Database sizes

Lets start with the databases because, well, they’re bigger. I capture it all, the database name, logical file name, file type (row, logs, etc), filegroup, allocated space, used space, max size, growth rate, if it’s percent growth, and the drive/mountpoint it’s on. Other than dividing the appropriate numbers by 1024 twice to store all my sizes in MB, everything gets stored in native format. The reason for that is you have raw data in a format easy for you to verify and easy for any outsider (consultant, new employee, etc) to understand everything you have. I may make some of you cringe because I don’t normalize this; it’s small data and not worth the effort, especially with page compression (see Brad McGehee’s post on compression) turned on.

What I like doing is having automated reports with logic behind it saying “if drive size is X, file growth is Y, then you can grow to a potential size of Z”, follow that up with “the used space in this database has grown X in the last 90 days and Y in the last 30 days, so according to those trends I’ll hit the potential size in Z days”. One of my favorite reports takes that last Z as a parameter and tells me anything that will run out of space before a specific number of days. I get that report daily for anything that will run out of space in the next two weeks and a separate instance of that report monthly telling me what will run out of space in the next 100 days. For me to run out of space without having a couple automated emails sitting in my inbox yelling at me saying “See, I told you that would happen!!!”, a database has to do something crazy and out of character. The only part of that last sentence that doesn’t really happen is the automated email yelling, those voices don’t exist outside of my head (do they?).

Even without monitoring disk space, you can predict autogrowths. This is useful for trying preventing autogrowth all together if you’re into that, but with instant file initialization this has never been a spot I’ve chosen to spend my time on. However, I do care about fragmentation, and I have a report that will tell me if we continue growing according to our 30 or 90 day trends then the data files will have to grow X times. If X is higher than 4 and filegrowth is below 2 GB chunks, it shows up on my report and I change the filegrowth to make it more reasonable. If you allow percent filegrowths in your environment (just say no), or even if one sneaks through, things get complex and you have to stroll through financial websites to find the calculations for compounding interest to figure out how much the files will grow.

Cast(CEILING(LOG((Cast((ProjectedGrowthInAMonthMB - FileGroupFreeMB) as Dec(20,4)) + Cast(FileSizeMB as Dec(20,4)))/(Cast(FileSizeMB as Dec(20,4)))) /LOG(1+(Cast(Growth as Dec(20,4))/100))) AS Int)

Something inside that database is growing, but what is it?

Now if a database is growing, the first thing someone will ask you is “why is it growing”. That’s an answer I can’t give you, but “where is it growing” I can do. This is why I’m also tracking the table sizes in a database. Be fair to yourself here and admit it up front that you don’t care if a table changes size radically if the largest size is still insignificant to that database. So make up rules on “I want to watch tables that are bigger than…” and only store the stats for those tables to keep your little collections from being the source of your size issues. For me, that means that a table has to be both over 100MB in total size including indexes and over 0.5% of the size of the database. I’ve used these sizes for a while now and have never had an issue on either side; no one needed info I didn’t have for space trending issues and the collection time and space I used never caused any complaints. This isn’t saying that someone won’t ask “did that table have 10 or 11 records this time last month”, but I’m not willing to collect enough to answer every little question.

If a database is growing at an alarming rate it’s easy to assume the largest table is causing it. That’s not always the case, and it’s not too uncommon for a database change to either create a new table that grows quick intentionally or accidentally cause an old table to stop cleaning itself up. No matter what the reason is, if you want to see how a database is growing, look at your table stats documenting the growth. Developers always listen to you better when you tell them this is what happened and here’s how I know while the words “I think” are viewed as instant discredidation.

How am I supposed to do this stuff?

You’ll need somewhere to store all the info. I like to store two weeks of info on the local server and pull everything to a central server nightly where data is held for a full 13 months. However, to keep me under control, lets just focus on getting everything to the local server in the following tables.

CREATE TABLE [Drives](
    [DateAdded] [smalldatetime] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
    [CapacityMB] [int] NULL,
    [FreeSpaceMB] [int] NULL,
 CONSTRAINT [DBInven_Drives_DriveLetter_MountPoint_DateAdded] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DriveLetter] ASC,
    --[MountPoint] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Files](
    [DateAdded] [smalldatetime] NOT NULL,
    [DatabaseName] [sysname] NOT NULL,
    [FileName] [nvarchar](128) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [FileGroup] [sysname] NULL,
    [SizeMB] [int] NULL,
    [UsedMB] [int] NULL,
    [FreeMB] [int] NULL,
    [MaxSizeMB] [int] NULL,
    [Growth] [int] NULL,
    [IsPercentGrowth] [bit] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
 CONSTRAINT [DBInven_Files_DatabaseID_FileID_DateAdded_U_C_IX] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [FileName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Tables](
    [DateAdded] [datetime] NOT NULL,
    [DatabaseName] [nvarchar](128) NOT NULL,
    [SchemaName] [nvarchar](128) NOT NULL,
    [TableName] [nvarchar](128) NOT NULL,
    [RowCounts] [bigint] NULL,
    [AllocatedMB] [int] NULL,
    [DataSizeMB] [int] NULL,
    [IndexSizeMB] [int] NULL,
    [PercentOfDB] [decimal](5, 2) NULL,
 CONSTRAINT [PK_Tables] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [SchemaName] ASC,
    [TableName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If you use mountpoints, and if you call them “mountpoint” followed by an incrementing digit, and you never have more than can fit into a tinyint, uncommenting those pieces of the tables will do wonders for you. In case you didn’t guess, they’re commented out because the people who do use them probably don’t use them that way.

DECLARE @RunTime SmallDateTime
DECLARE @SQL NVarChar(max)

SET @RunTime = (SELECT Max(DateAdded) FROM Perf.DBInven.Drives)

SET @SQL = N'use [?]; 
    INSERT INTO Perf..Files 
    select ''' + cast(@RunTime as nvarchar(50)) + ''' 
        , DatabaseName = db_name()
        , FileName = f.Name
        , TypeDesc = f.Type_Desc
        , FileGroup = fg.Name
        , SizeMB = ((f.size*8/1024)) 
        , UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
        , FreeMB = ((f.size-FileProperty(f.name, ''SpaceUsed''))*8/1024) 
        , MaxSizeMB = Case f.Max_Size
            WHEN -1 THEN -1
            ELSE f.Max_Size / 128
            END
        , f.Growth
        , f.Is_Percent_Growth
        , DriveLetter = left(f.physical_name,1) 
        /*, MountPoint = CASE substring(f.physical_name,4,10)
            WHEN ''MountPoint'' THEN 
                Case isnumeric(substring(f.physical_name,14,4)) --MountPoint >= 1000
                WHEN 1 THEN substring(f.physical_name,14,4)
                ELSE Case isnumeric(substring(f.physical_name,14,3)) --MountPoint >= 100
                    WHEN 1 THEN substring(f.physical_name,14,3)
                    ELSE Case isnumeric(substring(f.physical_name,14,2)) --MountPoint >= 10
                        WHEN 1 THEN substring(f.physical_name,14,2)
                        ELSE substring(f.physical_name,14,1) -- MountPoint single digit
                        END
                    END
                END
            ELSE Cast(0 as varchar(10)) --Root drive is MountPoint 0 
            END*/
    from sys.database_files f
        LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id
    '

exec sp_MSforeachdb @SQL

And now to grab the tables. Keep in mind that I specifically only want to grab stuff over 0.5% of the database size AND over 100 MB. You’ll see that towards the end of the code.

DECLARE @SQL NVarChar(max)
DECLARE @DateAdded SmallDateTime

SELECT @DateAdded = MAX(DateAdded) FROM Perf.DBInven.Drives 

SET @SQL = ' Use [?];
    DECLARE @DatabaseSize Float

    IF OBJECT_ID(''tempdb..#TableSizes'') IS NOT NULL BEGIN
          DROP TABLE #TableSizes
    END

    SELECT DatabaseName = DB_NAME()
    , TableName = a2.name
    , SchemaName = a3.name
    , RowCounts = a1.rows
    , TableSizeMB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , DataSizeMB = a1.data / 128
    , IndexSizeMB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
INTO #TableSizes
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> ''S'' and a2.type <> ''IT''

    SELECT @DatabaseSize = Sum(TableSizeMB) FROM #TableSizes

    INSERT INTO Perf..[Tables]
    SELECT ''' + Cast(@DateAdded as VarChar(50)) + '''
        , DatabaseName
        , SchemaName
        , TableName
        , RowCounts
        , TableSizeMB
        , DataSizeMB
        , IndexSizeMB
        , (TableSizeMB/@DatabaseSize) * 100 
    FROM #TableSizes
    WHERE TableSizeMB > .005 * @DatabaseSize 
        AND TableSizeMB > 100

    DROP TABLE #TableSizes'

exec sp_MSforeachdb @SQL

So this leaves us with the size of the drives. This isn’t an easy subject. You can get everything you need through SQL Server if you don’t use mountpoints, but you’re left with either Powershell or VBScript to capture it with mountpoints so you can query WMI. Another post, another day. I’d actually prefer to rewrite my VBScript as PowerShell to both learn PowerShell and put a better solution out here for you.

This was my last post I needed to prep for my Monitoring and Baselines Presentation for SQL Saturday where I talk about what you want to watch and why. Since the presentation didn’t get into how to watch everything, I made sure that everything was on my blog ahead of time.

Query Stats


DBAs rarely use the full potential of sys.dm_exec_query_stats.  It’s common to see the queries for looking at the most expensive queries according to any of the stats contained within the current cache, which is great to see.  However, if you grab snapshots of this information the proper way then you can get more detail that will really make a difference in your job.  Capturing this every 4 hours is probably enough, and here’s what it’s going to do for you.

First, using it the way most DBAs do and look at the most expensive queries according to a specific metric.  It works the same for them all, so we’ll just say we’re interested in physical reads in this case because our PLE isn’t what we’re hoping it could be.  You’ve probably seen the queries that directly query sys.dm_exec_query_stats to give you this similar to the one on MSDN for this DMV.  That will give you everything that’s in the cache right now, which could contain queries that recompile every hour in combination with other queries that have remained in the cache since the server was rebooted two weeks ago.  This means that when you’re querying to see what you can do to improve your PLE during business hours you’re getting anything that happens to be there now, not what ran in the time window you’re looking to improve.

Now look at it from the point of view that you capture this information every 4 hours.  You can get the total number of physical reads grouped by sql_handle where the interval_end time is in the last month, on weekdays, between 9:00 AM and 5:00 PM.  This is both a more complete and more accurate picture of what you care about, and you’re more likely to be focusing on the queries you should be tuning from the start.

Once you do start tuning, you want to know what effect you really had, and not just rely on the couple sets of test parameters you used in your NonProd environments.  A couple days after your changes made it to production, it’s a good idea to make sure you had the desired effect.  Query the data you have saved off for the last three days and for the same days last week.

What I went over so far is just realizing how this takes the traditional use of this to a new level, but there’s more functionality here now that it’s being saved off.  Now this is extremely useful in incident response.  Before you could see if anything was hogging resources as seen in my Recently Recompiled Resource Hogs post, but now you take it a step further by being able to verify if what’s running long now has always run that way or if something is different.  Also, with having the plan_handles saved off you can even see if it’s getting a different execution plan, which will help you determine if this was a change in data or if it’s a bad execution plan.

Not to shoot myself in the foot, but this isn’t perfect.  If you capture this data every 4 hours starting at 12:00 and it recompiles at 3:59 then you’ll only capture one minute for that interval.  Also, if you’re dealing with a query that can’t be cached then it won’t be found here at all; this is why you’re also running traces to find the big queries and watching sys.dm_os_performance_counters to know when there are a lot of compiles and recompiles.

To start capturing this, here are the tables you’ll need to capture everything.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

/*
IF OBJECT_ID('QueryStats') IS NOT NULL BEGIN
    DROP TABLE QueryStats
END

IF OBJECT_ID('QueryStatsLast') IS NOT NULL BEGIN
    DROP TABLE QueryStatsLast
END

IF OBJECT_ID('QueryText') IS NOT NULL BEGIN
    DROP TABLE QueryText
END
--*/

CREATE TABLE [dbo].[QueryStats](
    [interval_start] [smalldatetime] NOT NULL,
    [interval_end] [smalldatetime] NOT NULL,
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [execution_count] [int] NOT NULL,
    [total_elapsed_time_ms] [int] NOT NULL,
    [min_elapsed_time_ms] [int] NOT NULL,
    [max_elapsed_time_ms] [int] NOT NULL,
    [total_worker_time_ms] [int] NOT NULL,
    [min_worker_time_ms] [int] NOT NULL,
    [max_worker_time_ms] [int] NOT NULL,
    [total_logical_reads] [int] NOT NULL,
    [min_logical_reads] [int] NOT NULL,
    [max_logical_reads] [int] NOT NULL,
    [total_physical_reads] [int] NOT NULL,
    [min_physical_reads] [int] NOT NULL,
    [max_physical_reads] [int] NOT NULL,
    [total_logical_writes] [int] NOT NULL,
    [min_logical_writes] [int] NOT NULL,
    [max_logical_writes] [int] NOT NULL,
    [creation_time] [smalldatetime] NOT NULL,
    [last_execution_time] [smalldatetime] NOT NULL,
) ON [PRIMARY]
GO

--Wide is bad, if you have a lot of nonclustered indexes.  
--This is a high-writes, low-reads table, so a single clustered index is probably all I want.
--With Interval_End being the first key field, they're inserted in order and eliminate mass fragmentation
  -- as well as bad write performance.
CREATE UNIQUE CLUSTERED INDEX IX_QueryStats_intervalend_sqlhandle_statementstartoffset_planhandle_U_C ON QueryStats
(
    interval_end
    , sql_handle
    , statement_start_offset
    , plan_handle
)

CREATE TABLE [dbo].[QueryStatsLast](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryStatsLast_sqlhandle_planhandle_statementstartoffset_U_C] ON [dbo].[QueryStatsLast] 
(
    [sql_handle] ASC,
    [plan_handle] ASC,
    [statement_start_offset] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

CREATE TABLE [dbo].[QueryText](
    [sql_handle] [varbinary](64) NOT NULL,
    [QueryText] [nvarchar](max) NOT NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [objtype] [nvarchar](20) NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryText_sqlhandle_U_C] ON [dbo].[QueryText] 
(
    [sql_handle] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Now that the tables are created, here’s how I populate them.  Set this up to run once every 4 hours to keep the data small during typical running, and kick it up to once an hour or more if needed to pinpoint an issue if you need it.

DECLARE @interval_start smalldatetime
    , @interval_end smalldatetime

IF OBJECT_ID('tempdb..#QS') IS NOT NULL BEGIN
    DROP TABLE #QS 
END 

CREATE TABLE [dbo].[#QS](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) 
/*
--This sounded like a great idea, but it just slowed it down several seconds.
CREATE UNIQUE CLUSTERED INDEX TempQS_Cluster ON #QS  
(
    sql_handle
    , plan_handle
    , statement_start_offset
)
*/

--The Cached Plans Object Type is in here in case you want to treat ad-hoc or prepared statements differently
INSERT INTO #QS 
SELECT qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , cp.objtype
    , qs.execution_count 
    , total_elapsed_time_ms = qs.total_elapsed_time/1000
    , min_elapsed_time_ms = qs.min_elapsed_time/1000
    , max_elapsed_time_ms = qs.max_elapsed_time/1000 
    , total_worker_time_ms = qs.total_worker_time/1000 
    , min_worker_time_ms = qs.min_worker_time/1000
    , max_worker_time_ms = qs.max_worker_time/1000
    , qs.total_logical_reads
    , qs.min_logical_reads
    , qs.max_logical_reads
    , qs.total_physical_reads
    , qs.min_physical_reads 
    , qs.max_physical_reads
    , qs.total_logical_writes 
    , qs.min_logical_writes 
    , qs.max_logical_writes
    , qs.creation_time
    , qs.last_execution_time
    , DateAdded = getDate()
FROM sys.dm_exec_query_stats AS qs
    INNER JOIN sys.dm_exec_cached_plans cp on qs.plan_handle = cp.plan_handle
--WHERE cp.objtype NOT IN ('Adhoc')

INSERT INTO QueryText (sql_handle, QueryText, DatabaseName, objtype)
SELECT QS.sql_handle
    , QueryText = qt.text
    , DatabaseName = DB_NAME(max(qt.dbid))
    , max(QS.objtype)
FROM (SELECT #QS.sql_handle
        , #QS.objtype  
        FROM #QS
            LEFT JOIN QueryText QST ON #QS.sql_handle = QST.sql_handle
        WHERE QST.sql_handle IS NULL) QS
    CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) qt
GROUP BY QS.sql_handle
    , qt.text

--All the values are the same in each table
SELECT TOP 1 @interval_start = dateadded FROM QueryStatsLast
SELECT TOP 1 @interval_end = dateadded FROM #QS

IF @interval_start IS NULL BEGIN
    SELECT @interval_start = create_date
    FROM sys.databases
    WHERE name = 'tempdb'
END

INSERT INTO QueryStats (interval_start
    , interval_end
    , sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time)
SELECT @interval_start
    , @interval_end
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.execution_count - ISNULL(qsl.execution_count, 0)
    , qs.total_elapsed_time_ms - ISNULL(qsl.total_elapsed_time_ms, 0)
    , qs.min_elapsed_time_ms 
    , qs.max_elapsed_time_ms 
    , qs.total_worker_time_ms - ISNULL(qsl.total_worker_time_ms, 0)
    , qs.min_worker_time_ms
    , qs.max_worker_time_ms 
    , qs.total_logical_reads - ISNULL(qsl.total_logical_reads, 0)
    , qs.min_logical_reads 
    , qs.max_logical_reads 
    , qs.total_physical_reads - ISNULL(qsl.total_physical_reads, 0)
    , qs.min_physical_reads 
    , qs.max_physical_reads 
    , qs.total_logical_writes - ISNULL(qsl.total_logical_writes, 0)
    , qs.min_logical_writes 
    , qs.max_logical_writes 
    , qs.creation_time
    , qs.last_execution_time
FROM #QS qs
    LEFT OUTER JOIN QueryStatsLast qsl ON qs.sql_handle = qsl.sql_handle
        AND qs.plan_handle = qsl.plan_handle
        AND qs.statement_start_offset = qsl.statement_start_offset
        AND qs.creation_time = qsl.creation_time 
WHERE qs.execution_count - ISNULL(qsl.execution_count, 0) > 0 --Only keep records for when it was executed in that interval

TRUNCATE TABLE QueryStatsLast

INSERT INTO QueryStatsLast
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , objtype
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time
    , DateAdded
FROM #QS

Sys.dm_os_performance_counters Demystified


The dmv sys.dm_os_performance_counters is awesome, if you can understand it. This is how I make it easy to read and use. Here are the values I watch and why I watch them. My list isn’t going to be perfect and you’re not going to agree with it 100%, and I’m ok with that. First, there is no perfect. Second, if we agree 100% then one of us is just mindlessly following the other which isn’t good.

  • Cache Hit Ratio – I ignore this value, but I still monitor it. I will, ideally, never be the only DBA on a team again, and everyone seems to think this value is cool.
  • Page Life Exp – My favorite! When you read a page from disk into memory how many seconds will it stay there? Just don’t use the outdated “300” rule or your disks will catch on fire!!!

    Burning Drive!!!

    PLE = 300

  • Page Lookups/Sec – How many pages are read from memory.
  • Page Reads/Sec – How many pages are read from disk.
  • Page Writes/Sec – How many pages are written to disk.
  • Lazy Writes/sec – How many pages are written to disk outside of a checkpoint due to memory pressure.
  • Batch Requests/sec – How busy is the server?
  • Trans/sec – How busy is the server?
  • Total Server Memory – How much memory SQL Server is currently using. Typically ramps up to Target value and PLE is low as it ramps up since new pages are in memory dropping the average.
  • Target Server Memory – How much memory SQL Server is allowed to use. Should be the same as the max memory setting, but memory pressure can cause this to decrease.
  • Memory Grants Pending – How many processes aren’t able to get enough memory to run. Should be 0, always 0, if not then find out why.
  • Deadlocks – How many deadlocks are we getting. Most apps handle deadlocks gracefully, but they still lose time doing it. If this number starts going up, start looking into it.
  • SQL Compilations/sec – This is a hidden performance killer! Some queries can’t be cached so they’re compiled every time they’re run. I’ve seen this with a query being run once a second and a big server was running slower than my laptop. It’s normal for things to compile throughout the day, it’s not normal for this number to be 10x higher than before that last upgrade.
  • SQL Re-Compliations/sec – Same goes here. The counters aren’t that much different.

If you know a little about this DMV then you know these values are cryptic. There’s several ways this data is stored and it has to be retrieved differently for each type to be useful. Then many of these are cumulative since the server was restarted, which isn’t going to help too much. Even worse, MSDN failed us on this one and figuring out this DMV required help outside of that site. Now for the good news, the script below will take care of all of that for you and leave you with some easy reading with values you can filter to the time periods you care about. If you want to add some of your own counters then just follow my lead on one that has the same cntr_type, or you can go to Rabin’s blog post that I learned from.

IF object_id('tempdb..#OSPC') IS NOT NULL BEGIN
    DROP TABLE #OSPC
END

DECLARE @FirstCollectionTime DateTime
    , @SecondCollectionTime DateTime
    , @NumberOfSeconds Int
    , @BatchRequests Float
    , @LazyWrites Float
    , @Deadlocks BigInt
    , @PageLookups Float
    , @PageReads Float
    , @PageWrites Float
    , @SQLCompilations Float
    , @SQLRecompilations Float
    , @Transactions Float

DECLARE @CounterPrefix NVARCHAR(30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
                            THEN 'SQLServer:'
                        ELSE 'MSSQL$' + @@SERVICENAME + ':'
                        END

--Grab the current values from dm_os_performance_counters
--Doesn't do anything by instance or database because this is good enough and works unaltered in all envirornments
SELECT counter_name, cntr_value--, cntr_type --I considered dynamically doing each counter type, but decided manual was better in this case
INTO #OSPC 
FROM sys.dm_os_performance_counters 
WHERE object_name like @CounterPrefix + '%'
    AND instance_name IN ('', '_Total')
    AND counter_name IN ( N'Batch Requests/sec'
                        , N'Buffer cache hit ratio'
                        , N'Buffer cache hit ratio base'
                        , N'Free Pages'
                        , N'Lazy Writes/sec'
                        , N'Memory Grants Pending'
                        , N'Number of Deadlocks/sec'
                        , N'Page life expectancy'
                        , N'Page Lookups/Sec'
                        , N'Page Reads/Sec'
                        , N'Page Writes/Sec'
                        , N'SQL Compilations/sec'
                        , N'SQL Re-Compilations/sec'
                        , N'Target Server Memory (KB)'
                        , N'Total Server Memory (KB)'
                        , N'Transactions/sec')

--Just collected the second batch in the query above
SELECT @SecondCollectionTime = GetDate()

--Grab the most recent values, if they are appropriate (no reboot since grabbing them last)
SELECT @FirstCollectionTime = DateAdded
    , @BatchRequests = BatchRequests
    , @LazyWrites = LazyWrites
    , @Deadlocks = Deadlocks
    , @PageLookups = PageLookups
    , @PageReads = PageReads
    , @PageWrites = PageWrites
    , @SQLCompilations = SQLCompilations
    , @SQLRecompilations = SQLRecompilations
    , @Transactions = Transactions
FROM OSPerfCountersLast 
WHERE DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'TempDB')

--If there was a reboot then all these values would have been 0 at the time the server came online (AKA: TempDB's create date)
SELECT @FirstCollectionTime = ISNULL(@FirstCollectionTime, (SELECT create_date FROM sys.databases WHERE name = 'TempDB'))
    , @BatchRequests = ISNULL(@BatchRequests, 0)
    , @LazyWrites = ISNULL(@LazyWrites, 0)
    , @Deadlocks = ISNULL(@Deadlocks, 0)
    , @PageLookups = ISNULL(@PageLookups, 0)
    , @PageReads = ISNULL(@PageReads, 0)
    , @PageWrites = ISNULL(@PageWrites, 0)
    , @SQLCompilations = ISNULL(@SQLCompilations, 0)
    , @SQLRecompilations = ISNULL(@SQLRecompilations, 0)
    , @Transactions = ISNULL(@Transactions, 0)

SELECT @NumberOfSeconds = DATEDIFF(ss, @FirstCollectionTime, @SecondCollectionTime)

--I put these in alphabetical order by counter_name, not column name.  It looks a bit odd, but makes sense to me
--Deadlocks are odd here.  I keep track of the number of deadlocks in the time period, not average number of deadlocks per second.
  --AKA, I keep track of things the way I would refer to them when I talk to someone.  "We had 2 deadlocks in the last 5 minutes", not "We averaged .00002 deadlocks per second there"
INSERT INTO OSPerfCounters (DateAdded, Batch_Requests_Sec, Cache_Hit_Ratio, Free_Pages, Lazy_Writes_Sec, Memory_Grants_Pending
    , Deadlocks, Page_Life_Exp, Page_Lookups_Sec, Page_Reads_Sec, Page_Writes_Sec, SQL_Compilations_Sec, SQL_Recompilations_Sec
    , ServerMemoryTarget_KB, ServerMemoryTotal_KB, Transactions_Sec)
SELECT @SecondCollectionTime
    , Batch_Request_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec') - @BatchRequests) / @NumberOfSeconds
    , Cache_Hit_Ratio        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio')/(SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base')
    , Free_Pages            = (SELECT cntr_value FROM #OSPC WHERE counter_name =N'Free pages')
    , Lazy_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec') - @LazyWrites) / @NumberOfSeconds
    , Memory_Grants_Pending    = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Memory Grants Pending')
    , Deadlocks                = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec') - @Deadlocks) 
    , Page_Life_Exp         = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page life expectancy')
    , Page_Lookups_Sec      = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec') - @PageLookups) / @NumberOfSeconds
    , Page_Reads_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
    , Page_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec') - @PageWrites) / @NumberOfSeconds
    , SQL_Compilations_Sec  = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec') - @SQLCompilations) / @NumberOfSeconds
    , SQL_Recompilations_Sec= ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec') - @SQLRecompilations) / @NumberOfSeconds
    , ServerMemoryTarget_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Target Server Memory (KB)')
    , ServerMemoryTotal_KB  = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Total Server Memory (KB)')
    , Transactions_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec') - @Transactions) / @NumberOfSeconds

TRUNCATE TABLE OSPerfCountersLast

--Note, only saving the last value for ones that are done per second.
INSERT INTO OSPerfCountersLast(DateAdded, BatchRequests, LazyWrites, Deadlocks, PageLookups, PageReads
    , PageWrites, SQLCompilations, SQLRecompilations, Transactions)
SELECT DateAdded            = @SecondCollectionTime
    , BatchRequests            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec')
    , LazyWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec')
    , Deadlocks             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec')
    , PageLookups            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec')
    , PageReads                = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec')
    , PageWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec')
    , SQLCompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec')
    , SQLRecompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec')
    , Transactions             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec')

DROP TABLE #OSPC

Throw that code above here in a proc, schedule it to run every so often (I like 5 minutes) and it’ll….fail. It kinda relies on a couple tables you should create first. Here ya go.

CREATE TABLE OSPerfCounters(
    DateAdded datetime NOT NULL
    , Batch_Requests_Sec int NOT NULL
    , Cache_Hit_Ratio float NOT NULL
    , Free_Pages int NOT NULL
    , Lazy_Writes_Sec int NOT NULL
    , Memory_Grants_Pending int NOT NULL
    , Deadlocks int NOT NULL
    , Page_Life_Exp int NOT NULL
    , Page_Lookups_Sec int NOT NULL
    , Page_Reads_Sec int NOT NULL
    , Page_Writes_Sec int NOT NULL
    , SQL_Compilations_Sec int NOT NULL
    , SQL_Recompilations_Sec int NOT NULL
    , ServerMemoryTarget_KB int NOT NULL
    , ServerMemoryTotal_KB int NOT NULL
    , Transactions_Sec int NOT NULL
)

--You'll typically only query this by one value, which is added sequentually.  No page splits!!!
CREATE UNIQUE CLUSTERED INDEX IX_OSPerfCounters_DateAdded_U_C ON OSPerfCounters
(
    DateAdded
) WITH (FillFactor = 100)

--Only holds one value at a time, indexes are a waste
CREATE TABLE OSPerfCountersLast(
    DateAdded datetime NOT NULL
    , BatchRequests bigint NOT NULL
    , LazyWrites bigint NOT NULL
    , Deadlocks bigint NOT NULL
    , PageLookups bigint NOT NULL
    , PageReads bigint NOT NULL
    , PageWrites bigint NOT NULL
    , SQLCompilations bigint NOT NULL
    , SQLRecompilations bigint NOT NULL
    , Transactions bigint NOT NULL
)

The important part of all this is how you use it. It’s tempting to just look at the last 7 records and say that you know what’s going on, that makes me want to slap you. Every server is different, every server has different loads and baselines, and you’re either underworked or you don’t know what those baselines are for every server you manage. I do simple baselines every time I look at an incident and look at the last hour, the same time yesterday, and the same time a week ago. That gives you a chance to see what’s normal for this server and what’s different right now. This query is so simple you’ll wonder why I even posted it, but it’s effective which is why it’s here. The 7 records per day thing, that’s because 21 records show up on my screen without me scrolling, it is NOT a magic number!

SELECT 'Today', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
ORDER BY dateadded DESC
) X

UNION
SELECT 'Yesterday', * FROM (SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-1
ORDER BY dateadded DESC
) Y

UNION
SELECT 'Last Week', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-7
ORDER BY dateadded DESC) Z

ORDER BY dateadded DESC

And, well, something I’ve been skipping on my posts and telling people to handle cleanup on their own…. Here’s step 2 of my jobs that populate my monitoring tables to keep your data from being the ever-growing data you’re struggling with in every other app. I delete in batches according to the clustered index. It’s overkill for something deleting one row at a time, or, even if you put this in a separate daily job, 288 rows if the process is scheduled every 5 minutes. So, why the batches? Because I copy/paste my own code everywhere, batches is reusable, and this is how I chop off the tail end of EVERYTHING!

SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM OSPerfCounters
    where dateadded < (GetDate() - 400)
END 

In the beginning I mentioned that if you agreed with me 100% then one of us is a mindless monkey. Look, I put this out there first, so I’m obviously not the mindless monkey here, am I? There’s a box below that gives you a chance to show that you’re not a mindless monkey either! Tell me I’m wrong, how I can do better, and how everyone else reading this can benefit from it even more! I’ll promote you from mindless monkey to talking monkey!

SQL Server Running Slow


A user calls to say the app or server is slow today.  Here’s a quick summary of how I get started.  It depends on monitoring where you can compare baselines to recent activity, and I include links to all of that code I use.

The initial question is too vague.  You can’t blame this individual user because this is how 95% of issues make it to you.   The initial follow-up will be the following questions I ask the user along with the corresponding question your boss is asking you:

  • Can anyone else duplicate your issue?
    • Is this the user’s machine or a server issue?  Not a 100% conclusive answer, but it’s a start.
  • Is everything running slow or just one process or screen?
    • Is the whole server running slow or just one stored procedure?
  • Were any changes made to the system recently?  Depending on the user, this may be a question for someone else.
    • Did we cause it by making a change or did this just come up?

However, this is enough for me to do a very quick overall health check while I wait for those answers.  It seems like a lot at first glance, but it’s practically automated with the scripts I have on hand.  This is only opening 5 pre-written scripts, running them, and peeking at the results.

Quick Health:

  • Look at historical wait stats for the last two hours and the same time window both 1 and 7 days ago.  Documented in the Wait Stats post.
  • Look at memory and CPU contention for the same times as wait stats.  You’re specifically looking for high CPU usage, PLE drops, and changes to target memory. Documented in the OS Performance Counters post.
  • Look for blocking in the last hour, even though it would typically show up in wait stats.  You can compare this to previous days if needed, Documented in the Blocking post
  • Look at the large queries that completed in the last two hours along with the queries currently running to see if anything pops out.  This is just a quick glance looking for glaring issues, and you can look for details after getting answers to the follow-up questions.  Unfortunately, this information isn’t as cut-and-dry as the rest, and it’s going to rely on you being a little familiar with what normal is on this server.  Completed queries are documented in the Tracing Creation and Reading posts, and currently running queries are documented in the Running Processes post.
  • Look at the queries recompiled recently to see if a new execution plan is taking excessive resources.  Documented in the Recently Recompiled Resource Hogs post.

So, where do you go from here?  Yeah, you get the typical “it depends” answer from me.  The information you just looked at will normally comes up with something that doesn’t look right and needs to be looked into further.  However, there are so many directions this can go from here that a single document can’t cover it.  The purpose of this post is simply to answer the question “Where do I start looking with the most common question a user throws at me”, nothing more.  Also, I wanted a way to tie all of the random posts together so you could see them working together as the full solution.

Reading Traces


Capturing information is useless unless you know how to use it. Here’s what I use to get the most out of my basic trace of anything taking over so many seconds. This is also my template for more invasive traces, and it gets tweaked on a case-by-case basis.

It’s my typical style with half of it commented out where less run portions can be highlighted and run and portions of my WHERE clause don’t have to be rewritten to be added in. The first line gets highlighted as soon as I open this script every time so I can run it, copy the path of the trace I’m looking for, and paste it into my main script. As for the WHERE clause, call me lazy or whatever it is you call people who can’t remember the right word, but I don’t always remember the exact wording of each column name that I want to filter by. I simply uncomment them as I need them.

The only thing I can say is really impressive here is finding the job names, which is a piece of this script I will never forget figuring out. Other than that, just little things to help readability when both running this script and when playing with the WHERE clause. I can’t tell you why duration is microseconds and CPU is milliseconds, but I can say the SELECT and WHERE clauses are set up to use the same units I would use when I talk to someone.

--SELECT * FROM sys.traces

/*
--sp_trace_setstatus id, status
exec sp_trace_setstatus 3,0 --stop
exec sp_trace_setstatus 3,2 --close
--0 stop
--1 start
--2 close
*/

--query running trace
SELECT TOP 10000 Database_Name = DB_Name(DatabaseID)
    , DatabaseName
    , starttime
    , endtime
    , textdata = cast(textdata as varchar(4000)) 
    , Duration_Sec = cast(duration/1000/1000.0 as Dec(10,1)) 
    , CPU_Sec = cast(cpu/1000.0 as Dec(10,1)) 
    , Reads_K = cast(reads/1000.0 as Dec(10,0)) 
    , Writes_K = Cast(writes/1000.0 as Dec(10,1)) 
    , hostname  
    , LoginName 
    , NTUserName 
    , ApplicationName = CASE LEFT(ApplicationName, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(ApplicationName,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(ApplicationName, 67, len(ApplicationName)-67)
                    ELSE ApplicationName
                    END 
    , SPID
    , TE.*
FROM fn_trace_gettable('D:\Traces\DurationOver10sec_110.trc',default) T
    LEFT JOIN sys.trace_events TE ON  T.EventClass = TE.trace_event_id
WHERE endtime > DATEADD(Hour, -3, GetDate()) 
    --AND Duration > 1000000 * 15 --in seconds
    --AND CPU > 1000 * 10 --in seconds
    --AND Writes > 10000
    --AND Reads > 1000000
    --AND te.category_id = 2 --Auto Grow/Shrink found in default trace
    --AND t.databaseid = DB_ID('master')
    --AND t.DatabaseName = 'tempdb'
    --AND t.LoginName <> 'shood'
    --AND t.textdata like '%%'
ORDER BY endTime DESC, starttime DESC

So, for those of you not too familiar with tracing, what does all of this mean and what am I looking for? It depends on what’s wrong. The day after an app’s upgrade I’m looking for procs that snuck through testing and are killing the server. When a user complains about performance you might see a new proc showing up because it’s getting a bad execution plan. If tempdb or the logs fill up I can tell you what had 100,000 writes and just happened to end at that exact moment. Sometimes I’ll glance to see if anything is hitting or approaching the app’s 30 second timeout.

A general trace such as the RPC:Completed and SQL:BatchCompleted that take over X seconds won’t answer all of your questions, but it will be a good start. The most important things are to always have something running (could be extended events instead of a trace) to tell you the big things that are running and to have your monitoring limited enough where you aren’t the one slowing down the server. Make sure it’s customized to each server since a 1 second duration in a web app is a big deal, but other servers you would get flooded with results unless you bump it up to 10 seconds. You can run more invasive traces when you need them.

There are two hits against extended events where I’m even using traces for this. First, I still have too many SQL 2005 servers around and like to do things as consistent as reasonably possible across all servers to make it simple. Second, I’m still not too happy about extended events skipping SQL:BatchCompleted in SQL 2008 as that misses out on finding large loops of small statements. Extended events are much more efficient, can do a lot that traces can’t do, but aren’t a 100% replacement for traces.

Comment on my posts, let me know what you think. Every post I have at Simple SQL Server is written to myself 5 years ago, and is exactly what I wish I had back then in my own mind. The key words being “in my own mind”, and I’d like to hear what’s on your mind. Be it constructive criticism on this post or a completely separate SQL Server subject you wish you knew more about either now or 5 years ago that you’d like to see thrown up for everyone to use, I’ll never know until you tell me.

Tracing Introduction


SQL Server tracing is essential for troubleshooting performance issues, yet it can put loads on your server that would cause noticeable slowness, capture so much information that it fills the drive, and I’ve even seen it cause a cluster failover more than once.  With a tool this powerful you just need to know how to use it before you use it.  It’s almost like the idea where companies want their entry-level people to have a year of experience.

There is one trace I’d like to see on every server, preferably as a startup proc or a job that starts automatically when SQL Agent starts.  It doesn’t matter how you do it, if you want me to be happy you’ll tell me this was running for at least 24 hours before you had a performance incident.  All I’m looking for is RPC:Completed and SQL:BatchCompleted with a reasonable filter of duration over 10 seconds.

If you’ve never used Profiler before, here’s a quick overview to make the server-side trace I’m using.  Leave the first screen at the defaults, with the possible exception of changing the template to blank.  Select RPC:Completed and SQL:BatchCompleted and set the Duration filter to greater than or equal to 10,000 ms which you can get to using the “Column Filters…” button.  You can collect all columns if you want to, it won’t hurt with these events although there are some columns that won’t help.

Create trace with duration filter

Run the trace, then you can export it to a script.  I’m not happy that you have to run it first, but in the end this is either something that doesn’t affect the server at all or something I run on a dev server that’s at the same patch level.

Export trace to file

There, you have the basic script to start the server-side trace.  You still need to tweak it a bit before you can use it, but the hard part is done.  Everything highlighted in yellow was changed by me.  The default max file size was changed from 5 MB to 25 MB because I view 25 to 50 MB to be the range that’s not annoyingly small or too big to move around or send to someone easily.  The second parameter of sp_trace_create is set to 2 to enable rollover files, the third parameter is set to my output file sans an extension, and the last parameter is added to say how many output files there are.

/****************************************************/
/* Created by: SQL Server 2008 R2 Profiler          */
/* Date: 06/16/2013  06:10:02 AM         */
/****************************************************/

-- Create a Queue
declare @rc int
declare @TraceID int
declare @maxfilesize bigint
set @maxfilesize = 25

-- Please replace the text InsertFileNameHere, with an appropriate
-- filename prefixed by a path, e.g., c:\MyFolder\MyTrace. The .trc extension
-- will be appended to the filename automatically. If you are writing from
-- remote server to local drive, please use UNC path and make sure server has
-- write access to your network share

exec @rc = sp_trace_create @TraceID output, 2, N'D:\Traces\DurationOver10Sec', @maxfilesize, NULL, 5
if (@rc != 0) goto error

-- Client side File and Table cannot be scripted

-- Set the events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 10, 15, @on
exec sp_trace_setevent @TraceID, 10, 31, @on
exec sp_trace_setevent @TraceID, 10, 8, @on
exec sp_trace_setevent @TraceID, 10, 16, @on
exec sp_trace_setevent @TraceID, 10, 48, @on
exec sp_trace_setevent @TraceID, 10, 1, @on
exec sp_trace_setevent @TraceID, 10, 9, @on
exec sp_trace_setevent @TraceID, 10, 17, @on
exec sp_trace_setevent @TraceID, 10, 10, @on
exec sp_trace_setevent @TraceID, 10, 18, @on
exec sp_trace_setevent @TraceID, 10, 34, @on
exec sp_trace_setevent @TraceID, 10, 3, @on
exec sp_trace_setevent @TraceID, 10, 11, @on
exec sp_trace_setevent @TraceID, 10, 51, @on
exec sp_trace_setevent @TraceID, 10, 12, @on
exec sp_trace_setevent @TraceID, 10, 13, @on
exec sp_trace_setevent @TraceID, 10, 14, @on
exec sp_trace_setevent @TraceID, 12, 15, @on
exec sp_trace_setevent @TraceID, 12, 31, @on
exec sp_trace_setevent @TraceID, 12, 8, @on
exec sp_trace_setevent @TraceID, 12, 16, @on
exec sp_trace_setevent @TraceID, 12, 48, @on
exec sp_trace_setevent @TraceID, 12, 1, @on
exec sp_trace_setevent @TraceID, 12, 9, @on
exec sp_trace_setevent @TraceID, 12, 17, @on
exec sp_trace_setevent @TraceID, 12, 10, @on
exec sp_trace_setevent @TraceID, 12, 14, @on
exec sp_trace_setevent @TraceID, 12, 18, @on
exec sp_trace_setevent @TraceID, 12, 3, @on
exec sp_trace_setevent @TraceID, 12, 11, @on
exec sp_trace_setevent @TraceID, 12, 51, @on
exec sp_trace_setevent @TraceID, 12, 12, @on
exec sp_trace_setevent @TraceID, 12, 13, @on

-- Set the Filters
declare @intfilter int
declare @bigintfilter bigint

set @bigintfilter = 10000000
exec sp_trace_setfilter @TraceID, 13, 0, 4, @bigintfilter

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
select TraceID=@TraceID
goto finish

error: 
select ErrorCode=@rc

finish: 
go

This is done as a server-side trace for a couple reasons. Profiler needs to stay open to keep collecting data, SQL Server has to send the data to Profiler which is usually across the network, then the trace data is held in memory. None of these are a problem for a properly filtered short-term trace. This is properly filtered, but I’m really hoping everyone will run this long-term.

I owe you another blog post on querying traces, then we can move in to putting it all together with where to start when a user makes their favorite statement to you…”The server is running slow”.