Powershell, when were my servers rebooted?

One of my servers wasn’t rebooted when it should have been and I never realized it until after the outage was over.  The big deal wasn’t that it didn’t reboot, it was that I never realized it.  These are the things that bug me, so I used it as an excuse to learn a bit of PowerShell.

A little backstory to give the details a better chance of making sense to everyone…  Instead of having a couple overworked sys admins doing windows updates for every server, we each take care of our own with a sys admin checking that all updates were applied.  It’s the first I’ve heard of it being done this way, but I like it.

To keep things consistent, the database team has a policy of rebooting all database servers regardless of WSUS requiring it.  One month most of the servers required a reboot and bounced automatically, except for one that didn’t require a reboot.  After the first 10 or 15 rebooted, I got lazy and started just verifying the servers were up, SQL was responding, all updates were applied, and there was no pending reboot.

It wasn’t until I was reading the logs the next business day I realized that that server was missing all of the chatter from SQL Services starting up.  Like I said, the biggest issue was me being frustrated that it snuck by me, and I didn’t want that to happen again.  I wanted my solution to be independent of SQL Server because not all of the computers I do updates on have a SQL engine.

Wanting an excuse to learn PowerShell better, I talked to our Exchange guy (they know their PowerShell) and got a basic script to determine a computer’s boot time given its name.  I expanded that to be an entire AD group, then a list of AD groups, and finally using wildcards to find AD groups.  Wildcards work best for me because I’m reorganizing my SUS groups and all of mine have “SQL” in the groups name, even for the app servers I’m responsible for.

That resulted in the following code.  Now I’m absolutely certain that a PowerShell or AD guru can poke a ton of holes in my logic, but that’s what we have gurus for!  To be fair, I’d do the same to their T-SQL.  If any of those gurus are out there reading this DBA’s blog, the comment section below is how you fix my code and stop me from spreading bad code like a virus.

Import-Module ActiveDirectory

#either method works for getting a list of groups.  You can type in all of your groups or make a query to find them all.  I prefer the query if possible.

#$groups = "Test SQL Servers", "Prod SQL Servers 1", "Prod SQL Servers 2", "SQL Cluster Servers 1", "SQL Cluster Servers 2"

$groups = Get-ADGroup -Filter {name -like "*SQL *"} | where-object {$_.distinguishedname -like "*OU=SUS Group*"}

ForEach ($group in $groups) {
    $computerlist = Get-ADGroupMember $group -Recursive | SELECT name 

    #If there were any computers in the list, do this.  It skips empty groups.
    if ($computerlist) {
        $computerlistcount = $computerlist.Count

        #It returns an object instead of an array if there was only one, so count would be null
        if (!$computerlistcount) {
            $computerlistcount = 1
        }

        Write-Host ""
        Write-Host "$computerlistcount server(s) in" $group.name
    }

    ForEach ($computer in $computerlist) {
        try {
            Get-WmiObject win32_operatingsystem -ComputerName $computer.name -ErrorAction Stop | select csname, @{LABEL='LastBootUpTime';EXPRESSION={$_.ConverttoDateTime($_.lastbootuptime)}}
        }
        catch {
            #This logic is lacking.  I don't care what error you got, just say you couldn't connect and we'll move on.
            #For this script, it probably means the server is in the middle of a reboot
            Write-Host "Could not connect to" $computer.name
            Continue
        }
    }
}
 

There are relatively equivalent things you can do in SQL Server, but that’s assuming you’re running a SQL Server engine on every server you care about.  Even from the viewpoint of a DBA running the SUS updates on my own servers, this doesn’t hold up.  I still care about my monitoring application server and several others.

That being said, here’s what SQL can do for you:

SELECT create_date FROM sys.databases WHERE database_id = 2

exec xp_cmdshell 'wmic os get lastbootuptime'

There’s also some more bad news with these beyond requiring the SQL Server engine.

The create date of TempDB is useful, but it’s created on startup of SQL Services, not the computer.  This means if you restart SQL Services then you get a new date, even if you specifically state that you want to restart every computer during your SUS outage whether it’s required or not.

The WMIC call gets around that and tells you when the server came up, but it’s relying on xp_cmdshell.  This extended procedure has a bad rep for being a security loophole, even though by default to run the thing you need enough permissions to be able to turn it on.  Because of that, it’s turned off in many environments and flipping it on for a second can be frowned upon.

File Growths

I hinted at file growths in my previous posts about shrinking data and log files.  Then I talked about growing log files in my post Database Log VLFs.  However, there are still some unanswered questions about growing data files.

Should I use autogrowth?  What should I set it to grow by?

Why do manual growths, and how do I know how much to grow by or when a growth is needed?

Can I and should I use Instant File Initialization?  If I can and should, how do I even know if it’s turned on?

Leave Autogrowth On?

Yes, absolutely.  This isn’t a question.

This could be worded as “Do you want a slight pause or do you want your application to fail?”  The answer is a little more obvious when you word it that way.  The only time I’ve seen an appropriate answer of “No” is when you have multiple data files in a filegroup and only want one to grow.

To answer how long this pause will be you have to ask these two questions.  Are you able to use Instant File Initialization?  How much are you growing by if you can’t use Instant File Initialization?  I’ll discuss Instant File Initialization later in this post, but it’s basically the difference between “this space is yours immediately” instead of writing zeros SQL Server doesn’t care about to be paranoid about security.

There are competing factors as to how much you want to autogrowth to be.  The only real consensus is that the defaults are horrible.  You need to balance out how long of a pause it will take for a growth, how long of a pause your users will tolerate, how fast is your database growing, how much fragmentation smaller growths will cause, and how much fragmentation you will tolerate.

By default, data files grow at 1 MB and log files grow at 10%.  I covered log file growths rather well in my post Database Log VLFs, so we’ll focus on the data files here.  Think about it, if you have a 100 GB database that got there by the defaults, it got there in 102,400 pieces.  Not good.

If the defaults are horrible, what’s good?  Tiny is bad, huge is bad, and percentage-based growth just lets you automatically drift from tiny to huge.  Do your balancing act I talked about above.  If growing by 64 MB will keep you good for several weeks, set it to 64 MB.  If 1 GB goes by rather quickly, you can’t use Instant File Initialization, and the users can tolerate the pause it takes to grow by 1 GB, then this might be the best growth rate for you.  No single answer is right, but at least you have a better idea of what the variables are.

There are also cases where there are no good answers.  Say you have an app wouldn’t be uncommon to grow by a couple GB a day, you can’t use Instant File Initialization, and pauses of more than a second wouldn’t go over very well.  You don’t want the fragmentation of setting the growth to 64 MB, but you don’t want the pause of setting it to 1 GB.  Maybe you’ll settle for 256 MB, but you’re not happy about it.  Manual growths are the real answer, but you want a safety net that won’t kill you with pauses or fragmentation.  Not everyone is in that situation, but some of us are.

If I’m going to suggest leaving Autogrowth on and suggest that you be aware of the size of these growths, then the least I can do is give you a script for that.  This will tell you the size of every database file and the autogrowth rate.  I’ll often use it to tweak the growth rates, so I script that change out in the results as well.  If you read the post Database Log VLFs then it will make more sense why I use 1,000 MB here instead of 1,024 MB; it wasn’t a typo.

DECLARE @FilterByDB VarChar(100)
	, @Cmd NVarChar(4000)

SELECT @FilterByDB = ''
--SELECT @FilterByDB = 'tempdb'

IF Object_ID('TempDB..##FileSize') IS NOT NULL BEGIN
	DROP TABLE ##FileSize
END

CREATE TABLE ##FileSize
(	  DB VarChar(128)
	, FileLogicalName VarChar(128)
	, FilePhysicalName VarChar(512)
	, FileGroup VarChar(128)
	, Used_MB VarChar(128)
	, Free_MB VarChar(128)
	, Size_MB VarChar(128)
	, GrowthRate VarChar(128)
	, MaxSize VarChar(128)
)


SELECT @Cmd =  N'use [?]; 

IF ''?'' like ''' + @FilterByDB + '%'' OR ''' + @FilterByDB + ''' = '''' BEGIN 
INSERT INTO ##FileSize
SELECT DB = db_name()
	, FileLogicalName = f.name
	, FilePhysicalName = f.physical_name
	, FileGroup = ISNULL(g.name, f.Type_Desc) 
	, Used_MB = REPLACE(CONVERT(varchar(100), (CAST((FileProperty(replace(f.name, '''', ''''''''), ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
	, Free_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size - FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
	, Size_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size) / 128 AS money)), 1), ''.00'', '''')
	, GrowthRate = Case Is_Percent_Growth
					WHEN 1 THEN '''' + Cast(Growth as VarChar(100)) + ''%''
					ELSE REPLACE(CONVERT(varchar(100), (CAST(Growth/128 AS money)), 1), ''.00'', '''') + '' MB''
					END 
	, MaxSize = Case Max_Size
					WHEN -1 THEN ''---''
					WHEN 268435456 THEN ''---''
					ELSE REPLACE(CONVERT(varchar(100), (CAST(Max_Size/128 AS money)), 1), ''.00'', '''') + '' MB''
					END
FROM sys.database_files f
	LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id

END
' 
IF @FilterByDB <> '' AND 100 < (SELECT COUNT(1) FROM sys.databases) BEGIN
	SELECT @Cmd = REPLACE(@Cmd, '?', @FilterByDB)

	EXEC (@Cmd)
END ELSE BEGIN
	exec sp_MSforeachdb @Cmd 
END

SELECT *
	, PctUsed = Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
	, SetGrowth64MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 64MB )'
	, SetGrowthTo256MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 256MB )'
	, SetGrowthTo1000MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 1000MB )'
FROM ##FileSize
ORDER BY 1, 4, /*cast(replace(size_mb, ',', '') as int) DESC,*/ FilePhysicalName

DROP TABLE ##FileSize

Manual Growths

If you want to be extreme about it, autogrowth is like asking someone about wearing a seatbelt.  Do you want it to hurt a little or a lot if you “crash” by running out of allocated space?  Well, you want it to hurt…wait…why do I have to crash?

This may seem over-the-top, but some of the applications may have databases like the last one I talked about in the autogrowth section.  Typically we’d prefer not to have autogrowth handling everything to avoid the pause while it grows, but there are cases like that where it’s more than a slight preference to avoid the pause.

So the autogrowths are our seatbelts we’re putting on just in case while catching everything with manual growths is our real goal of avoiding that “crash”.  However, how are you supposed to know when to grow a file or how much to grow it?  Seeing how large your files are now doesn’t tell you how much they’ll grow in the next couple months, they could be static or growing exponentially.

The partial answer is to say how much data was used before, how much data is being used now, and predict how much data will be used in the future using that trend.  A better answer is to say what the peak usage was over a week and compare that to past weeks, which would account for things that cause spikes in file usage such as index rebuilds or data loads.

If you have monitoring software, it may have that answer just waiting for you to query it.  That’s the situation I find myself in right now, and I wrote Database Growth Trends – Idera Diagnostic Manager talking about how I handle it right now.  However, I used to work with monitoring software that pretty much only told me OS-level info, so we watched this data manually as was discussed in my post Monitoring Database and Table Sizes.

To be clear, many databases are fine with autogrowth handling everything.  Do what’s best for you for each individual database.  I personally have a mix of databases where some I’d be upset if autogrowth kicked in because it would cause timeouts that the app handles extremely poorly, and others, such as my DBA database, where a pause would only affect my collection jobs or a query I’m personally running.  Autogrowth is not inherently evil, so only be paranoid where paranoia is called for.

Performing Manual Growths

Manually growing the file is just setting a new initial size.  To do that in SSMS, right-click on a database, properties, files, then change the initial size to how large you want it to be.

I’m not a big fan of having SSMS do things for me, so I’ll use “Script Action to New Query Window” and run it form there.  This will give me a script like this:

USE [master]
GO
ALTER DATABASE [DBA] MODIFY FILE ( NAME = N'DBA_Data', SIZE = 256000KB )
GO

One thing I wish I knew about years before I did is that the size in KB also accepts sizes in MB.  I write just about everything in MB, so it’s more natural for me to write that script with Size = 250MB instead.  I know I’m being petty, but I’m petty and comfortable.

This script could take a couple milliseconds or several seconds, even for 250MB.  It all depends on if you have Instant File Initialization turned on and if you’re able to use it for this growth.

Instant File Initialization – What is it?

Short answer, this makes growths exponentially quicker by opening an often-acceptable small security hole.

Of course, I always have a long answer…

By default, all space used by SQL Server is zeroed out before control is handed over.  This makes sure that it’s clean space for two basic reasons.  First, the database requires that space to be clean because of how it plans to use it.  Second, we don’t want to risk the possibility of special commands being able to pull back unencrypted sensitive data that used to be there.

Now the first reason we don’t have much control or say over at this point in the conversation.  If you’re growing a log file then it needs to be zeroed out, and SQL Server will zero out each VLF when it’s marked for reuse, too.  For data files, if you’re using Transparent Data Encryption (TDE) then the way SQL Server stores the data also requires the space to be zeroed out.  These things are outside of the conversation on file growths, it’s just what we’ll have to deal with at this point and SQL Server won’t use IFI even if it’s turned on.

The one thing that will be pulled into this discussion is the security aspect of using uninitialized space.  Kimberly Tripp demonstrates this the best in her post Instant Initialization – Why, Why, and How? in the “Why isn’t this on by default?” section.

I’ll try to make it short and sweet if you don’t want to read yet another post (I recommend you read it). You can ask SQL Server to read specific pages off the disk.  If SQL Server never actually used that page then it contains what was there when it was given to SQL Server.  Zero initialized disks are very boring info, just “0000000000….”.  Not much fun.

Uninitialized space is more fun (depending on your meaning of the word) and could contain old data.  If you had unencrypted sensitive data sitting there before, it could contain enough to steal someone’s identity.  You never know.  These commands will only work if you have the permissions to run them, but who has those permissions and how much do you trust them?  Most organizations say it’s fine, but some won’t.

Instant File Initialization – How do I use it?

This is an odd feature that you can’t just see in a table in SQL Server.  The easiest way is to run an xp_cmdshell command and look for it, but you can optionally turn on trace flags, grow or create a database, turn off the trace flags, then look in your error log.

Paul Randal gets into both methods in his post How to tell if you have instant file initialization enabled?.  I don’t have any issues with xp_cmdshell, so I’ll let you go to Paul’s blog if your company insists on the more difficult method.

As for me, I’ll run the query below as a multiserver query in SSMS and look at the results.  If that sounds foreign to you, check out my Multiserver Queries post, it’s a very simple and useful technique for specific tasks such as this one.  In fact, I wrote that post so I could link to it in this post.

However, even if IFI is turned on, it’s only going to work for data files that aren’t encrypted with TDE.  Logs and TDE data files need the zeros.  Also, if any database on your instance is using TDE then so is TempDB, which means its data files can’t use IFI either.

CREATE TABLE #Output
(
	DataOut VarChar(2000)
)

DECLARE @XP_CmdShell_Enabled int

SELECT @XP_CmdShell_Enabled = CONVERT(INT, ISNULL(value, value_in_use))
FROM master.sys.configurations
WHERE name = 'xp_cmdshell'

IF @XP_CmdShell_Enabled = 0 BEGIN
	EXEC sp_configure 'show advanced options', 1
	RECONFIGURE WITH OVERRIDE 
	EXEC sp_configure 'xp_cmdshell', 1
	RECONFIGURE WITH OVERRIDE 
END 

INSERT INTO #Output 
EXEC xp_cmdshell 'whoami /priv'

IF @XP_CmdShell_Enabled = 0 BEGIN
	EXEC sp_configure 'xp_cmdshell', 0
	RECONFIGURE WITH OVERRIDE 
END 

SELECT IFI_Enabled = COUNT(1) 
FROM #Output 
WHERE DataOut LIKE '%SeManageVolumePrivilege%Enabled%'

DROP TABLE #Output

If IFI is turned off and it’s proper to turn it on for this server, here are your steps to turn it on:

  1. Open Administrative Tools / Local Security Policy
  2. Local Policies
  3. User Rights Assignment
  4. Open “Perform volume maintenance tasks”
  5. Add the user for the SQL Server service account

If IFI can’t be used, here are your steps (which are also good even if IFI is being used):

  1. Monitor database sizes for manual growths
  2. Grow files manually

The SQL Server service account, if you don’t know which account it is, can be found in Administrative Tools / Services / SQL Server (InstanceName) under “Log On As”.  There are easier ways find the service account name in bulk if you’re doing this a lot, such as a slight variation of the query above.

xp_cmdshell Rant

Now, to be clear on xp_cmdshell which I used in the script above…. I do believe in security, but I don’t believe disabling this feature is the security you’re looking for.  When it’s set up properly, disabling xp_cmdshell is like locking the door to your data center’s supply closet with a keyless lock instead of limiting access to the data center.

I do have issue with granting execute permissions on this to non-sysadmin accounts.  I also have issue with granting sysadmin to anyone who asks, including vendors who insist that’s the only way their app will work.  This means the only users that can use it are the very, very few with sysadmin.  Also, any sysadmin can turn it on anyways, so the only extra security you get is a message in the error log saying it was turned on, but not even who did it.

My recommendation is that if you need it, leave it on.  If you don’t, leave it off.  If you have a one-time script that needs it, such as checking for IFI, turn it on just long enough to do what you need to do then automate turning it off.  The script follows this recommendation; it’s left on if it was already on, but turns it off if it was off.

It’s time for me to admit I’m going off on a tangent, so I’ll pass you off to Sean McCown’s (b|tSecurity Theater post if you’re too fired up to stop now.

In Summary

Autogrowth – Yes.  Just as a fail-safe if performance is important to you.  Limit the size to limit the performance impact, but don’t cause undue fragmentation.

Manual Growth – Yes.  Trend how fast you’re growing so you know how to stay ahead of it.

Instant File Initialization – Usually.  Understand the limitations and security risks, and turn it on if it’s the right thing to do.

Entry-Level Content Challenge

This is my fourth post in Tim Ford’s Entry Level Content Challenge.  I went through much of my career with people saying to manually grow files, but it was left at that.  I had no clue how to see how much space was used without clicking through the GUI, and passed it off as being ridiculous.  Now I found ways to make it easy, and I’m betting there’s a lot of people out there doing too much work or skipping this simply for not knowing how.

Read over Tim’s challenge and consider joining in.  It’s a great way to help people get past barriers you had in the past while learning how to clear them a little more efficiently yourself.  Anyone who’s reading this post already knows how to clear some barriers that others are viewing as an impassible brick wall, so you can make a difference for them.

 

 

 

 

Database Growth Trends – Idera Diagnostic Manager

How fast are your databases growing?  Should I panic about a 1 TB database that has 100 GB free on disk?  Am I safe with a 100 GB database that also has 100 GB free?  Based on those sizes…I have no idea.

It’s possible that the 1 TB database is purging data as fast as it takes it in, so it’s been at 1 TB for a long time and may never use the 100 GB it has free.  Looking at the 100 GB database, it may have been a new 1 MB database created two months ago and grew 1 MB at a time to get to where it’s at now. (Bonus points if you panicked reading that last sentence.)

The thing to look for is how big that database was before.  What did you write down for how much space it used last week, last month, and several months ago?  Do you remember all those stock broker commercials saying “Past performance doesn’t predict future results” which is supposed to lead you on to think “Yeah, but it’s the best indicator I have.”  The same goes for database growths.  So, you did write down what it was before, right?

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll make a write up on them and give you credit for being awesome supporting your product like that.  I can only write against what I can test against, it’s nothing personal.

What I’m Watching

I want to know when my files will hit an autogrowth and when they’ll run out of space.

Running out of space is pretty obvious.  If you’re out of space, every command that needs more space will fail, and users will run to the supply closet for torches and pitchforks.  It’s a bad day for the DBA.

Autogrowth is less of an issue and less obvious.  That’s not saying it’s not an issues, especially for larger databases that can’t take advantage of IFI.  The command that needs more space, and any that pile up behind it, will wait patiently (barring any timeouts) for the growth to take place.  However, SQL Server can be more patient than your users or applications.

That being said, if I knew I’d need more space than is allocated in the database using a simple script then I could do it when no one was waiting for the space.  No complaints, just make the files larger and move on with my life.  Oh, happy days.

To get all of this information I need to know a couple things:

  • How much space is used in each filegroup
  • How much space is allocated to each filegroup
  • How much can each filegroup grow according to disk free space and growth rates

Idera DM Captures That

Well, Idera does a pretty good job.  It captures the data size (used), data expansion (growth by growth rate and disk free space), and data file size (allocated).  The problem is that it captures this stuff on the database level, not the filegroup level.  This means that it’s possible for a filegroup to run out of space while the database as a whole has tons of free space.

I’m currently working in an environment where all data files for a single database are on the same drive, so it’s less concerning to me.  It is something to keep in mind, and may be very important to your environment.  If this is critical to you, I touch on how to capture this data yourself below.

If separating filegroups isn’t an issue for you (it often isn’t) then all you have to do to trend growths with Idera is to write a very long script that hits SQLdmRepository..DatabaseSize a couple times.  My script goes a little like this:

/*
MaxUsed is always from a week long range.  The reason is that this captures any spikes in usage such as off-hours index maintenance.
*/

;WITH MostRecent AS
(
	SELECT X.DatabaseID
		, X.UTCCollectionDateTime
		, S.InstanceName
		, D.DatabaseName
		, DataFileSizeMB = DataFileSizeInKilobytes/1024
		, DataFilePotentialSizeMB = (X.DataFileSizeInKilobytes + X.DataExpansionInKilobytes)/1024
	FROM (
			SELECT DS.DatabaseID
				, RowNum = DENSE_RANK() OVER (PARTITION BY DS.DatabaseID ORDER BY UTCCollectionDateTime DESC)
				, UTCCollectionDateTime
				, DataFileSizeInKilobytes
				, DataExpansionInKilobytes
			FROM SQLdmRepository..DatabaseSize DS
			WHERE UTCCollectionDateTime > GETUTCDATE()-7 
		) X
		INNER JOIN SQLdmrepository..SQLServerDatabaseNames D ON X.DatabaseID = D.DatabaseID
		INNER JOIN SQLdmrepository..MonitoredSQLServers S ON S.SQLServerID = D.SQLServerID 
	WHERE RowNum = 1
)
, ThisWeek AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime > GETUTCDATE()-7 
	GROUP BY DS.DatabaseID 
)
, OneMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-37 AND GETUTCDATE()-30 
	GROUP BY DS.DatabaseID 
)
, ThreeMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-97 AND GETUTCDATE()-90 
	GROUP BY DS.DatabaseID 
)
, SixMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-187 AND GETUTCDATE()-180 
	GROUP BY DS.DatabaseID 
)
, OneYear AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-365 AND GETUTCDATE()-358 
	GROUP BY DS.DatabaseID 
)
, PreResults AS 
(
	SELECT MR.InstanceName
		, MR.DatabaseName 
		, MR.DataFileSizeMB
		, ThisWeek_MaxUsedMB = CAST(W.MaxUsedMB AS INT)
		, OneMonth_MaxUsedMB = CAST(M1.MaxUsedMB AS INT)
		, ThreeMonth_MaxUsedMB = CAST(M3.MaxUsedMB AS INT)
		, SixMonth_MaxUsedMB = CAST(M6.MaxUsedMB AS INT)
		, OneYear_MaxUsedMB = CAST(M12.MaxUsedMB AS INT)
		, NextGrowthInDays_OneMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M1.MaxUsedMB + .0001)*30 AS BIGINT)
		, NextGrowthInDays_ThreeMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M3.MaxUsedMB + .0001)*90 AS BIGINT)  
		, NextGrowthInDays_SixMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M6.MaxUsedMB + .0001)*180 AS BIGINT)  
		, NextGrowthInDays_OneYear = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M12.MaxUsedMB + .0001)*360 AS BIGINT)  
		, OutOfSpaceInDays_OneMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M1.MaxUsedMB + .0001)*30 AS BIGINT)  
		, OutOfSpaceInDays_ThreeMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M3.MaxUsedMB + .0001)*90 AS BIGINT)  
		, OutOfSpaceInDays_SixMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M6.MaxUsedMB + .0001)*180 AS BIGINT)  
		, OutOfSpaceInDays_OneYear = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M12.MaxUsedMB + .0001)*360 AS BIGINT) 
	FROM MostRecent MR
		INNER JOIN ThisWeek W ON MR.DatabaseID = W.DatabaseID
		LEFT JOIN OneMonth M1 ON MR.DatabaseID = M1.DatabaseID AND W.MaxUsedMB > M1.MaxUsedMB
		LEFT JOIN ThreeMonth M3 ON MR.DatabaseID = M3.DatabaseID AND W.MaxUsedMB > M3.MaxUsedMB
		LEFT JOIN SixMonth M6 ON MR.DatabaseID = M6.DatabaseID AND W.MaxUsedMB > M6.MaxUsedMB
		LEFT JOIN OneYear M12 ON MR.DatabaseID = M12.DatabaseID AND W.MaxUsedMB > M12.MaxUsedMB
)
SELECT InstanceName
	, DatabaseName 
	, NextGrowthInDays_Min = CASE WHEN NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_ThreeMonth , 1000000)
										AND NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_SixMonth, 1000000) 
										AND NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_OneMonth 
								WHEN NextGrowthInDays_ThreeMonth <= ISNULL(NextGrowthInDays_SixMonth, 1000000)
										AND NextGrowthInDays_ThreeMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_ThreeMonth
								WHEN NextGrowthInDays_SixMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_SixMonth
								ELSE NextGrowthInDays_OneYear
							END
	, OutOfSpaceInDays_Min = CASE WHEN OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_ThreeMonth, 1000000)
										AND OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_SixMonth, 1000000)
										AND OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_OneMonth 
								WHEN OutOfSpaceInDays_ThreeMonth <= ISNULL(OutOfSpaceInDays_SixMonth, 1000000)
										AND OutOfSpaceInDays_ThreeMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_ThreeMonth
								WHEN OutOfSpaceInDays_SixMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_SixMonth
								ELSE OutOfSpaceInDays_OneYear
							END
	, DataFileSizeMB
	, ThisWeek_MaxUsedMB
	, OneMonth_MaxUsedMB 
	, ThreeMonth_MaxUsedMB 
	, SixMonth_MaxUsedMB
	, OneYear_MaxUsedMB
	, NextGrowthInDays_OneMonth 
	, NextGrowthInDays_ThreeMonth 
	, NextGrowthInDays_SixMonth 
	, NextGrowthInDays_OneYear 
	, OutOfSpaceInDays_OneMonth
	, OutOfSpaceInDays_ThreeMonth
	, OutOfSpaceInDays_SixMonth
	, OutOfSpaceInDays_OneYear 
FROM PreResults
WHERE OutOfSpaceInDays_OneMonth < 100
	OR OutOfSpaceInDays_ThreeMonth < 100
	OR OutOfSpaceInDays_SixMonth < 100
	OR OutOfSpaceInDays_OneYear < 100
	OR NextGrowthInDays_OneMonth < 15
	OR NextGrowthInDays_ThreeMonth < 15
	OR NextGrowthInDays_SixMonth < 15
	OR NextGrowthInDays_OneYear < 15
ORDER BY DataFileSizeMB DESC

What this does is look at the trends based on what the database used 30, 90, 180, and 360 days ago to try to predict the future.  By “predict the future” I mean it’s not perfect, and can’t be perfect.  However, it gives you very good data with a few false positives and very few false negatives.

Speaking of this not being perfect, it double-dips.  If there are two databases growing on the same drive with 10 GB free, it says that each one has 10 GB of growth available.  I was not able to rewrite this to take that into account while also accounting for databases with multiple files in a single filegroup spread across multiple drives.

Now I’m two weeks ahead of projected file growths and over three months ahead of projected out-of-space errors.

This is very basic information, and I’d have trouble imagining any SQL monitoring software not capturing this in a table you can query.

If I Didn’t Have Idera…

If I didn’t have Idera, or any other monitoring software I could query, I’d be going out at least once a day and pulling the allocated and used space from every file on every server.  I may want to do it more often to find peak usage, such as when online index rebuilds are using twice their typical space over the weekends; I take advantage of Idera capturing this hourly.  I’ll capture it using something like this:

SELECT @SQL = 
'USE [?]

INSERT INTO DBA..FileSizes (DateAdded, Database_ID, File_ID, Data_Space_ID, SizeMB, UsedMB)
	select GETDATE() 
		, Database_ID = DB_ID()
		, f.File_ID
		, f.data_space_id
		, SizeMB = ((f.size*8/1024)) 
		, UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
	from sys.database_files f
	'

exec sp_MSforeachdb @SQL

Then I’d need to see how much free space I have on each drive, so I’d go to Ed Wagner’s article Time and Space: How to Monitor Drive Space in SQL Server.  Knowing that we use MountPoints in places, I’d have to change his script to use “volume” instead of “logicaldisk” (also mentioned in his comments).  This minor change is about the best case scenario you’ll run across.

Remember, the scripts you find on the internet are typically designed to work perfect on the AUTHOR’s system, not yours.  It’s not Ed’s fault they don’t use MountPoints, in fact he should probably be congratulated for not making his environment more complex than it needs to be.  He should also be congratulated on not making a change to his article that he can’t properly test.  At least you know you have to be extra careful testing this change instead of being led to believe the author did something they couldn’t do properly.

After getting the drive’s free space, I’d have to look at the file growth rates to see how large my files could grow.  Keep in mind that growth in sys.database_files could be either a percentage or the number of 8kb pages the file will grow by, depending on the value in is_percent_growth.  Static growths are easy, just divide by 128 and that’s how many MBs it will grow by.  It’s easier (and better for your server) to change percentage growths to MB than it is to look up the logarithmic functions used to calculate compound interest that also apply here.

Now that you have the free space on the drive and the growth rates on the files, you can calculate the maximum size the file can grow to.  Use this with trends on the used space to calculate when that size won’t be enough for you.

Make sure you test your results, you’ll have to support it.

So Your Database Is Growing…

Databases get larger, it’s what they do.  A normal DBA then goes through and makes sure that there’s room for that growth.  A better DBA will do that, too, but then follow it up by figuring out why the database is growing and try to justify it.

I have seen way too many databases out there that don’t purge old data that’s useless to the company.  Sometimes the purge was never written, sometimes the code to do that is never being called for several somewhat humorous reasons.  Any way you look at it, the data is growing and it’s easy to fall into the trap of just adding space every month, forever.

The way I handle this is by watching my tables and seeing how they’re growing.  These are treated just like databases where the largest one isn’t necessarily the growth issue, and looking at historical data to see the growth trend is a better solution.

In the past I wrote a post called Monitoring Database and Table Sizes to get this info yourself.  In the future, I’ll write Table Growth Trends – Idera Diagnostic Manager.  At the time of this writing, I’m planning on the Table Growth post coming out three weeks after this one.  Until then, I put this script up on my Idera Diagnostic Manager Scripts page, which is where I keep these posts organized and where I put my scripts to wait for me to blog about them.

 

Multiserver Queries

You can run queries against multiple servers at once, and it’s quite useful for a number of reasons.  I use it to check settings, verify backups and DBCC ran recently on unmonitored servers, make sure all servers rebooted during the maintenance window, and many other reasons.

This is all done through registering servers on SQL Server and opening a new query for the group.  I’ll walk you through that then run a couple queries that I find useful.

Unfortunately, this method is so easy that you’ll only ever have to see it once and won’t be back to visit this blog post again.  Hopefully the scripts below will inspire a return visit.

Setting It Up

Registered servers are local to your SSMS install.  You’re not changing server settings by setting this up, and this will be lost if you reinstall SSMS without exporting your settings.

First, open the Registered Servers pane by going to View / Registered Servers.

View Registered Servers

Add a new server group and call it anything you want.  I’ll call mine blog because that’s what I’m using it for.  I love the idea of having Prod and Non-Prod groups, especially since I can run a query against all my Non-Prod servers that aren’t monitored to verify backups and DBCC checks are being done.

It’s important to note at this point that you can have a server in more than one group and groups can be nested.  So in my prod group I have groups for each data center, then the servers are at that level.  I could also have the same servers in functional groups, such as Finance, IT, and Why_Me.

Create Registered Servers Group

 

Right-click and do a New Server Registration, and the options should be pretty natural to you.  For this example, I used aliases for “Registered Server Name”, but I stick to the default when doing this for myself.

At the most basic level, it should look like this.

Registered Server Group

That’s it, you’re set up.

Running Queries

This is easier than the setup.

Right-click on a group and click on New Query.

Multiserver New Query

It opens a new query window with the only oddity being instead of a yellowish bar saying “Connected. (1/1)”, now you have a pink bar saying “Connected. (2/2)” along with the group name.

Multiserver Connected

This will be connected to all servers directly in the group and in groups nested within that group.  There will be a long delay if one of the servers isn’t available, so it’s worth while to keep these groups cleaned up.

Now you run a query.  Here’s how the results are grouped by default.  Although they showed up in order for me, that was a coincidence.  They’ll actually be in the order they came back, so a server that returned results in 1 second will come before a server that returned results in 2 seconds.

Multiserver Results

You can go to Tools / Options and change a couple things, but the defaults tend to work great.

Multiserver Query Options

Now that it’s set up, right-click on your group and go to Tasks / Export… to save this off somewhere.  The only place this is saved right now is on your workstation, and those words will make any DBA cringe.

Practical Uses

Doing “SELECT 1” like I did above is always a blast, but it’s not all that productive.  As always, I encourage you to play around with it yourself, but here’s some things to get you started.

Server (services) last started

When I do Windows Updates I reboot all the database servers every month, even if the updates don’t say the needed the reboot at that time.  However, can be difficult to tell which servers I rebooted and which ones just aren’t begging for a reboot.  Since TempDB is recreated each time services start, I check to make sure that happened during the WSUS window.

SELECT create_date FROM sys.databases WHERE database_id = 2

If I was doing this for hundreds of servers, I’d simply change the query to say AND create_date < GetDate()-1 then only my problem servers would return results.

DBCC and Backups up-to-date

My non-prod servers aren’t monitored, so things go missed sometimes.  Backup jobs may be failing and I don’t know until I have time to check everything on all my non-prod servers.  Unfortunately, that doesn’t happen very often.  As with everything else in life, I cheat as much as possible to make it easier.  This time by running this as a multiserver query:

--Leaving some overlap to avoid false alarms due to timing issues
--We want to know if there hasn't been:
--	Full backup in the last 8 days (should happen every 7)
--	Full or diff backup in the last 2 days (should happen every 1)
--  Log backup in the last 4 hours (should happen every 1 or more often)
--  DBCC CheckDB run successfully in the last 8 days (should happen every 7)

SET NOCOUNT ON

DECLARE @FullDays Int
SELECT @FullDays = 8

DECLARE @FullDiffDays Int
SELECT @FullDiffDays = 2

DECLARE @LogHours Int
SELECT @LogHours = 4

DECLARE @DBCCDays Int
SELECT @DBCCDays = 8

DECLARE @SQLText NVarChar(4000)

IF OBJECT_ID('tempdb..#Results') IS NOT NULL BEGIN
	DROP TABLE #Results
END

CREATE TABLE #Results 
(
	ResultText NVarChar(2000)
)

IF Object_ID('TempDB..#DBCC_Temp') IS NOT NULL BEGIN
	DROP TABLE #DBCC_Temp
END

CREATE TABLE #DBCC_Temp 
(
	ParentObject VARCHAR(255),
	[Object] VARCHAR(255),
	Field VARCHAR(255),
	[Value] VARCHAR(255)
)

--When was the last full backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a full backup in over ' 
	+ Cast(@FullDays as VarChar(2)) 
	+ ' day(s).  Last full backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d 
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type = 'D'
WHERE d.name <> 'tempdb'
	AND d.state = 0
GROUP BY d.name 
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < GetDate()-@FullDays 

--When was the last full or differential backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a full or diff backup in over ' 
	+ Cast(@FullDiffDays as VarChar(2)) 
	+ ' day(s).  Last full or diff backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type IN ('D', 'I')
WHERE d.name <> 'tempdb'
	AND d.state = 0
GROUP BY d.name  
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < GetDate()-@FullDiffDays 

--When was the last log backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a log backup in over ' 
	+ Cast(@LogHours as VarChar(3)) 
	+ ' hour(s).  Last log backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type = 'L'
WHERE d.name NOT IN ('tempdb', 'model')--Model is by default in full, rarely backed up, and not modified enough to complain about.
	AND d.recovery_model_desc <> 'SIMPLE'
	AND d.state = 0
GROUP BY d.name  
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < DateAdd(Hour, -1 * @LogHours, GetDate()) 

SELECT @SQLText = N'use [?]; 

DECLARE @DBName sysname
DECLARE @HasRetried bit 

SELECT @DBName = CAST(''?'' as VarChar(256)) 
SELECT @HasRetried = 0

TRUNCATE TABLE #DBCC_Temp

IF EXISTS (SELECT d.state FROM sys.databases d WHERE d.name = @DBName AND d.State = 0 /*Online*/ AND d.is_read_only = 0 AND d.create_date < GetDate()-7) AND ''tempdb'' <> @DBName BEGIN
	INSERT INTO #DBCC_Temp
	EXECUTE(''DBCC PAGE (['' + @DBName + ''], 1, 9, 3)WITH TABLERESULTS, NO_INFOMSGS'')

	INSERT INTO #Results (ResultText)
	SELECT CAST(serverproperty(''ComputerNamePhysicalNetBIOS'') as sysname) 
		+ ISNULL(NULLIF(''\'' + CAST(serverproperty(''InstanceName'') as sysname),''\MSSQLSERVER''),'''') 
		+ ''.''
		+ @DBName
		+ '' hasn''''t had a successful DBCC check in the last ' + CAST(@DBCCDays as VARCHAR(2)) + ' day(s).  Last successful check was ''
		+ ISNULL(NULLIF(MAX([Value]),''1900-01-01 00:00:00.000''), ''NEVER'')
	FROM #DBCC_Temp
	WHERE Field = ''dbi_dbccLastKnownGood''
	HAVING MAX([Value]) < GetDate()-' + CAST(@DBCCDays as VARCHAR(2)) + '
END
'

exec sp_MSforeachdb @SQLText 

SELECT ResultText 
FROM #Results

DROP TABLE #Results
DROP TABLE #DBCC_Temp

If I get any results from this, I have something to fix.  This also works for daily full backups because it does diff or full, but you may want to change it around to fit your maintenance schedules.

Version Check

If you’re evaluating what version all of your servers are on so you can make plans to patch them, just run this:

SELECT @@VERSION

You have everything you need to compare it to SQLServerBuilds.BlogSpot.com.  The only thing is that I tend to copy/paste the results into Excel so I can sort them there.

More Uses

I use registered servers and multiserver queries for more than this.  In fact, I wrote this post so I could link to it in a future post on Instant File Initialization.  Once I have that one here, I’ll have four examples of uses.

Other readers and I would benefit from hearing more uses of this feature.  Let us hear the best uses you have, preferably with a script.

Entry-Level Content Challenge

This is my third post in Tim Ford’s Entry Level Content Challenge.  I was tempted to write a post and just tell people to run a script as a multiserver query.  After all, it’s very simple after you see it for the first time.  However, you have to see it for the first time before you get to that point.

This is the exact purpose of his challenge.  Read over it and consider joining in.

 

Recompile Monitoring using XEvents

Recompiles can be a hidden bottleneck on your server, and it may not be too obvious.  I should know, it happened to me.  Unfortunately for me, that was before I was a proponent of tracing or using XEvents to watch for recompiles.  Fortunately for you, the experience changed me, and I enjoy sharing my experiences.

What happened to me was a proc that was called WAY too often was taking too long cumulatively.  I knew the true root cause of my issue was in the application which was calling the proc once per row, often for about hundred rows, when loading a pretty important page on a web application.  However, I just found out the service desk has been getting complaints for a while, and asking the development team to help me rewrite how the page works ended in an answer that didn’t even sound like now.  Well, I guess “no” does kinda sound like “now”…

The proc was pretty simple, taking on average about 0.150 seconds to run.  Sounds great until you remember that it’s being called about a hundred times for a page load on a web application.  100 * 0.15 = 15 seconds for the page to load, not counting anything else running.  Users were threatening to not be users anymore even though they were happy with most other aspects.

I got into it, reworded the query a little, created a perfect index for it, and it got it down to 0.100 seconds to run.  This was a large concession on my part because I HATE perfect indexes for a single query, and feel strongly that indexes should focus more on being reusable by many queries.  However, even that only got the web page to load in 10 seconds.

Traces of the proc confused me.  The statements were running in 0.010 seconds, and there was a 0.090 second gap between runs.  That gap was 9 of my 10 seconds.  At 10 seconds users avoid a website, but at 1 second users have to be prompted to say it’s not as fast as they’d like.

It took me longer than I’d like to admit to figure out that every run was kicking off a recompile that took 0.090 seconds.  This is the point where I started tracing for the recompile reason, and found out it was because statistics were being updated.  However, there were no stats in my database that were updated in the last 15 minutes, so I would have never guessed this was the reason without tracing on this event.

This proc loaded data into a temp table, then did some processing on that temp table, and I figured out that there were enough rows being loaded into the temp table to update the stats on it.  Removing the temp table was a disaster, it raised my execution time by 0.010 seconds.  Yeah, “disaster” can be relative…

Then I came up with a very odd solution that I’ve never seen or heard of before that point, and I haven’t used much since that point.  The temp table was swapped out for a table variable.  No stats to be updated, so no recompile.  Although it assumed there would only be one record in the table variable (it always does), the execution plan was the same.  The only difference was the total 0.100 second duration was dropped to 0.010 after losing the 0.090 recompile.

User experience was now a web page that loaded up in just over 1 second, with most of it still being this query.  There wasn’t anything more I could do except wait until the page could be rewritten by development where I could rewrite it to get all the results at once.  That process took time, and customers were not impatiently watching to see when it would happen.

I wrote this talking about the users, and users care about duration.  However, that’s not the whole story.  The 0.090 recompile was pure CPU time, it happening 100 times per page call added up to 9 seconds of pure CPU time, and the users having to use this page hundreds of times a day added up to a noticeable impact on the server performance when it went away.  Now I admit that the users only used this when they HAD to instead of when they should have, so it was balanced out a little by the users opening the page about twice as often.  Even then, it was still obvious that it was gone when comparing long-term CPU usage.

The Management Response

Side note for all of you managers out there…  My manager pulled me aside the next week saying “step into my office for a second.”.  I admit, that statement always makes you nervous, even when you KNOW everything is going great.  He proceeded to tell me that before I started with the company they pulled developers and a DBA into a task force that spent tons of money to improve performance, upgrading servers, upgrading the SAN, upping the WAN speed, and looking into the database.  Then he said that I did more in one day than they did in 6 months without spending any money, and that I should take my wife out for a nice dinner and bring him the receipt.

Now I know that the hardware upgrades helped out, that I only helped a single process that day, and many other things that could belittle what happened.  I also know that spending $100 was very minor in terms of the effect that change made for the company.  That’s not how I saw things.

What I saw was a response that stuck with me, a lot.  Management stepped back to recognize a difference someone made, and made sure they knew they were appreciated.  Then took it a step further bringing the family into it, having my wife feeling proud and appreciated at the same time.

I know the constraints management is under and all the things they can’t do.  That being said, the only thing that frustrates me about this situation is that I don’t see others in management taking the same approach.  I swore from that point that if I ever became a manager my one odd request would be that I would want a very small budget in writing to use at my digression.  Small things to show appreciation go a long way.

My Challenge

The challenging part for me, and where I felt I didn’t do as well as I should have done, is that this flew under my radar.  I was tracing for anything that took over 1 second duration and was looking at the most expensive queries all the time.  However, this one took 0.150 seconds and never accumulated time in the plan cache due to the recompiles.

Here’s the thing, I pride myself on being specialized in database performance, so I feel I should have seen this before having someone in customer service casually mention it to me in conversation.  It wasn’t even a request to look into it, just mentioning he just got off a call with the customer about the page, and that the page never worked right with no one being able to fix it.

After finding and resolving the root cause of the proc, I also wanted to find and resolve how this flew under my radar.  This came up to two primary responses.  First, the page was taking too long as a whole, so I requested stats on a per-page basis – how long it took to load each time and how long cumulatively.  Second, I started watching for recompiles from time to time.  Either one of these would have led me to this issue well before that point.

This post is not about monitoring performance on a website, so, although it’s a very important thing to do, I won’t be mentioning that part again.

Monitoring Recompiles

This post is about recompiles, how they can drag you down, and how you can find out about them.  I covered how they drug me down, and in past posts I casually mentioned you should use server-side traces or XEvents to monitor them.  Then I left it up to you to figure out how.  Lets try a different approach today.

Recompiles aren’t something I’m watching for constantly, so I’m not going to run any kind of monitoring constantly for them.  What I’ll do is run an XEvent session for 24 hours and analyze what it picked up.  If it didn’t recompile often that day it almost definitely does not belong on my priority list, so this works for me.

Here’s my session.  It starts immediately, writes to five 10MB rollover files, and will not start the next time your services restart.

IF EXISTS (SELECT 1 FROM sys.server_event_sessions WHERE name = 'Recompiles')
	DROP EVENT SESSION [Recompiles] ON SERVER;
GO

CREATE EVENT SESSION [Recompiles] ON SERVER 
ADD EVENT sqlserver.sql_statement_recompile
(	SET collect_object_name=(1)
		, collect_statement=(1)
	ACTION
		(sqlserver.database_id
			, sqlserver.database_name
			, sqlserver.session_id
			, sqlserver.sql_text
			, sqlserver.username))
	ADD TARGET package0.event_file
		(SET filename=N'Recompiles'
			, max_file_size=(10)
)
WITH 
	(MAX_MEMORY=4096 KB
		, EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS
		, MAX_DISPATCH_LATENCY=30 SECONDS
		, MAX_EVENT_SIZE=0 KB
		, MEMORY_PARTITION_MODE=NONE
		, TRACK_CAUSALITY=OFF
		, STARTUP_STATE=OFF)
GO

ALTER EVENT SESSION Recompiles
ON SERVER
STATE = START;

Querying the Recompiles XEvent Session

Then I’ll use this query to look at my session.  It’s a little awkward in two ways.

First, I only leave the session running for a day, but this script needs it to be running to read from it.  The only reason for that is it pulls the file path from the running sessions.  You can get around this by putting the path into the script manually.  As for me, I usually query it and figure out my next steps while it’s running, then I stop the session.

Second, and this is quite humorous to me, it recompiles due to my temp tables every time it runs.  It’s a great example of what to ignore because it’s a low-cost recompile that happens on a seldom-used ad-hoc query where performance is not a major priority.  However, because temp table definitions are cached for proc plans, this goes away if you turn this script into a proc.  I laugh every time…

When you get the results in the #Queries temp table in the end they’re what you’re used to with it just being a table and all the XML out of the way.  The query at the end of the script views them all, but I’ll play around with it doing aggregations and more.  I encourage you to do the same.  This is also why it drops the table if it exists in the beginning instead of dropping them at the end.

DECLARE 
	@SessionName SysName 
	, @TopCount Int = 1000
	
SELECT @SessionName = 'Recompile'

--SELECT @SessionName = 'system_health'
/* 
SELECT * FROM sys.traces

SELECT  Session_Name = s.name, s.blocked_event_fire_time, s.dropped_buffer_count, s.dropped_event_count, s.pending_buffers
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE target_name = 'event_file'
--*/

IF OBJECT_ID('tempdb..#Events') IS NOT NULL BEGIN
	DROP TABLE #Events
END

IF OBJECT_ID('tempdb..#Queries') IS NOT NULL BEGIN
	DROP TABLE #Queries 
END

DECLARE @Target_File NVarChar(1000)
	, @Target_Dir NVarChar(1000)
	, @Target_File_WildCard NVarChar(1000)

SELECT @Target_File = CAST(t.target_data as XML).value('EventFileTarget[1]/File[1]/@name', 'NVARCHAR(256)')
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE s.name = @SessionName
	AND t.target_name = 'event_file'

SELECT @Target_Dir = LEFT(@Target_File, Len(@Target_File) - CHARINDEX('\', REVERSE(@Target_File))) 

SELECT @Target_File_WildCard = @Target_Dir + '\'  + @SessionName + '_*.xel'

--SELECT @Target_File_WildCard
CREATE TABLE #Events 
(
	event_data_XML XML
)

INSERT INTO #Events 
SELECT TOP (@TopCount) CAST(event_data AS XML) AS event_data_XML
FROM sys.fn_xe_file_target_read_file(@Target_File_WildCard, null, null, null) AS F
ORDER BY File_name DESC
	, file_offset DESC 

SELECT  EventType = event_data_XML.value('(event/@name)[1]', 'varchar(50)')
	, ObjectName = event_data_XML.value ('(/event/data  [@name=''object_name'']/value)[1]', 'sysname')
	, ObjectType = event_data_XML.value ('(/event/data  [@name=''object_type'']/text)[1]', 'sysname')
	, UserName = event_data_XML.value ('(/event/action  [@name=''username'']/value)[1]', 'sysname')
	, Statement_Text = ISNULL(ISNULL(event_data_XML.value ('(/event/data  [@name=''statement'']/value)[1]', 'NVARCHAR(4000)'), event_data_XML.value ('(/event/data  [@name=''batch_text''     ]/value)[1]', 'NVARCHAR(4000)')), event_data_XML.value ('(/event/data[@name=''wait_type'']/text)[1]', 'NVARCHAR(60)'))
	, Recompile_Cause = event_data_XML.value ('(/event/data  [@name=''recompile_cause'']/text)[1]', 'sysname')
	, TimeStamp = DateAdd(Hour, DateDiff(Hour, GetUTCDate(), GetDate()) , CAST(event_data_XML.value('(event/@timestamp)[1]', 'varchar(50)') as DateTime2))
	, SPID = event_data_XML.value ('(/event/action  [@name=''session_id'']/value)[1]', 'BIGINT')
	, Database_Name = DB_Name(event_data_XML.value ('(/event/action  [@name=''database_id'']/value)[1]', 'BIGINT'))
	, EventDetails = event_data_XML 
INTO #Queries
FROM #Events

SELECT q.EventType
	, q.ObjectType
	, q.ObjectName
	, q.Statement_Text
	, q.Recompile_Cause
	, q.TimeStamp
	, q.SPID
	, q.UserName
	, q.Database_Name
	, q.EventDetails
FROM #Queries q
ORDER BY TimeStamp DESC 

Hope It Helps

Let me know how this helps.  If anything can be done to make my approach or my scripts better, I’m always open to improvement.

 

I/O Baseline Comparison – Idera Diagnostic Manager

What servers need the most attention when it comes to I/O?  When I increase the memory on a server what effect does it have on I/O?  What was it before the change and what was it afterwards?

These are tough questions, and I have a script that I hope will help you out.

Most monitoring tools, including Idera, are great at looking at what a single server is doing right now, but not so much with comparing performance and loads between multiple larger windows of time and seeing which servers are doing the most work.  So I wrote a script to see how much I/O load I’m doing on each server and compare that to how much I did in that same time window in previous weeks.

Your servers will vary, a lot.  They’ll vary from my servers, and they’ll vary from each other as well.  Depending on the workload, specifically the difference between OLTP and OLAP, higher I/O may be expected and additional memory will have less of an impact.  For OLAP, if you have 32 GB of memory and read a 64 GB table followed by a different 92 GB table then it’s all physical reads.  If you up your memory to 48 GB, they’re still all physical reads.  The cost to up this server (or VM) is the same as the cost to upgrade an OLTP server from 32 GB to 48 GB, but the story will end quite a bit differently there.

The problem is that this change, like any other change, should be measured before, measured afterwards, then have the costs of the change analyzed and justified.  The results you’ll see below coupled with my Wait Stats Baseline Comparison using Idera post will help you figure out where to make a change like that and how to measure it.

There are other reasons you’d want to look at this data as well.  Even a new index that dramatically dropped the physical reads for one query could have the opposite effect on your server as a whole.  The point is, even something very straightforward needs to be verified that it’s doing what you expect it to.  Looking at the details you see in the canned reports is great, but so is the ability to stand back and look at the big picture that I’ll get into here.

The solutions I’ve put into place using this type of query to varied from query tuning, timing of larger processes, adding memory like I mentioned above, buying faster disks, and many other options including acknowledging we were doing a ton of reads and leaving it be.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it:

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, but pulling this data will just make in more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll tack them on here and give you credit for writing them.  I can only write against what I can test against, it’s nothing personal.

I/O Differences Script

What I look for are the true apples-to-apples comparisons, so I’m typically looking at a longer time range (I like 24 hours) during peak hours (12:00 to 21:00 UTC on weekdays for me, yours will differ, even server-to-server) comparing this week to the last couple weeks.

DECLARE @StartTime DateTime
	, @EndTime DateTime
	, @InstanceName sysname
	, @Weekdays bit
	, @BusinessHours bit
		
SET @EndTime = GetUTCDate()
SET @StartTime = DateAdd(Hour, -24, @EndTime)
SET @InstanceName = NULL --Do 'Server\Instance' for individual server
SET @Weekdays = 1
SET @BusinessHours = 1

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC


SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

The Results

This is, by default, going to show you your biggest I/O users that are being monitored.  It’s great to know, but it’s not the whole picture.  Some servers are going to be almost pure OLTP, others are going to be almost pure OLAP, and most will be some kind of mix.  The more OLAP you do, the more physical reads (Reads_GB and ReadAhead_GB) should be expected.  It’s really not a bad thing, it’s how your databases are used.

Then there’s the realistic version of the users’ expectation (not always the version they tell you) to consider.  Are the queries coming back as fast as they should?  Large annual reports can take a bit, but a website that takes more than a couple seconds isn’t going to be used.

This report shows the load you’re throwing at the disks, but user expectations are measured in seconds, not gigabytes.  Yeah, more gigabytes means more seconds, but it’s not like there’s a solid ratio.

That being said, the true value of this information is when you combine it with the type of server it is, the reasonable expectations of the users, the amount of time the server is waiting on I/O, and so much more.  To start you out, look at the Wait Stats Baseline Comparison using Idera, too.

If there are OLTP servers that stand out as doing a ton of physical reads and the wait stats show that it’s an issue as well, look into more memory and more time for tuning.  Take it a step further and use all of the trending info (not just this report) to split out I/O load and plan for faster disks.  There’s a lot of potential to use this info to do a lot of good.

My Idera Scripts

I keep all the scripts that query Idera Diagnostic Manager in one place, on my Idera Diagnostic Manager Scripts page.  Before this script had it’s own post, it was posted there.  As I make posts for each script, that page will live on as a directory of what scripts I have to get more use out of this product.

This is the first time I’m doing something like this, but I’ll eventually make other pages that pull together topics like this as well.  It feels good to find a great way to take blogging to the next level like this.

Speaking of taking it to the next level, let me know your thoughts on this script, others you’d like to see, and anything else in the comments below.

Wait Stats Baseline Comparison – Idera Diagnostic Manager

Database servers have to wait on different resources, and these waits are huge to the performance of SQL Server.  Sometimes something changes without our knowledge and is running differently than normal, this is the first place you look to see what’s going on.  On the other side, you may have made a change that could have been anything from increasing memory, tuning a query, moving to different hardware, changing indexes, etc..  All of these things should reduce waits, but how effective were they?

The trick is to know what things looked like before so you have something to compare it to.  If you know what it looked like before then you’ll know how much of a difference you made.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll make a write up on them and give you credit for being awesome supporting your product like that.  I can only write against what I can test against, it’s nothing personal.

The Script

There are 3 parameters at the start, and that’s typically all I ever change.  A couple things stick out with these parameters.  The times are in UTC, so I default to GetUTCDate().  I want the time range to be at least an hour so I’m not letting a single query that ran skew the results too much.  Then I’m also converting my instance name to upper case because Idera DM may very well be your only database that’s case sensitive, so I cheat by putting everything in upper case.

My upper case trick isn’t a recommended trick.  It causes scans and possible bad joins.  In this case I know it’s against small tables and that I don’t have two servers with the same name using different cases.  It’s safe here, but not everywhere.

DECLARE @StartTime DateTime
	, @EndTime DateTime
	, @InstanceName sysname
		
SET @EndTime = GetUTCDate()
SET @StartTime = DateAdd(Hour, -24, @EndTime)
SET @InstanceName = UPPER('Server\Instance')

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

IF OBJECT_ID('tempdb..#Now') IS NOT NULL BEGIN
    DROP TABLE #Now
END

IF OBJECT_ID('tempdb..#Yesterday') IS NOT NULL BEGIN
    DROP TABLE #Yesterday
END

IF OBJECT_ID('tempdb..#LastWeek') IS NOT NULL BEGIN
    DROP TABLE #LastWeek
END

IF OBJECT_ID('tempdb..#TwoWeeks') IS NOT NULL BEGIN
    DROP TABLE #TwoWeeks
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 

--Values taken from Paul Randal of SQLskills
--http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
INSERT INTO #IgnoredWaits 
SELECT N'BROKER_EVENTHANDLER'            
UNION SELECT N'BROKER_RECEIVE_WAITFOR'
UNION SELECT N'BROKER_TASK_STOP'               
UNION SELECT N'BROKER_TO_FLUSH'
UNION SELECT N'BROKER_TRANSMITTER'             
UNION SELECT N'CHECKPOINT_QUEUE'
UNION SELECT N'CHKPT'                          
UNION SELECT N'CLR_AUTO_EVENT'
UNION SELECT N'CLR_MANUAL_EVENT'               
UNION SELECT N'CLR_SEMAPHORE'
UNION SELECT N'DBMIRROR_DBM_EVENT'             
UNION SELECT N'DBMIRROR_EVENTS_QUEUE'
UNION SELECT N'DBMIRROR_WORKER_QUEUE'          
UNION SELECT N'DBMIRRORING_CMD'
UNION SELECT N'DIRTY_PAGE_POLL'                
UNION SELECT N'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT N'EXECSYNC'                       
UNION SELECT N'FSAGENT'
UNION SELECT N'FT_IFTS_SCHEDULER_IDLE_WAIT'    
UNION SELECT N'FT_IFTSHC_MUTEX'
UNION SELECT N'HADR_CLUSAPI_CALL'              
UNION SELECT N'HADR_FILESTREAM_IOMGR_IOCOMPLETION'
UNION SELECT N'HADR_LOGCAPTURE_WAIT'           
UNION SELECT N'HADR_NOTIFICATION_DEQUEUE'
UNION SELECT N'HADR_TIMER_TASK'                
UNION SELECT N'HADR_WORK_QUEUE'
UNION SELECT N'KSOURCE_WAKEUP'                 
UNION SELECT N'LAZYWRITER_SLEEP'
UNION SELECT N'LOGMGR_QUEUE'                   
UNION SELECT N'ONDEMAND_TASK_QUEUE'
UNION SELECT N'PWAIT_ALL_COMPONENTS_INITIALIZED'
UNION SELECT N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP'
UNION SELECT N'QDS_SHUTDOWN_QUEUE'
UNION SELECT N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP'
UNION SELECT N'REQUEST_FOR_DEADLOCK_SEARCH'    
UNION SELECT N'RESOURCE_QUEUE'
UNION SELECT N'SERVER_IDLE_CHECK'              
UNION SELECT N'SLEEP_BPOOL_FLUSH'
UNION SELECT N'SLEEP_DBSTARTUP'                
UNION SELECT N'SLEEP_DCOMSTARTUP'
UNION SELECT N'SLEEP_MASTERDBREADY'            
UNION SELECT N'SLEEP_MASTERMDREADY'
UNION SELECT N'SLEEP_MASTERUPGRADED'           
UNION SELECT N'SLEEP_MSDBSTARTUP'
UNION SELECT N'SLEEP_SYSTEMTASK'               
UNION SELECT N'SLEEP_TASK'
UNION SELECT N'SLEEP_TEMPDBSTARTUP'            
UNION SELECT N'SNI_HTTP_ACCEPT'
UNION SELECT N'SP_SERVER_DIAGNOSTICS_SLEEP'    
UNION SELECT N'SQLTRACE_BUFFER_FLUSH'
UNION SELECT N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'
UNION SELECT N'SQLTRACE_WAIT_ENTRIES'          
UNION SELECT N'WAIT_FOR_RESULTS'
UNION SELECT N'WAITFOR'                        
UNION SELECT N'WAITFOR_TASKSHUTDOWN'
UNION SELECT N'WAIT_XTP_HOST_WAIT'             
UNION SELECT N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG'
UNION SELECT N'WAIT_XTP_CKPT_CLOSE'            
UNION SELECT N'XE_DISPATCHER_JOIN'
UNION SELECT N'XE_DISPATCHER_WAIT'             
UNION SELECT N'XE_TIMER_EVENT'

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #Now
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 1
	, @EndTime = @EndTime - 1

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #Yesterday
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 6
	, @EndTime = @EndTime - 6

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #LastWeek
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 7
	, @EndTime = @EndTime - 7

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #TwoWeeks
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT TOP 10 n.InstanceName
	, n.WaitType
	, Now_Minutes = n.WaitTime_Minutes
	, Yesterday_Minutes = d1.WaitTime_Minutes
	, OneWeek_Minutes = d7.WaitTime_Minutes
	, TwoWeeks_Minutes = d14.WaitTime_Minutes  
FROM #Now n
	LEFT JOIN #Yesterday d1 ON n.WaitType = d1.WaitType
	LEFT JOIN #LastWeek d7 ON n.WaitType = d7.WaitType
	LEFT JOIN #TwoWeeks d14 ON n.WaitType = d14.WaitType 
ORDER BY n.WaitTime_Minutes DESC

The Results

The results pretty much speak for themselves.  What were the top 10 things you waited on recently and how much did you wait on them in the past.  It’s great to have a view of it where you’re comparing totals side-by-side.  Seeing the same thing in charts just doesn’t have the same effect for me.

These numbers are going to have natural variances to them.  Did a user run a couple large queries during one interval but not another?  Was the shared storage under different loads?  Really anything can come up.  The longer the period the more you smooth out these numbers.

So, the real question is what do these waits mean?  I’m not going to pretend like I can cover all the wait stats, their meanings, and how to affect them in this post.  What I can say is that there are good resources out there for you.

Paul Randal (b|t) has the best source I know of on his post Wait Statistics, or please tell me where it hurts.

CXPACKET

One wait type I like to talk about, and discredit a bit, is CXPACKET.  This wait type is just saying you used parallelism, nothing more.  I first covered this topic in my post Why worry about CXPACKET.

Think of it in terms of a manager who delegates a large project to multiple employees.  As the employees are hard at work accomplishing the task, the manager goes about their day and records their time as CXPACKET.  The task is typically completed in less overall duration because it was a group effort, but the total amount of work done goes up some because there’s overhead and small duplicated tasks when you delegate to a group.

That less overall duration could make or break your application, so the idea of having CXPACKET isn’t necessarily bad (setting MAXDOP = 1 is rarely a good idea).  However, a task being so large that it needed to be delegated to multiple employees may mean it’s not efficient, making query and index tuning a good approach.  Also, your manager may be too eager to delegate their tasks, so possibly adjust your Cost Threshold for Parallelism as suggested in Jeremiah Peschka’s (b|t) post Five SQL Server Settings to Change.

As I said before, I can’t cover all the wait types here, so go to Paul Randal’s post I suggested above for a better list.

What I Skipped

This is looking specifically at WaitTimeInMilliseconds, which is one of two technically correct things to do.  It does not look at ResourceWaitTimeInMilliseconds, which is the other technically correct thing to do.  I wrote it one way and was happy with the results so I didn’t change it or clutter it with returning both.

Here’s the difference so you can decide if you want to change it or talk me into changing mine.  It’s a conversation I’d enjoy…I’m strange like that.

ResourceWaitTimeInMilliseconds is how long you’re waiting on your actual resource.  If you’re waiting for a lock, this is how long you waited for that lock to be made so you could get back in line for CPU time.

WaitTimeInMilliseconds also includes the time from when you get in line for CPU time until you’re using the CPU.  This additional time is called your signal wait time.

The reason I like using WaitTimeInMilliseconds is that the reason you gave up the CPU and had to get back in line is due to the fact you weren’t ready due to what you were waiting on.  In that way, it’s the whole picture.

The argument going the other way is that you were only waiting for the resource.  It’s not the resource’s fault there was a line all queued up at the CPU when it got there.  Why should I tack on extra time on a non-CPU wait for that?

I’m undecided on this part, but having the info both ways allows you to figure out the signal wait time on your entire instance and see if you have too much CPU contention.  There’s no arguing about that being a good thing to know.

Steal my query, make it your own, and write that part the way that works best for you.

When Idera Finds Out What I’m Doing….

Someone asked me for queries pulling from Diagnostic Manager, and I posted on Twitter when I made my Idera Diagnostic Manager Scripts page.  The outcry and rage from this was…very positive.

Vicky Harp (t), who is now Idera’s Corporate Strategist wrote pieces of Diagnostic Manager and its schema, took time out of her day to meet with me and talk about the details of querying the data they collect.

Scott Stone (t) who is Diagnostic Manager’s Product Manager also jumped in very quickly to see if I’d like to be a beta tester for them, bringing real power users into the loop to make their products better.

This is the level of support you want from your monitoring software.

While I don’t officially endorse or recommend any specific software at this time, I do recommend you look for responses like this.

 

Database Log VLFs

Virtual Log Files (VLFs) split a physical database log file into smaller segments, which are required for how log files work in the background.  These are created automatically, but automatically doesn’t always mean perfect.  Here’s a practical view of what you need to know and how you can set them up properly.

What VLFs Do

My initial view of how a transaction log worked was that each change was written to a log file, and, as the documentation of those changes was no longer required, it was deleted from the log file.  I think this is a common way to view it, and it’s close to the truth.  However, it’s not close enough for a conversation on VLFs.

What really happens is that changes are written to the first VLF, which is just a segment of the log file.  When that VLF fills up, it moves on to the next VLF in a systematic order.

Each change made to the database is assigned a Log Sequence Number (LSN), and SQL Server keeps track of the oldest LSN it still needs for any purpose.  This purpose can be many things, but is typically the oldest LSN of the following:

  • Oldest LSN on a log backup (full or bulk-logged recovery)
  • LSN at the start of the oldest active transaction
  • LSN last replicated when using certain types of replication

Here’s my extremely complicated script to tell you what that reason is for your database:

SELECT name
	, log_reuse_wait_desc 
FROM sys.databases

If the last LSN in a VLF is older than what you need to keep, that VLF can be reused.  This makes is very efficient because this becomes something like a set-based operation for SQL Server, the entire VLF is cleared and marked for reuse at once instead of line-by-line.

To help you picture how the VLFs work, someone at Microsoft drew you a picture on Transaction Log Physical Architecture.  Then you can look at the results of DBCC LogInfo, and it will make a lot more sense when you see a VLF on each line there along with its status.

Only One Log File, Dedicated Drive

Adding more than one log file on a single database isn’t going to help your performance.  SQL Server is writing to a single VLF at a time, and that single VLF is part of a single log file.  If you have multiple log files then you’re writing to one while the other sits idle.  Once you fill up the last VLF on the first file it starts writing to the second file while the first sits idle.  If these are on separate drives that means each drive has to be able to handle the I/O, but it’s hit or miss if you’re using the I/O which leads to wasted resources and inconsistent performance.

As for how you’re writing to the log, most of the work done is writing to the tail end of it so spinning disks do really well just keeping the head in one place and writing to the file.  That is unless you have other types of files on this disk so the head writes a little to the log, jumps over to write a little to a data file, writes a bit more to the log, then jumps over to read a couple indexes.  This is a lot of where the recommendation to keep your data and log files on separate disks come from, but there is more to it than I’ll get into here.

However, we’re going to use RAID for redundancy making the drives too large for just a log, then we’ll put it on a SAN with a write cache, and do so many other things to make it more complicated.  If a server is extremely stressed and highly critical, look into dedicated spindles, SSDs, or other options.  On general shared disk arrays, it’s nice if you can have an array dedicated to just log files, even if it’s log files for multiple servers.

VLF Size Matters

When you’re writing to VLFs there can be two issues.  First, the VLFs are very small and SQL Server is jumping all over the place maintaining small files and figuring out where it should write the next entry – this is common and a big performance hit.  Second, the VLFs are too large and SQL Server is reserving a ton of space when only a small part of the tail end of the VLF is in use, then trying to clear it all at once – this is less common and less of an issue.

In addition to this, VLFs each need to be processed when restoring or recovering a database.  Recovering is part of restarting SQL services, so you should be hitting this issue at least once a month with your Windows Updates.  Every VLF adds some overhead to this process, and a huge number adds a lot of overhead to lengthen this process.

Details of this are on the Microsoft Customer Service and Support blog post How a log file structure can affect database recovery time, and it includes this eye-opening quote, “The first phase of recovering a database is called discovery where all the VLFs are scanned (in serial and single threaded fashion) before actual recovery starts.”

Are your log files pre-sized to take up the entire drive even though they only ever use 5% of that space?  Is that hurting you on recovery?

How Big Are My VLFs?

I have a script for that.  This is set to filter only logs that I want to look at, but you can comment out there WHERE clause on the final statement to see it all.  Also, it would be extremely rare for me to look at individual VLFs, so this is only looking at the sums and averages for each file.

I used to manually comment out a line on this depending on the version of SQL Server until I read a post on VLFs by Andy Galbraith (b|t) Counting Your VLFs, or, Temp Tables Inside IF…ELSE Blocks.  Thank you, Andy.  The timing of your post was perfect to help me make this easier for everyone.

CREATE TABLE #VLF_temp 
(
	RecoveryUnitID int
	, FileID varchar(3) 
	, FileSize numeric(20,0)
	, StartOffset bigint 
	, FSeqNo bigint 
	, Status char(1)
	, Parity varchar(4) 
	, CreateLSN numeric(25,0)
)

CREATE TABLE #VLF_db_total_temp
(
	DatabaseName sysname 
	, LogiFilename sysname
	, PhysFileName sysname
	, AVG_VLF_Size_MB DECIMAL(12,2)
	, vlf_count int
	, log_size_mb FLOAT
	, log_growth_mb FLOAT
)

IF (SELECT LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10') BEGIN
	ALTER TABLE #VLF_temp DROP COLUMN RecoveryUnitID
END

DECLARE db_cursor CURSOR READ_ONLY FOR 
SELECT name FROM sys.databases
WHERE State = 0

DECLARE @name sysname, @stmt varchar(40)

OPEN db_cursor

FETCH NEXT FROM db_cursor INTO @name

WHILE (@@fetch_status <> -1) BEGIN
	IF (@@fetch_status <> -2) BEGIN
		INSERT INTO #VLF_temp
		EXEC ('DBCC LOGINFO ([' + @name + ']) WITH NO_INFOMSGS')

		INSERT INTO #VLF_db_total_temp (DatabaseName, LogiFilename, PhysFileName, vlf_count, AVG_VLF_Size_MB, log_size_mb, log_growth_mb)
		SELECT DatabaseName = @name
			, LogiFilename = mf.name
			, PhysFileName = mf.physical_name
			, vlf_count = COUNT(*)
			, AVG_VLF_Size_MB = (AVG(FileSize)/1024)/1024
			, log_size_mb = (mf.size * 8)/1024 
			, log_growth_mb = CASE mf.is_percent_growth
									WHEN 1 THEN (mf.size * 8)/1024 * mf.growth/100
									WHEN 0 THEN  (mf.growth * 8)/1024 
									END
		FROM #VLF_temp vt
			INNER JOIN sys.master_files mf ON mf.database_id = db_id(@name) AND mf.file_id = vt.fileid
		GROUP BY mf.name, mf.physical_name,mf.size,mf.growth,mf.is_percent_growth;           

		TRUNCATE TABLE #VLF_temp           
	END

	FETCH NEXT FROM db_cursor INTO @name
END

CLOSE db_cursor
DEALLOCATE db_cursor

SELECT * 
	, USEDB = 'USE [' + DatabaseName + ']' 
	, CP = 'CHECKPOINT'
	, ShrinkZero = 'DBCC SHRINKFILE (N''' + LogiFileName + ''' , 0)'
	, GrowBack = 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', SIZE = ' + CAST(log_size_mb AS VARCHAR(20)) + 'MB ) --I split this up when over 10,000 MB'
	, ChangeGrowth = CASE --Note, never gets up to the 8000 MB I recommend for manual growths, but still 250 MB VLFs
						WHEN DatabaseName = 'tempdb' THEN ''
						WHEN log_size_mb <= 128 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 32MB )'
						WHEN log_size_mb <= 512 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 128MB )'
						WHEN log_size_mb <= 4000 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						WHEN log_size_mb <= 16000 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						ELSE 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						END
FROM #VLF_db_total_temp
WHERE (vlf_count > 50 
		AND Avg_VLF_Size_MB < vlf_count / 3) 
	OR (vlf_count > 250
		AND AVG_VLF_Size_MB < vlf_count / 2)
ORDER BY vlf_count DESC 

DROP TABLE #VLF_temp 
DROP TABLE #VLF_db_total_temp

What’s the Right Size?

The most common thing you’ll see on the internet is people talking about having too many with the limit being 1,000 VLFs and ideally at 50 VLFs.  I’m a little different and focus more on size than number.  The concept I’m going for is that excessive context switching is what causes me grief, and a reasonable size VLF will keep that under control.

Now there’s no magic number that will work noticeably better in all cases, so I won’t give you static numbers and say they’re magic.  What I can say is that too small is basically fragmentation with the overhead of context switching, too many increases recovery time, too big is huge chucks, and too few could cause extra growths as unused space is still being reserved.

It’s easy to visualize how smaller and excessive numbers of VLFs can cause issues, but that doesn’t mean huge VLFs are the way to go.  Tammy Richter Jones (b) gets into details with her post If > 50 VLFs are bad, then 16 is good, right?.  I’d love to see more on the affects of huge VLFs to have a more complete understanding if a 1 TB log file is better off with 50 20GB VLFs or 2,000 500MB VLFs.  I’m not convinced that >50 is bad.

Personally, I like formulas a lot more than static numbers, especially since they allow me to avoid extremes without stressing out.  The formulas I used broke it down into three categories:

  1. Below 50 VLFs, this is a reasonable number, they get a free ride no matter how big they are.
  2. Between 50 and 250 VLFs, the average size in MB has to be at least 1/3 of the count of VLFs.
    • For a file with 100 VLFs, the average VLF has to be over 33 MB.
  3. Over 250 VLFs, the average size in MB has to be at least 1/2 of the count of VLFs.
    • For a file with 500 VLFs, the average VLF has to be over 250 MB.

This is good enough for almost any database.  I don’t care about there being a couple small VLFs mixed in there because I’ll spend most of my time in the large ones.  I don’t care about the numbers getting larger because the average VLF is large enough to avoid too much context switching.  You’re not out to micromanage, so take a relaxed approach to this unless you have a reason to not be relaxed.

What’s Default?

Remember how I said there were no magic numbers?  Well, that doesn’t mean there aren’t wrong numbers.

By default, every database is based off of model, which has a 1 MB log files growing in 10% increments.  So 1/10th of a MB for the first growth, which is just a tiny fragment.  If you leave it this way on large databases you could end up with a 1 TB log trying to grow 100 GB at a time, which your users will notice.

Be default, you get the worst of both worlds!  The goal here is to avoid extremes, so you’ll want to change the growth rates away from the default.

How Do I Change the Size?

There’s only one way to change the size of your VLFs.  Delete them and recreate them.

For every database EXCEPT TempDB, this means shrinking the logs (it deletes VLFs to get to the size you want), then growing the logs again.  Both manual and automatic growths will split the new physical space into VLFs, but that depends on your version of SQL Server.

Here are the growth rates I pulled from Paul Randal’s post Important change to VLF creation algorithm in SQL Server 2014.

SQL 2012 and earlier uses this formula:

<= 64 MB growth is 4 VLFs

64 MB and <= 1 GB = 8 VLFs

1 GB = 16 VLFs

SQL 2014+ uses this formula:

Is the growth size less than 1/8 the size of the current log size?

Yes: create 1 new VLF equal to the growth size

No: use the formula above

So if you have too many VLFs you can shrink your log file.  It will delete VLFs that are not in use to make it smaller.  Then grow the file in increments that give you VLFs sized how you want them.  If you’re going for 500 MB VLFs then you grow your file 8,000 MB at a time.

The VLFs that weren’t deleted in this process, because they were in use or you didn’t shrink the file as far as you could have, will not be affected.  This means you’ll have different sized VLFs throughout your file, but that doesn’t really matter.  What does matter is that you don’t have any ridiculously large VLFs and you’re spending most of your time in properly sized VLFs.

Here’s how I do it.  Shrink it as much as possible.  Shrinking is never as easy as it should be, so verify it shrunk it a significant amount.  If it didn’t, take log backups and try again, repeating this process a handful of times if needed.  Then, if I want the log to be 32,000 MB, I just grow it by 8,000 MB 4 times.

--Run a log backup before this if database is not in simple and log backups run less frequently than every 5 minutes.
USE [DatabaseName]
GO
CHECKPOINT
GO
DBCC SHRINKFILE (N'LogicalFileName_Log' , 0)
GO
USE [master]
GO
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 8000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 16000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 24000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 32000MB )
GO

Assuming you were able to shrink this down to 10 VLFs remaining and you want it to be 32,000 MB, this would give you 74 VLFs with all new ones being 500 MB.  It goes over the 50 VLF limit I’ve seen elsewhere, but I’m very happy with these sizes and numbers.

With 8,000 MB growths the new 2014 calculation won’t kick in until the log was already 64,000 MB, and at that point an 8,000 MB VLF probably isn’t a bad thing.

Autogrowth Rates

I recommend manually growing larger log files between 8,000 and 16,000 MB at a time, but not autogrowths.  The process that causes a file to grow, and any other process that needs that space, will sit around and wait for an autogrowth to complete.  That could take a minute on descent disks, which is longer than many timeouts I’ve worked with.

Manually growing files is best, but almost every database will rely on autogrowth.  Know your timeouts, know how long it takes to grow a file, and size it accordingly.  I like to do 1,000 MB here, but will do as less if I have to.  If you can’t make a descent sized growth based on this, be more paranoid about manually growing files while leaving autogrowth set to smaller sizes.

The reason for 1,000 MB is that this is the largest size that will still give you 8 VLFs instead of being split into 16 VLFs, with the actual limit being at 1,024 MB.  If you did 2,000 MB, you’d still end up with 125 MB VLFs.  You can get that up to 250 MB each at 4,000 MB, but you’re telling your users to wait while 4,000 MB is zeroed out before they can move on.  Although the calculation changes for larger logs on SQL 2014, I’d still stick with 1,000 MB for a good autogrowth size.

Note, there is a resolved bug that would cause issues when log files grew in 4 GB increments, so you’ll see a lot of people using 8,000 MB instead of 8,192 MB.  We aren’t using an effected patch level anymore, but DBAs are made to be paranoid.  We’re not just lazy at math, although this doesn’t rule out that possibility.

TempDB

TempDB is different.  The database is recreated every time the SQL service restarts, so the log is recreated as well, following the formulas above.  If the log is over 1 GB, it starts out with 16 VLFs.  Jonathan Kehayias got into the details in his post TSQL Tuesday #11 – Misconceptions – The TempDB Log File and VLF Counts.

So TempDB never has too many on startup, but is this too few?  What if TempDB’s log is 40 GB, do you want a 2.5 GB VLF?  Is setting up TempDB to have 5 log files that are 8,000 MB each so it starts up with 500 MB VLFs a better idea?

Seriously, someone answer this one, I’m curious myself.

Changes Have Risk

The script above creates scripts to make changes.  It’s not perfectly safe to make those changes.  Test it, understand it, and be careful running it.

One of the most important pieces is that the scripts it generates will try to shrink the log as much as possible before growing it again.  In terms of cleaning up VLFs, this is perfect.  In terms of trying to run this during the day, it risks transactions failing because they’re trying to grab the next VLF as you’re running a shrink job to delete all unused VLFs.

You might be able to get away with this during the day most of the time, but it’s best to wait for a maintenance window to be safe.

What Others Did

Linchi Shea showed performance degradation from having too many small VLFs on Performance impact: a large number of virtual log files – Part I.  Updates can take almost 10x as long, and I would assume its due to the size and not number of VLFs.

SQLskills.com has blog posts that read like a series on this.  Kimberly Tripp wrote 8 Steps to better Transaction Log throughput, then followed that up with Transaction Log VLFs too many or too few?, which was updated to link to Paul Randal’s post Important change to VLF creation algorithm in SQL Server 2014.  All are must-reads if you want to really understand this topic.

Tony Rogerson (b|t) at the time of this writing has only put one post on his new blog, Transaction Log Concepts: Part 1.  If this is his typical quality of work, we have a lot to look forward to.  I’m personally looking forward to the other 3 parts of this 4 part series.

Grahm Kent (b|t) ran tests on the performance differences between two log files in his post Slow recovery times and slow performance due to large numbers of Virtual Log Files.  Both were 5 GB, but one was 16 VLFs (320 MB each) while the other was 20,480 VLFs (0.25 MB each).  It was good to see transaction performance as well as recovery performance covered.

Test It

If your database is absolutely critical to the point that squeezing an extra couple milliseconds of performance throughout the day would be noticed or if a server coming back online 15 seconds faster would save thousands of dollars, test it.  See how it performs in load tests and service restarts using your test servers and your database.

If you have a 1 TB log file and verified it needs to be about that big, what happens when you size it so you have 2,000 VLFs that are 500 MB each?  What happens when you have 100 VLFs that are 10 GB each?  Did it change recovery, any replication you’re doing, backup times, etc.?

Know that this can make a difference and that this is just another configuration change you can test out.  Crazy numbers on either side can be bad, and perfect numbers in the middle aren’t going to come from some stranger on the internet that never saw your server.

Querying Deadlocks From System_Health XEvent

It’s easy to query XEvents to see some of the basic info in deadlocks, including the system_health session which is already capturing this information by default.

This script will, by default, read from system_health.  However, change the name of one parameter at the start to have it read from any active session.  I pull the file name automatically, so the session has to be active to avoid many changes to the script.

The Script

DECLARE @SessionName SysName 

SELECT @SessionName = 'system_health'

/* 
SELECT  Session_Name = s.name, s.blocked_event_fire_time, s.dropped_buffer_count, s.dropped_event_count, s.pending_buffers
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE target_name = 'event_file'
--*/

IF OBJECT_ID('tempdb..#Events') IS NOT NULL BEGIN
	DROP TABLE #Events
END

DECLARE @Target_File NVarChar(1000)
	, @Target_Dir NVarChar(1000)
	, @Target_File_WildCard NVarChar(1000)

SELECT @Target_File = CAST(t.target_data as XML).value('EventFileTarget[1]/File[1]/@name', 'NVARCHAR(256)')
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE s.name = @SessionName
	AND t.target_name = 'event_file'

SELECT @Target_Dir = LEFT(@Target_File, Len(@Target_File) - CHARINDEX('\', REVERSE(@Target_File))) 

SELECT @Target_File_WildCard = @Target_Dir + '\'  + @SessionName + '_*.xel'

--Keep this as a separate table because it's called twice in the next query.  You don't want this running twice.
SELECT DeadlockGraph = CAST(event_data AS XML)
	, DeadlockID = Row_Number() OVER(ORDER BY file_name, file_offset)
INTO #Events
FROM sys.fn_xe_file_target_read_file(@Target_File_WildCard, null, null, null) AS F
WHERE event_data like '<event name="xml_deadlock_report%'

;WITH Victims AS
(
	SELECT VictimID = Deadlock.Victims.value('@id', 'varchar(50)')
		, e.DeadlockID 
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/victim-list/victimProcess') as Deadlock(Victims)
)
, DeadlockObjects AS
(
	SELECT DISTINCT e.DeadlockID
		, ObjectName = Deadlock.Resources.value('@objectname', 'nvarchar(256)')
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/resource-list/*') as Deadlock(Resources)
)
SELECT *
FROM
(
	SELECT e.DeadlockID
		, TransactionTime = Deadlock.Process.value('@lasttranstarted', 'datetime')
		, DeadlockGraph
		, DeadlockObjects = substring((SELECT (', ' + o.ObjectName)
							FROM DeadlockObjects o
							WHERE o.DeadlockID = e.DeadlockID
							ORDER BY o.ObjectName
							FOR XML PATH ('')
							), 3, 4000)
		, Victim = CASE WHEN v.VictimID IS NOT NULL 
							THEN 1 
						ELSE 0 
						END
		, SPID = Deadlock.Process.value('@spid', 'int')
		, ProcedureName = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)')
		, LockMode = Deadlock.Process.value('@lockMode', 'char(1)')
		, Code = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)')
		, ClientApp = CASE LEFT(Deadlock.Process.value('@clientapp', 'varchar(100)'), 29)
						WHEN 'SQLAgent - TSQL JobStep (Job '
							THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(Deadlock.Process.value('@clientapp', 'varchar(100)'),32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(Deadlock.Process.value('@clientapp', 'varchar(100)'), 67, len(Deadlock.Process.value('@clientapp', 'varchar(100)'))-67)
						ELSE Deadlock.Process.value('@clientapp', 'varchar(100)')
						END 
		, HostName = Deadlock.Process.value('@hostname', 'varchar(20)')
		, LoginName = Deadlock.Process.value('@loginname', 'varchar(20)')
		, InputBuffer = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/process-list/process') as Deadlock(Process)
		LEFT JOIN Victims v ON v.DeadlockID = e.DeadlockID AND v.VictimID = Deadlock.Process.value('@id', 'varchar(50)')
) X --In a subquery to make filtering easier (use column names, not XML parsing), no other reason
ORDER BY DeadlockID DESC

System_health session

The system_health XEvent by default is turned on, captures deadlocks, and is limited to four 5MB files.  You can change all three of those, but it’s not common for people to do that.

Yes, you’re limited to more recent events, but the information is there.  If you find the limitations are too much, you have a couple options that should be considered in this order.

  1. Is anything else capturing deadlocks?  For me, I have Idera Diagnostic Manager capturing deadlocks as well for my production servers.  This is my main source of looking at deadlocks, so I actually rarely use system_health for this purpose in production.  The first goal is to avoid doing anything additional.
  2. Consider setting system_health to use bigger files (not too big) or have more rollover files. The more you do the more resources it will take to read it all at once, so don’t go crazy.  However, if you have 5 days of data when you wish you had 7, it’s not horrible to double the size of data you retain.  The second goal is to increase disk usage before you add more overhead.
  3. Consider a separate XEvent session.  Ugh…I wish I didn’t have to throw this out there as an option.  Love your servers, don’t add more overhead if you don’t have to.  The last resort is to not end up here at your last resort.

Idera Diagnostic Manager

At the time of me writing this, I have a script to look at all of the deadlocks in Idera DM that I use quite regularly for my production servers.  However, they just came out with a new major version and I want to make sure my query runs there before posting my script publically.  If you don’t care and just want the script, email me.

Viewing the Deadlock

The details of this script give you a great overview of the deadlock.  You know the code involved on each side, the tables and applications involved, etc.   It’s enough to look for trends, know if a deadlock was one you expected, and other basics.

However, this is just a summary list of your recent deadlocks, it’s not the details.  For me, I’m very happy just looking at the XML output to see the details.  This is how I look at it from XEvents and my monitoring software.  That makes me odd, and I’m ok with that.

Jonathan Kehayias (b|t) talks about viewing them in SQL Sentry Plan Explorer Pro in his post Graphically Viewing Extended Events Deadlock Graphs.  There’s no single right answer on how to view deadlocks, so do what works best for you.

Troubleshooting Deadlocks

This topic goes well beyond what I’m going for in this post.  Jonathan Kehayias gets into it some in his post The Accidental DBA (Day 29 of 30): Troubleshooting Deadlocks, which is just a start.  The book he co-authored with Ted Krueger (b|t), Troubleshooting SQL Server – A Guide for the Accidental DBA, is available as a free PDF and has an entire chapter on this subject.

 

Query the Buffer Pool

DBAs are known for asking for more memory, but often can’t say what’s in memory.  While I agree that many database servers can use more memory, I’m a firm believer in knowing how you’re using your resources before asking for more.  The script below allows me to do just that.

What It Returns

This will return every index that is using at least 1 MB of memory for every database on your server.  It also returns space in memory that is associated with unallocated space in the tables which shows up as NULL for everything except the size of the space and the table name.

I’ll warn you now that the unallocated space can be surprisingly high for TempDB, and I talk about that in TempDB Memory Leak?.  Hopefully we can get a good comment thread going on that post to talk through what we’re seeing and how common the issue really is.

The Script

IF OBJECT_ID('TempDB..#BufferSummary') IS NOT NULL BEGIN
	DROP TABLE #BufferSummary
END

IF OBJECT_ID('TempDB..#BufferPool') IS NOT NULL BEGIN
	DROP TABLE #BufferPool
END

CREATE TABLE #BufferPool
(
	Cached_MB Int
	, Database_Name SysName
	, Schema_Name SysName NULL
	, Object_Name SysName NULL
	, Index_ID Int NULL
	, Index_Name SysName NULL
	, Used_MB Int NULL
	, Used_InRow_MB Int NULL
	, Row_Count BigInt NULL
)

SELECT Pages = COUNT(1)
	, allocation_unit_id
	, database_id
INTO #BufferSummary
FROM sys.dm_os_buffer_descriptors 
GROUP BY allocation_unit_id, database_id 
	
DECLARE @DateAdded SmallDateTime  
SELECT @DateAdded = GETDATE()  
  
DECLARE @SQL NVarChar(4000)  
SELECT @SQL = ' USE [?]  
INSERT INTO #BufferPool (
	Cached_MB 
	, Database_Name 
	, Schema_Name 
	, Object_Name 
	, Index_ID 
	, Index_Name 
	, Used_MB 
	, Used_InRow_MB 
	, Row_Count 
	)  
SELECT sum(bd.Pages)/128 
	, DB_Name(bd.database_id)
	, Schema_Name(o.schema_id)
	, o.name
	, p.index_id 
	, ix.Name
	, i.Used_MB
	, i.Used_InRow_MB
	, i.Row_Count     
FROM #BufferSummary AS bd 
	LEFT JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
	LEFT JOIN sys.partitions p ON (au.container_id = p.hobt_id AND au.type in (1,3)) OR (au.container_id = p.partition_id and au.type = 2)
	LEFT JOIN (
		SELECT PS.object_id
			, PS.index_id 
			, Used_MB = SUM(PS.used_page_count) / 128 
			, Used_InRow_MB = SUM(PS.in_row_used_page_count) / 128
			, Used_LOB_MB = SUM(PS.lob_used_page_count) / 128
			, Reserved_MB = SUM(PS.reserved_page_count) / 128
			, Row_Count = SUM(row_count)
		FROM sys.dm_db_partition_stats PS
		GROUP BY PS.object_id
			, PS.index_id
	) i ON p.object_id = i.object_id AND p.index_id = i.index_id
	LEFT JOIN sys.indexes ix ON i.object_id = ix.object_id AND i.index_id = ix.index_id
	LEFT JOIN sys.objects o ON p.object_id = o.object_id
WHERE database_id = db_id()  
GROUP BY bd.database_id   
	, o.schema_id
	, o.name
	, p.index_id
	, ix.Name
	, i.Used_MB
	, i.Used_InRow_MB
	, i.Row_Count     
HAVING SUM(bd.pages) > 128  
ORDER BY 1 DESC;'  

EXEC sp_MSforeachdb @SQL

SELECT Cached_MB 
	, Pct_of_Cache = CAST(Cached_MB * 100.0 / SUM(Cached_MB) OVER () as Dec(20,3))
	, Pct_Index_in_Cache = CAST(Cached_MB * 100.0 / CASE Used_MB WHEN 0 THEN 0.001 ELSE Used_MB END as DEC(20,3))
	, Database_Name 
	, Schema_Name 
	, Object_Name 
	, Index_ID 
	, Index_Name 
	, Used_MB 
	, Used_InRow_MB 
	, Row_Count 
FROM #BufferPool 
ORDER BY Cached_MB DESC

Where’d the Script Come From

I’ve had a script similar to this one around for a while.  It’s originally based off of Jonathan Kehayias’s script on his post Finding What Queries in the Plan Cache Use a Specific Index, and I couldn’t have done this without having his script to start with.

Then I originally posted a version of this script on my post Cleaning Up the Buffer Pool to Increase PLE, which was great to see the index usage for a single database.  It runs slower than this, only returns a single database, and does not show unallocated space in memory.  Those changes warranted either an update to that post or a completely new post…I opted for the latter.

What It Means

Now you can see what’s in your memory. Hopefully you’ll see one or two things that stand out on here that don’t make sense; those are your easy tuning opportunities.

If an index is 100% in cache then you’re scanning on it, and that may be an issue.  Yes, you can find when you did scans on indexes using the scripts in my Indexes – Unused and Duplicates post, but it helps to have another view of what that means in your memory.

One thing the index monitoring scripts in the post I just mentioned can’t do is tell you when you’re doing large seeks as opposed to small seeks.  With the typical phone book example, you could ask for all the names in the phone book where the last names begins with anything from A to Y, giving you 98% of the phone book as results.  Index usage stats will show you did a seek, which sounds efficient.  The script on this post will show that you have 98% of your index in cache immediately after running the query, and that gives you the opportunity to find the issue.

When you see an index that looks out of place here, dive back into the scripts on Cleaning Up the Buffer Pool to Increase PLE to see what’s in cache using that index.  If the query isn’t in cache for any reason, you may be able to look at the last time the index had a scan or seek against it in sys.dm_db_index_usage_stats and compare that to results from an Extended Events session you had running to see what it could have been.

The main point is that you have something to get you started.  You have specific indexes that are in memory, and you can hunt down when and why those indexes are being used that way.  It’s not always going to be easy, but you have a start.

We’re All On a Budget

It’s not too uncommon for this process to end in asking for more memory, and I view memory like being on a budget.  The amount of memory you have right now is your current budget.  Asking for more memory should be viewed like asking for more money in a financial budget.  For a financial budget increase, here are the questions I’d be prepared to answer:

  1. What did you spend the money we already gave you on?
  2. Did you spend that money as efficiently as possible?
  3. What else do you want to spend money on?

Now you can answer these questions in database form:

  1. Here’s what I have in cache at multiple times, specifically right after PLE dropped.
  2. I went through the queries that pulled the data into cache and tuned what I could.
  3. When I checked what’s in cache multiple times, these indexes fluctuated a lot in how much was in there.  I believe adding more memory would allow them to stay in cache instead of kicking each other out to make room.

Be it Virtual or Physical environments, there’s only so much memory that can be made available to us.  We’re on budgets of how much memory the host has, how many memory slots a server has, and how large the memory chips are that those slots can handle.  Prove you’re doing it right and it’s a lot harder to say no to you.

I have an odd habit of getting the memory I ask for because I answer these questions up front in the initial request for memory.

SQL Server Error Log Reader

Reading the SQL Server Error Log is miserable.  It contains very useful information you should address as soon as possible, or at least know that it’s happening.  However, it’s hidden between so many informational messages that it’s hard to find, then it’s spread out between multiple files for every server reboot or automated file rollover event you may have set up.

Many DBAs skim these files, but when there’s a single login failure mixed into log backups running every 5 minutes for 100 databases then they’re just happy to have found something.  That login failure tells you nothing, just that someone should have been more careful typing in their password, right?  When you’re just happy you were even able to find something then you’re almost certainly not going to see it clearly enough to notice a trend, such as that login failure happens every Sunday between 10:00 PM and 10:15 PM.  However, if you knew that then you could tell someone that there’s an automated job that’s failing, it’s obviously part of a bigger process because the time varies a little, but it’s consistent enough to say it’s definitely a process.

So, the trick is to get past the junk and to the useful information.  You can listen to Warner Chaves (b|t) in his Most Important Trace Flags post and turn on trace flag 3226 to stop backup information from going to the logs, but I’m always paranoid (it’s part of the job) that it just may come in useful some day.  I know it never has, but I leave it in there anyways.

Even if you do take out information from the logs like that, it’s still a flat file that’s difficult to analyze for any number of reasons.  We’re just a T-SQL kind of group, and flat files just fall flat.

As with everything in SQL Server, I cheat my way through it.  This time I pull it into a temp table, delete the stuff I’m ignoring (please be very, very careful about what you ignore because you’ll never know it happened), then look at the results.  If there’s a login failure then I’ll uncomment the section that deletes everything except a single error and trends will pop out at me.  If I wanted to do more advanced analysis I would run queries doing aggregates of any kind against the temp table that was created.  Everything’s in the format you’re used to analyzing, so you can do crazy things without going crazy.

DECLARE @dStart DateTime 
	, @dEnd DateTime
	, @MaxLogFiles Int 

SELECT @dStart = GetDate()-30
	, @dEnd = GetDate()-0
	, @MaxLogFiles = 5

--Pulls into #TempLog because an empty log file causes errors in the temp table
--If there are no records, don't pass the issues onto your #Log table and return the results

IF OBJECT_ID('tempdb..#Log') IS NOT NULL BEGIN
	DROP TABLE #Log
END

IF OBJECT_ID('tempdb..#TempLog') IS NOT NULL BEGIN
	DROP TABLE #TempLog
END

CREATE TABLE #Log
(
	LogDate DateTime
	, ProcessInfo NVarChar(50)
	, LogText NVarChar(1000)
)

CREATE TABLE #TempLog
(
	LogDate DateTime
	, ProcessInfo NVarChar(50)
	, LogText NVarChar(1000)
)

DECLARE @Num int
SELECT @Num = 0

WHILE @Num < @MaxLogFiles BEGIN
	TRUNCATE TABLE #TempLog 

	INSERT INTO #TempLog 
	exec xp_readerrorlog @Num, 1, null, null, @dStart, @dEnd

	IF @@ROWCOUNT > 0 BEGIN
		INSERT INTO #Log 
		SELECT *
		FROM #TempLog
	END ELSE BEGIN
		SELECT @Num = @MaxLogFiles
	END
		
	SELECT @Num = @Num + 1
END 

/*
--Uncomment to trend out a specific message and ignore the rest
DELETE #Log
WHERE LogText NOT LIKE 'Login failed for user ''WhatAreYouDoingToMe?!?!?''%'
--*/

--Ignore most of the log file rollover process
--Keep "Attempting to cycle" and "The error log has been reinitialized" if you want to confirm it happened and succeeded
DELETE #Log 
WHERE LogText LIKE '%(c) Microsoft Corporation%'
	OR LogText LIKE 'Logging SQL Server messages in file %'
	OR LogText LIKE 'Authentication mode is MIXED.'
	OR LogText LIKE 'System Manufacturer: %'
	OR LogText LIKE 'Server process ID %'
	OR LogText LIKE 'All rights reserved.'
	OR LogText LIKE 'Default collation: %'
	OR LogText LIKE 'The service account is %'
	OR LogText LIKE 'UTC adjustment: %'
	OR LogText LIKE '(c) 2005 Microsoft Corporation.'--Should I be ignoring this or fixing it?
	OR LogText LIKE 'Microsoft SQL Server % on Windows NT %'
	OR LogText LIKE 'The error log has been reinitialized. See the previous log for older entries.'
	OR LogText LIKE 'Attempting to cycle error log.%'

--Ignore databases being backed up and integrity checks running, assuming you verify this some other way.
--I don't want to complain to try to have these removed because I may need that info someday; today isn't that day.
DELETE #Log
WHERE LogText LIKE 'Log was backed up%'
	OR LogText LIKE 'Database differential changes were backed up%'
	OR LogText LIKE 'Backup database with differential successfully %'
	OR LogText LIKE 'Backup database successfully %'
	OR LogText LIKE 'Database backed up%'
	OR LogText LIKE 'DBCC CHECK% found 0 errors %'
	OR LogText LIKE 'CHECKDB for database % finished without errors %'

--We all have vendor databases...
--Ignore the stuff where it keeps making sure the setting is where the setting was.
DELETE #Log
WHERE LogText LIKE 'Configuration option % changed from 30 to 30.%'
	OR LogText LIKE 'Configuration option % changed from 5 to 5.%'
	OR LogText LIKE 'Setting database option COMPATIBILITY_LEVEL to 100 for database ReportServer%'
	OR LogText LIKE 'Configuration option ''user options'' changed from 0 to 0. Run the RECONFIGURE statement to install.'

--Now your own custom ones
--Just be careful.  You'll rarely read logs without this script once you see how easy it is.
--If you put it on the ignore list, you won't see it again.
--I have starting and stopping traces on mine, because my monitoring software likes to start and stop them a lot
----I'm accepting the risk that I won't see other people starting and stopping traces.
DELETE #Log
WHERE LogText LIKE 'Know what risk you''re taking on by putting stuff in here'
	OR LogText LIKE 'You will rarely read logs without this, so you won''t see these ever again'
	OR LogText LIKE 'DBCC TRACEON 3004,%'
	OR LogText LIKE 'DBCC TRACEON 3014,%'
	OR LogText LIKE 'DBCC TRACEON 3604,%'
	OR LogText LIKE 'DBCC TRACEOFF 3604,%'
	OR LogText LIKE 'DBCC TRACEON 3605,%'
	OR LogText LIKE 'Error: %, Severity:%'--They give the english version next
	OR LogText LIKE 'SQL Trace ID % was started by %'
	OR LogText LIKE 'SQL Trace stopped.%'
	OR LogText LIKE 'Changing the status to % for full-text catalog %'
	OR LogText LIKE 'I/O was resumed on database %'
	OR LogText LIKE 'I/O is frozen on database %' 

/*
--When mirroring gives me trouble it lets me know by flooding the logs
--I uncomment this to see if there were other issues in the middle of all that.
DELETE #Log 
WHERE LogText LIKE 'Database mirroring is inactive for database%'
	OR LogText LIKE 'The mirroring connection to%has timed out%'
	OR LogText LIKE 'Database mirroring is active with database%'
--*/

/*
--This is only useful if you're using the trace flag 1222
--Only show the line that says 'deadlock-list'.  Remove this if you need to see the deadlock details.
--Note, only use this when needed.  It will give you a 1 second blind spot for every deadlock found.
--Why aren't you using extended events anyways?
DELETE L
FROM #Log L
	INNER JOIN #Log L2 ON L.LogDate BETWEEN L2.LogDate AND DateAdd(second, 1, L2.LogDate) AND L.ProcessInfo = L2.ProcessInfo 
WHERE L2.LogText = 'deadlock-list'
	AND L.LogText <> 'deadlock-list'
--*/

SELECT * 
FROM #Log
ORDER BY LogDate DESC

Don’t Ignore Me

Anything you ignore you won’t see here again. It’s still in the logs, but not in what you’re reading on your screen when you mentally check the logs off as being read through.  If you’re ignoring anything make sure it either doesn’t matter or you’re watching for it another way.

Backups are the first thing to be ignored.  Yes, yes, they ran successfully, they do that a lot, don’t tell me about them.  That can be good advice gone horribly wrong.  Do you have another way of saying I absolutely know I have backups taken of everything?

DBCC CheckDB ran successfully is next on the list.  Same thing goes for it, except more DBAs miss verifying that this is running and also miss running it.  If you ignore it, how are you verifying that it ran?

I don’t care how you do it.  Do what works best for you, just do something.

Be Careful

I’ll just end by saying be careful again. This code is a life saver when it’s not shooting you in the foot.

Monitoring Database and Table Sizes

Trending database and table sizes helps give you an idea of what to expect, and, sometimes, points out problems and their root causes. I even go so far as to monitor the disk space, file growth rates, and have a report going out telling me that “according to your current growth trends, you’ll run out of space on this day”. That last part will have to be another post since getting disk space can be kinda tricky depending on how your disks are set up; you need to go outside of SQL Server to grab the size of mountpoints if you use them. Don’t worry about mountpoints right now, if you don’t know the term that usually means that you can swap out the word “drive” and have everything I say be accurate for your environment.

First, this is lightweight stuff, there’s really no reason to capture it more than once a day, and it will only add up to several MB for a year’s worth of data. You’re not going to kill yourself doing this, you’re not going to stare at the data every day taking up your time, but you’re going to look like a heavyweight when you put this into action and reference this data to others.

Second, I do have other similar posts doing snapshots of this same info for Table and File sizes. It’s basically the same thing just different in how you’re using it and if you’ll have a historical view of it.

Database sizes

Lets start with the databases because, well, they’re bigger. I capture it all, the database name, logical file name, file type (row, logs, etc), filegroup, allocated space, used space, max size, growth rate, if it’s percent growth, and the drive/mountpoint it’s on. Other than dividing the appropriate numbers by 1024 twice to store all my sizes in MB, everything gets stored in native format. The reason for that is you have raw data in a format easy for you to verify and easy for any outsider (consultant, new employee, etc) to understand everything you have. I may make some of you cringe because I don’t normalize this; it’s small data and not worth the effort, especially with page compression (see Brad McGehee’s post on compression) turned on.

What I like doing is having automated reports with logic behind it saying “if drive size is X, file growth is Y, then you can grow to a potential size of Z”, follow that up with “the used space in this database has grown X in the last 90 days and Y in the last 30 days, so according to those trends I’ll hit the potential size in Z days”. One of my favorite reports takes that last Z as a parameter and tells me anything that will run out of space before a specific number of days. I get that report daily for anything that will run out of space in the next two weeks and a separate instance of that report monthly telling me what will run out of space in the next 100 days. For me to run out of space without having a couple automated emails sitting in my inbox yelling at me saying “See, I told you that would happen!!!”, a database has to do something crazy and out of character. The only part of that last sentence that doesn’t really happen is the automated email yelling, those voices don’t exist outside of my head (do they?).

Even without monitoring disk space, you can predict autogrowths. This is useful for trying preventing autogrowth all together if you’re into that, but with instant file initialization this has never been a spot I’ve chosen to spend my time on. However, I do care about fragmentation, and I have a report that will tell me if we continue growing according to our 30 or 90 day trends then the data files will have to grow X times. If X is higher than 4 and filegrowth is below 2 GB chunks, it shows up on my report and I change the filegrowth to make it more reasonable. If you allow percent filegrowths in your environment (just say no), or even if one sneaks through, things get complex and you have to stroll through financial websites to find the calculations for compounding interest to figure out how much the files will grow.

Cast(CEILING(LOG((Cast((ProjectedGrowthInAMonthMB - FileGroupFreeMB) as Dec(20,4)) + Cast(FileSizeMB as Dec(20,4)))/(Cast(FileSizeMB as Dec(20,4)))) /LOG(1+(Cast(Growth as Dec(20,4))/100))) AS Int)

Something inside that database is growing, but what is it?

Now if a database is growing, the first thing someone will ask you is “why is it growing”. That’s an answer I can’t give you, but “where is it growing” I can do. This is why I’m also tracking the table sizes in a database. Be fair to yourself here and admit it up front that you don’t care if a table changes size radically if the largest size is still insignificant to that database. So make up rules on “I want to watch tables that are bigger than…” and only store the stats for those tables to keep your little collections from being the source of your size issues. For me, that means that a table has to be both over 100MB in total size including indexes and over 0.5% of the size of the database. I’ve used these sizes for a while now and have never had an issue on either side; no one needed info I didn’t have for space trending issues and the collection time and space I used never caused any complaints. This isn’t saying that someone won’t ask “did that table have 10 or 11 records this time last month”, but I’m not willing to collect enough to answer every little question.

If a database is growing at an alarming rate it’s easy to assume the largest table is causing it. That’s not always the case, and it’s not too uncommon for a database change to either create a new table that grows quick intentionally or accidentally cause an old table to stop cleaning itself up. No matter what the reason is, if you want to see how a database is growing, look at your table stats documenting the growth. Developers always listen to you better when you tell them this is what happened and here’s how I know while the words “I think” are viewed as instant discredidation.

How am I supposed to do this stuff?

You’ll need somewhere to store all the info. I like to store two weeks of info on the local server and pull everything to a central server nightly where data is held for a full 13 months. However, to keep me under control, lets just focus on getting everything to the local server in the following tables.

CREATE TABLE [Drives](
    [DateAdded] [smalldatetime] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
    [CapacityMB] [int] NULL,
    [FreeSpaceMB] [int] NULL,
 CONSTRAINT [DBInven_Drives_DriveLetter_MountPoint_DateAdded] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DriveLetter] ASC,
    --[MountPoint] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Files](
    [DateAdded] [smalldatetime] NOT NULL,
    [DatabaseName] [sysname] NOT NULL,
    [FileName] [nvarchar](128) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [FileGroup] [sysname] NULL,
    [SizeMB] [int] NULL,
    [UsedMB] [int] NULL,
    [FreeMB] [int] NULL,
    [MaxSizeMB] [int] NULL,
    [Growth] [int] NULL,
    [IsPercentGrowth] [bit] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
 CONSTRAINT [DBInven_Files_DatabaseID_FileID_DateAdded_U_C_IX] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [FileName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Tables](
    [DateAdded] [datetime] NOT NULL,
    [DatabaseName] [nvarchar](128) NOT NULL,
    [SchemaName] [nvarchar](128) NOT NULL,
    [TableName] [nvarchar](128) NOT NULL,
    [RowCounts] [bigint] NULL,
    [AllocatedMB] [int] NULL,
    [DataSizeMB] [int] NULL,
    [IndexSizeMB] [int] NULL,
    [PercentOfDB] [decimal](5, 2) NULL,
 CONSTRAINT [PK_Tables] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [SchemaName] ASC,
    [TableName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If you use mountpoints, and if you call them “mountpoint” followed by an incrementing digit, and you never have more than can fit into a tinyint, uncommenting those pieces of the tables will do wonders for you. In case you didn’t guess, they’re commented out because the people who do use them probably don’t use them that way.

DECLARE @RunTime SmallDateTime
DECLARE @SQL NVarChar(max)

SET @RunTime = (SELECT Max(DateAdded) FROM Perf.DBInven.Drives)

SET @SQL = N'use [?]; 
    INSERT INTO Perf..Files 
    select ''' + cast(@RunTime as nvarchar(50)) + ''' 
        , DatabaseName = db_name()
        , FileName = f.Name
        , TypeDesc = f.Type_Desc
        , FileGroup = fg.Name
        , SizeMB = ((f.size*8/1024)) 
        , UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
        , FreeMB = ((f.size-FileProperty(f.name, ''SpaceUsed''))*8/1024) 
        , MaxSizeMB = Case f.Max_Size
            WHEN -1 THEN -1
            ELSE f.Max_Size / 128
            END
        , f.Growth
        , f.Is_Percent_Growth
        , DriveLetter = left(f.physical_name,1) 
        /*, MountPoint = CASE substring(f.physical_name,4,10)
            WHEN ''MountPoint'' THEN 
                Case isnumeric(substring(f.physical_name,14,4)) --MountPoint >= 1000
                WHEN 1 THEN substring(f.physical_name,14,4)
                ELSE Case isnumeric(substring(f.physical_name,14,3)) --MountPoint >= 100
                    WHEN 1 THEN substring(f.physical_name,14,3)
                    ELSE Case isnumeric(substring(f.physical_name,14,2)) --MountPoint >= 10
                        WHEN 1 THEN substring(f.physical_name,14,2)
                        ELSE substring(f.physical_name,14,1) -- MountPoint single digit
                        END
                    END
                END
            ELSE Cast(0 as varchar(10)) --Root drive is MountPoint 0 
            END*/
    from sys.database_files f
        LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id
    '

exec sp_MSforeachdb @SQL

And now to grab the tables. Keep in mind that I specifically only want to grab stuff over 0.5% of the database size AND over 100 MB. You’ll see that towards the end of the code.

DECLARE @SQL NVarChar(max)
DECLARE @DateAdded SmallDateTime

SELECT @DateAdded = MAX(DateAdded) FROM Perf.DBInven.Drives 

SET @SQL = ' Use [?];
    DECLARE @DatabaseSize Float

    IF OBJECT_ID(''tempdb..#TableSizes'') IS NOT NULL BEGIN
          DROP TABLE #TableSizes
    END

    SELECT DatabaseName = DB_NAME()
    , TableName = a2.name
    , SchemaName = a3.name
    , RowCounts = a1.rows
    , TableSizeMB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , DataSizeMB = a1.data / 128
    , IndexSizeMB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
INTO #TableSizes
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> ''S'' and a2.type <> ''IT''

    SELECT @DatabaseSize = Sum(TableSizeMB) FROM #TableSizes

    INSERT INTO Perf..[Tables]
    SELECT ''' + Cast(@DateAdded as VarChar(50)) + '''
        , DatabaseName
        , SchemaName
        , TableName
        , RowCounts
        , TableSizeMB
        , DataSizeMB
        , IndexSizeMB
        , (TableSizeMB/@DatabaseSize) * 100 
    FROM #TableSizes
    WHERE TableSizeMB > .005 * @DatabaseSize 
        AND TableSizeMB > 100

    DROP TABLE #TableSizes'

exec sp_MSforeachdb @SQL

So this leaves us with the size of the drives. This isn’t an easy subject. You can get everything you need through SQL Server if you don’t use mountpoints, but you’re left with either Powershell or VBScript to capture it with mountpoints so you can query WMI. Another post, another day. I’d actually prefer to rewrite my VBScript as PowerShell to both learn PowerShell and put a better solution out here for you.

This was my last post I needed to prep for my Monitoring and Baselines Presentation for SQL Saturday where I talk about what you want to watch and why. Since the presentation didn’t get into how to watch everything, I made sure that everything was on my blog ahead of time.

Query Stats

DBAs rarely use the full potential of sys.dm_exec_query_stats.  It’s common to see the queries for looking at the most expensive queries according to any of the stats contained within the current cache, which is great to see.  However, if you grab snapshots of this information the proper way then you can get more detail that will really make a difference in your job.  Capturing this every 4 hours is probably enough, and here’s what it’s going to do for you.

First, using it the way most DBAs do and look at the most expensive queries according to a specific metric.  It works the same for them all, so we’ll just say we’re interested in physical reads in this case because our PLE isn’t what we’re hoping it could be.  You’ve probably seen the queries that directly query sys.dm_exec_query_stats to give you this similar to the one on MSDN for this DMV.  That will give you everything that’s in the cache right now, which could contain queries that recompile every hour in combination with other queries that have remained in the cache since the server was rebooted two weeks ago.  This means that when you’re querying to see what you can do to improve your PLE during business hours you’re getting anything that happens to be there now, not what ran in the time window you’re looking to improve.

Now look at it from the point of view that you capture this information every 4 hours.  You can get the total number of physical reads grouped by sql_handle where the interval_end time is in the last month, on weekdays, between 9:00 AM and 5:00 PM.  This is both a more complete and more accurate picture of what you care about, and you’re more likely to be focusing on the queries you should be tuning from the start.

Once you do start tuning, you want to know what effect you really had, and not just rely on the couple sets of test parameters you used in your NonProd environments.  A couple days after your changes made it to production, it’s a good idea to make sure you had the desired effect.  Query the data you have saved off for the last three days and for the same days last week.

What I went over so far is just realizing how this takes the traditional use of this to a new level, but there’s more functionality here now that it’s being saved off.  Now this is extremely useful in incident response.  Before you could see if anything was hogging resources as seen in my Recently Recompiled Resource Hogs post, but now you take it a step further by being able to verify if what’s running long now has always run that way or if something is different.  Also, with having the plan_handles saved off you can even see if it’s getting a different execution plan, which will help you determine if this was a change in data or if it’s a bad execution plan.

Not to shoot myself in the foot, but this isn’t perfect.  If you capture this data every 4 hours starting at 12:00 and it recompiles at 3:59 then you’ll only capture one minute for that interval.  Also, if you’re dealing with a query that can’t be cached then it won’t be found here at all; this is why you’re also running traces to find the big queries and watching sys.dm_os_performance_counters to know when there are a lot of compiles and recompiles.

To start capturing this, here are the tables you’ll need to capture everything.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

/*
IF OBJECT_ID('QueryStats') IS NOT NULL BEGIN
    DROP TABLE QueryStats
END

IF OBJECT_ID('QueryStatsLast') IS NOT NULL BEGIN
    DROP TABLE QueryStatsLast
END

IF OBJECT_ID('QueryText') IS NOT NULL BEGIN
    DROP TABLE QueryText
END
--*/

CREATE TABLE [dbo].[QueryStats](
    [interval_start] [smalldatetime] NOT NULL,
    [interval_end] [smalldatetime] NOT NULL,
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [execution_count] [int] NOT NULL,
    [total_elapsed_time_ms] [int] NOT NULL,
    [min_elapsed_time_ms] [int] NOT NULL,
    [max_elapsed_time_ms] [int] NOT NULL,
    [total_worker_time_ms] [int] NOT NULL,
    [min_worker_time_ms] [int] NOT NULL,
    [max_worker_time_ms] [int] NOT NULL,
    [total_logical_reads] [int] NOT NULL,
    [min_logical_reads] [int] NOT NULL,
    [max_logical_reads] [int] NOT NULL,
    [total_physical_reads] [int] NOT NULL,
    [min_physical_reads] [int] NOT NULL,
    [max_physical_reads] [int] NOT NULL,
    [total_logical_writes] [int] NOT NULL,
    [min_logical_writes] [int] NOT NULL,
    [max_logical_writes] [int] NOT NULL,
    [creation_time] [smalldatetime] NOT NULL,
    [last_execution_time] [smalldatetime] NOT NULL,
) ON [PRIMARY]
GO

--Wide is bad, if you have a lot of nonclustered indexes.  
--This is a high-writes, low-reads table, so a single clustered index is probably all I want.
--With Interval_End being the first key field, they're inserted in order and eliminate mass fragmentation
  -- as well as bad write performance.
CREATE UNIQUE CLUSTERED INDEX IX_QueryStats_intervalend_sqlhandle_statementstartoffset_planhandle_U_C ON QueryStats
(
    interval_end
    , sql_handle
    , statement_start_offset
    , plan_handle
)

CREATE TABLE [dbo].[QueryStatsLast](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryStatsLast_sqlhandle_planhandle_statementstartoffset_U_C] ON [dbo].[QueryStatsLast] 
(
    [sql_handle] ASC,
    [plan_handle] ASC,
    [statement_start_offset] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

CREATE TABLE [dbo].[QueryText](
    [sql_handle] [varbinary](64) NOT NULL,
    [QueryText] [nvarchar](max) NOT NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [objtype] [nvarchar](20) NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryText_sqlhandle_U_C] ON [dbo].[QueryText] 
(
    [sql_handle] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Now that the tables are created, here’s how I populate them.  Set this up to run once every 4 hours to keep the data small during typical running, and kick it up to once an hour or more if needed to pinpoint an issue if you need it.

DECLARE @interval_start smalldatetime
    , @interval_end smalldatetime

IF OBJECT_ID('tempdb..#QS') IS NOT NULL BEGIN
    DROP TABLE #QS 
END 

CREATE TABLE [dbo].[#QS](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) 
/*
--This sounded like a great idea, but it just slowed it down several seconds.
CREATE UNIQUE CLUSTERED INDEX TempQS_Cluster ON #QS  
(
    sql_handle
    , plan_handle
    , statement_start_offset
)
*/

--The Cached Plans Object Type is in here in case you want to treat ad-hoc or prepared statements differently
INSERT INTO #QS 
SELECT qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , cp.objtype
    , qs.execution_count 
    , total_elapsed_time_ms = qs.total_elapsed_time/1000
    , min_elapsed_time_ms = qs.min_elapsed_time/1000
    , max_elapsed_time_ms = qs.max_elapsed_time/1000 
    , total_worker_time_ms = qs.total_worker_time/1000 
    , min_worker_time_ms = qs.min_worker_time/1000
    , max_worker_time_ms = qs.max_worker_time/1000
    , qs.total_logical_reads
    , qs.min_logical_reads
    , qs.max_logical_reads
    , qs.total_physical_reads
    , qs.min_physical_reads 
    , qs.max_physical_reads
    , qs.total_logical_writes 
    , qs.min_logical_writes 
    , qs.max_logical_writes
    , qs.creation_time
    , qs.last_execution_time
    , DateAdded = getDate()
FROM sys.dm_exec_query_stats AS qs
    INNER JOIN sys.dm_exec_cached_plans cp on qs.plan_handle = cp.plan_handle
--WHERE cp.objtype NOT IN ('Adhoc')

INSERT INTO QueryText (sql_handle, QueryText, DatabaseName, objtype)
SELECT QS.sql_handle
    , QueryText = qt.text
    , DatabaseName = DB_NAME(max(qt.dbid))
    , max(QS.objtype)
FROM (SELECT #QS.sql_handle
        , #QS.objtype  
        FROM #QS
            LEFT JOIN QueryText QST ON #QS.sql_handle = QST.sql_handle
        WHERE QST.sql_handle IS NULL) QS
    CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) qt
GROUP BY QS.sql_handle
    , qt.text

--All the values are the same in each table
SELECT TOP 1 @interval_start = dateadded FROM QueryStatsLast
SELECT TOP 1 @interval_end = dateadded FROM #QS

IF @interval_start IS NULL BEGIN
    SELECT @interval_start = create_date
    FROM sys.databases
    WHERE name = 'tempdb'
END

INSERT INTO QueryStats (interval_start
    , interval_end
    , sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time)
SELECT @interval_start
    , @interval_end
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.execution_count - ISNULL(qsl.execution_count, 0)
    , qs.total_elapsed_time_ms - ISNULL(qsl.total_elapsed_time_ms, 0)
    , qs.min_elapsed_time_ms 
    , qs.max_elapsed_time_ms 
    , qs.total_worker_time_ms - ISNULL(qsl.total_worker_time_ms, 0)
    , qs.min_worker_time_ms
    , qs.max_worker_time_ms 
    , qs.total_logical_reads - ISNULL(qsl.total_logical_reads, 0)
    , qs.min_logical_reads 
    , qs.max_logical_reads 
    , qs.total_physical_reads - ISNULL(qsl.total_physical_reads, 0)
    , qs.min_physical_reads 
    , qs.max_physical_reads 
    , qs.total_logical_writes - ISNULL(qsl.total_logical_writes, 0)
    , qs.min_logical_writes 
    , qs.max_logical_writes 
    , qs.creation_time
    , qs.last_execution_time
FROM #QS qs
    LEFT OUTER JOIN QueryStatsLast qsl ON qs.sql_handle = qsl.sql_handle
        AND qs.plan_handle = qsl.plan_handle
        AND qs.statement_start_offset = qsl.statement_start_offset
        AND qs.creation_time = qsl.creation_time 
WHERE qs.execution_count - ISNULL(qsl.execution_count, 0) > 0 --Only keep records for when it was executed in that interval

TRUNCATE TABLE QueryStatsLast

INSERT INTO QueryStatsLast
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , objtype
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time
    , DateAdded
FROM #QS

Sys.dm_os_performance_counters Demystified

The dmv sys.dm_os_performance_counters is awesome, if you can understand it. This is how I make it easy to read and use. Here are the values I watch and why I watch them. My list isn’t going to be perfect and you’re not going to agree with it 100%, and I’m ok with that. First, there is no perfect. Second, if we agree 100% then one of us is just mindlessly following the other which isn’t good.

  • Cache Hit Ratio – I ignore this value, but I still monitor it. I will, ideally, never be the only DBA on a team again, and everyone seems to think this value is cool.
  • Page Life Exp – My favorite! When you read a page from disk into memory how many seconds will it stay there? Just don’t use the outdated “300” rule or your disks will catch on fire!!!

    Burning Drive!!!

    PLE = 300

  • Page Lookups/Sec – How many pages are read from memory.
  • Page Reads/Sec – How many pages are read from disk.
  • Page Writes/Sec – How many pages are written to disk.
  • Lazy Writes/sec – How many pages are written to disk outside of a checkpoint due to memory pressure.
  • Batch Requests/sec – How busy is the server?
  • Trans/sec – How busy is the server?
  • Total Server Memory – How much memory SQL Server is currently using. Typically ramps up to Target value and PLE is low as it ramps up since new pages are in memory dropping the average.
  • Target Server Memory – How much memory SQL Server is allowed to use. Should be the same as the max memory setting, but memory pressure can cause this to decrease.
  • Memory Grants Pending – How many processes aren’t able to get enough memory to run. Should be 0, always 0, if not then find out why.
  • Deadlocks – How many deadlocks are we getting. Most apps handle deadlocks gracefully, but they still lose time doing it. If this number starts going up, start looking into it.
  • SQL Compilations/sec – This is a hidden performance killer! Some queries can’t be cached so they’re compiled every time they’re run. I’ve seen this with a query being run once a second and a big server was running slower than my laptop. It’s normal for things to compile throughout the day, it’s not normal for this number to be 10x higher than before that last upgrade.
  • SQL Re-Compliations/sec – Same goes here. The counters aren’t that much different.

If you know a little about this DMV then you know these values are cryptic. There’s several ways this data is stored and it has to be retrieved differently for each type to be useful. Then many of these are cumulative since the server was restarted, which isn’t going to help too much. Even worse, MSDN failed us on this one and figuring out this DMV required help outside of that site. Now for the good news, the script below will take care of all of that for you and leave you with some easy reading with values you can filter to the time periods you care about. If you want to add some of your own counters then just follow my lead on one that has the same cntr_type, or you can go to Rabin’s blog post that I learned from.

IF object_id('tempdb..#OSPC') IS NOT NULL BEGIN
    DROP TABLE #OSPC
END

DECLARE @FirstCollectionTime DateTime
    , @SecondCollectionTime DateTime
    , @NumberOfSeconds Int
    , @BatchRequests Float
    , @LazyWrites Float
    , @Deadlocks BigInt
    , @PageLookups Float
    , @PageReads Float
    , @PageWrites Float
    , @SQLCompilations Float
    , @SQLRecompilations Float
    , @Transactions Float

DECLARE @CounterPrefix NVARCHAR(30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
                            THEN 'SQLServer:'
                        ELSE 'MSSQL$' + @@SERVICENAME + ':'
                        END

--Grab the current values from dm_os_performance_counters
--Doesn't do anything by instance or database because this is good enough and works unaltered in all envirornments
SELECT counter_name, cntr_value--, cntr_type --I considered dynamically doing each counter type, but decided manual was better in this case
INTO #OSPC 
FROM sys.dm_os_performance_counters 
WHERE object_name like @CounterPrefix + '%'
    AND instance_name IN ('', '_Total')
    AND counter_name IN ( N'Batch Requests/sec'
                        , N'Buffer cache hit ratio'
                        , N'Buffer cache hit ratio base'
                        , N'Free Pages'
                        , N'Lazy Writes/sec'
                        , N'Memory Grants Pending'
                        , N'Number of Deadlocks/sec'
                        , N'Page life expectancy'
                        , N'Page Lookups/Sec'
                        , N'Page Reads/Sec'
                        , N'Page Writes/Sec'
                        , N'SQL Compilations/sec'
                        , N'SQL Re-Compilations/sec'
                        , N'Target Server Memory (KB)'
                        , N'Total Server Memory (KB)'
                        , N'Transactions/sec')

--Just collected the second batch in the query above
SELECT @SecondCollectionTime = GetDate()

--Grab the most recent values, if they are appropriate (no reboot since grabbing them last)
SELECT @FirstCollectionTime = DateAdded
    , @BatchRequests = BatchRequests
    , @LazyWrites = LazyWrites
    , @Deadlocks = Deadlocks
    , @PageLookups = PageLookups
    , @PageReads = PageReads
    , @PageWrites = PageWrites
    , @SQLCompilations = SQLCompilations
    , @SQLRecompilations = SQLRecompilations
    , @Transactions = Transactions
FROM OSPerfCountersLast 
WHERE DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'TempDB')

--If there was a reboot then all these values would have been 0 at the time the server came online (AKA: TempDB's create date)
SELECT @FirstCollectionTime = ISNULL(@FirstCollectionTime, (SELECT create_date FROM sys.databases WHERE name = 'TempDB'))
    , @BatchRequests = ISNULL(@BatchRequests, 0)
    , @LazyWrites = ISNULL(@LazyWrites, 0)
    , @Deadlocks = ISNULL(@Deadlocks, 0)
    , @PageLookups = ISNULL(@PageLookups, 0)
    , @PageReads = ISNULL(@PageReads, 0)
    , @PageWrites = ISNULL(@PageWrites, 0)
    , @SQLCompilations = ISNULL(@SQLCompilations, 0)
    , @SQLRecompilations = ISNULL(@SQLRecompilations, 0)
    , @Transactions = ISNULL(@Transactions, 0)

SELECT @NumberOfSeconds = DATEDIFF(ss, @FirstCollectionTime, @SecondCollectionTime)

--I put these in alphabetical order by counter_name, not column name.  It looks a bit odd, but makes sense to me
--Deadlocks are odd here.  I keep track of the number of deadlocks in the time period, not average number of deadlocks per second.
  --AKA, I keep track of things the way I would refer to them when I talk to someone.  "We had 2 deadlocks in the last 5 minutes", not "We averaged .00002 deadlocks per second there"
INSERT INTO OSPerfCounters (DateAdded, Batch_Requests_Sec, Cache_Hit_Ratio, Free_Pages, Lazy_Writes_Sec, Memory_Grants_Pending
    , Deadlocks, Page_Life_Exp, Page_Lookups_Sec, Page_Reads_Sec, Page_Writes_Sec, SQL_Compilations_Sec, SQL_Recompilations_Sec
    , ServerMemoryTarget_KB, ServerMemoryTotal_KB, Transactions_Sec)
SELECT @SecondCollectionTime
    , Batch_Request_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec') - @BatchRequests) / @NumberOfSeconds
    , Cache_Hit_Ratio        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio')/(SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base')
    , Free_Pages            = (SELECT cntr_value FROM #OSPC WHERE counter_name =N'Free pages')
    , Lazy_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec') - @LazyWrites) / @NumberOfSeconds
    , Memory_Grants_Pending    = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Memory Grants Pending')
    , Deadlocks                = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec') - @Deadlocks) 
    , Page_Life_Exp         = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page life expectancy')
    , Page_Lookups_Sec      = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec') - @PageLookups) / @NumberOfSeconds
    , Page_Reads_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
    , Page_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec') - @PageWrites) / @NumberOfSeconds
    , SQL_Compilations_Sec  = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec') - @SQLCompilations) / @NumberOfSeconds
    , SQL_Recompilations_Sec= ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec') - @SQLRecompilations) / @NumberOfSeconds
    , ServerMemoryTarget_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Target Server Memory (KB)')
    , ServerMemoryTotal_KB  = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Total Server Memory (KB)')
    , Transactions_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec') - @Transactions) / @NumberOfSeconds

TRUNCATE TABLE OSPerfCountersLast

--Note, only saving the last value for ones that are done per second.
INSERT INTO OSPerfCountersLast(DateAdded, BatchRequests, LazyWrites, Deadlocks, PageLookups, PageReads
    , PageWrites, SQLCompilations, SQLRecompilations, Transactions)
SELECT DateAdded            = @SecondCollectionTime
    , BatchRequests            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec')
    , LazyWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec')
    , Deadlocks             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec')
    , PageLookups            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec')
    , PageReads                = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec')
    , PageWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec')
    , SQLCompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec')
    , SQLRecompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec')
    , Transactions             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec')

DROP TABLE #OSPC

Throw that code above here in a proc, schedule it to run every so often (I like 5 minutes) and it’ll….fail. It kinda relies on a couple tables you should create first. Here ya go.

CREATE TABLE OSPerfCounters(
    DateAdded datetime NOT NULL
    , Batch_Requests_Sec int NOT NULL
    , Cache_Hit_Ratio float NOT NULL
    , Free_Pages int NOT NULL
    , Lazy_Writes_Sec int NOT NULL
    , Memory_Grants_Pending int NOT NULL
    , Deadlocks int NOT NULL
    , Page_Life_Exp int NOT NULL
    , Page_Lookups_Sec int NOT NULL
    , Page_Reads_Sec int NOT NULL
    , Page_Writes_Sec int NOT NULL
    , SQL_Compilations_Sec int NOT NULL
    , SQL_Recompilations_Sec int NOT NULL
    , ServerMemoryTarget_KB int NOT NULL
    , ServerMemoryTotal_KB int NOT NULL
    , Transactions_Sec int NOT NULL
)

--You'll typically only query this by one value, which is added sequentually.  No page splits!!!
CREATE UNIQUE CLUSTERED INDEX IX_OSPerfCounters_DateAdded_U_C ON OSPerfCounters
(
    DateAdded
) WITH (FillFactor = 100)

--Only holds one value at a time, indexes are a waste
CREATE TABLE OSPerfCountersLast(
    DateAdded datetime NOT NULL
    , BatchRequests bigint NOT NULL
    , LazyWrites bigint NOT NULL
    , Deadlocks bigint NOT NULL
    , PageLookups bigint NOT NULL
    , PageReads bigint NOT NULL
    , PageWrites bigint NOT NULL
    , SQLCompilations bigint NOT NULL
    , SQLRecompilations bigint NOT NULL
    , Transactions bigint NOT NULL
)

The important part of all this is how you use it. It’s tempting to just look at the last 7 records and say that you know what’s going on, that makes me want to slap you. Every server is different, every server has different loads and baselines, and you’re either underworked or you don’t know what those baselines are for every server you manage. I do simple baselines every time I look at an incident and look at the last hour, the same time yesterday, and the same time a week ago. That gives you a chance to see what’s normal for this server and what’s different right now. This query is so simple you’ll wonder why I even posted it, but it’s effective which is why it’s here. The 7 records per day thing, that’s because 21 records show up on my screen without me scrolling, it is NOT a magic number!

SELECT 'Today', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
ORDER BY dateadded DESC
) X

UNION
SELECT 'Yesterday', * FROM (SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-1
ORDER BY dateadded DESC
) Y

UNION
SELECT 'Last Week', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-7
ORDER BY dateadded DESC) Z

ORDER BY dateadded DESC

And, well, something I’ve been skipping on my posts and telling people to handle cleanup on their own…. Here’s step 2 of my jobs that populate my monitoring tables to keep your data from being the ever-growing data you’re struggling with in every other app. I delete in batches according to the clustered index. It’s overkill for something deleting one row at a time, or, even if you put this in a separate daily job, 288 rows if the process is scheduled every 5 minutes. So, why the batches? Because I copy/paste my own code everywhere, batches is reusable, and this is how I chop off the tail end of EVERYTHING!

SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM OSPerfCounters
    where dateadded < (GetDate() - 400)
END 

In the beginning I mentioned that if you agreed with me 100% then one of us is a mindless monkey. Look, I put this out there first, so I’m obviously not the mindless monkey here, am I? There’s a box below that gives you a chance to show that you’re not a mindless monkey either! Tell me I’m wrong, how I can do better, and how everyone else reading this can benefit from it even more! I’ll promote you from mindless monkey to talking monkey!

Deadlock – A quick, easy view

Deadlocks are rough to work with. Here are the scripts I use to capture deadlocks, find which ones are reoccurring, and view them along with a couple free eBooks to resolve them.

You could have thousands of deadlocks and it would take you forever to find out which ones are reoccurring or which ones are some freak accident with an annual process. I used to read through deadlock graphs one-by-one to see what was reoccurring, and I used to be less satisfied with my job as well. If you’re going to bang your head on your desk making sure a deadlock never happens again, you’ll want to make sure it wasn’t a one-time event that wouldn’t have happened again anyways.

I’m not going to get into how to figure out deadlocks, that’s a chapter in a book more than it’s a blog post. Chapter 7 to be exact in the FREE eBook or $25 physical book for Accidental DBAs by Jonathan Kehayias and Ted Krueger. Also, a whole book on blocking by Kalen Delaney in a FREE eBook or $22 physical book. I have to admit that Ted Krueger isn’t a name I recognize. However, Jonathan and Kalen are easily on my list of top 5 best MVPs out there.

Anyways, back to me. Here’s what I use to look at deadlock traces. The base code was found on the internet years ago, and you can find it in several places now. Because of that I have no idea who the original author is, but I’d be glad to throw up credit here for anyone who can show me a site dated older than when I started using it.

There are two versions here, almost identical code in each. The first one summarizes the deadlocks and gives you a count of the occurrences. The second one shows each SPID involved in the deadlock separately along with the XML for the deadlock.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%')
--SET @Path = 'D:\Perflogs\PerfTraces\DeadlockTrace.trc'

;with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-06-17 00:00' AND '2013-06-30 00:00'
)
SELECT [Procedure], Inputbuffer, dMonth = DatePart(Month, DeadLockTime), dDay = DatePart(Day, DeadLockTime), DeadlockCount = Count(*), dMax = max(DeadlockTime), dMin = min(DeadlockTime)
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = Deadlock.Process.value('@clientapp', 'varchar(100)'),
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [HostPID] = Deadlock.Process.value('@hostpid', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
) X
--WHERE Victim = 0 
--WHERE TimeoutID IN (SELECT RowID FROM CTE WHERE [PROCEDURE] = 'master.dbo.FakeProcName')
--WHERE inputbuffer like '%select top 10%'
GROUP BY [procedure], Inputbuffer, DatePart(Month, DeadLockTime), DatePart(Day, DeadLockTime)
--ORDER BY TimeoutID 

Then there’s this to actually let me look at the XML. Although you can easily see in the script below which deadlocks reoccur back-to-back, it’s not so easy to see reoccurrence of deadlocks that happen once every morning between 2 and 3 AM. That’s where the first script shines, and you should use and abuse it.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%');

with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@Path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-03-08 02:00' AND '2013-03-08 04:00'
    AND StartTime > DATEADD(Hour, -2, GetDate())
)
SELECT *
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = CASE LEFT(Deadlock.Process.value('@clientapp', 'varchar(100)'), 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(Deadlock.Process.value('@clientapp', 'varchar(100)'),32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(Deadlock.Process.value('@clientapp', 'varchar(100)'), 67, len(Deadlock.Process.value('@clientapp', 'varchar(100)'))-67)
                    ELSE Deadlock.Process.value('@clientapp', 'varchar(100)')
                    END ,
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
-- ORDER BY [DeadlockObject], [ClientApp], [Procedure], [Code]
) X
--WHERE Victim = 0 
--WHERE [Procedure] = 'master.dbo.FakeProcedureName'
ORDER BY TimeoutID DESC

Finally, to make all of this happen you’ll need a deadlock trace. Don’t go thinking that I open profiler, make a deadlock trace, script it out, and run it every time I need one. Reusable code is awesome, and this is reused every time I see an alert come through that we are having too many deadlocks. I have this proc in our Perf database, which is on every SQL Server we manage. So you don’t have to dig through the numbers, it grabs every event for the deadlock graph and nothing else.

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_ServerSideTrace_Deadlock]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ServerSideTrace_Deadlock]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE procedure [dbo].[usp_ServerSideTrace_Deadlock]
    @FilePath nvarchar(1000) = N'D:\Perflogs\PerfTraces\DeadlockTrace',                                            
    @maxfilesize bigint = 25,
    @maxfiles int = 10,
    @TraceDBID bit=0,
    @DBID int=0
AS

declare @rc int
declare @TraceID int

--Create Trace
exec @rc = sp_trace_create @TraceID output, 2, @FilePath, @maxfilesize, NULL, @maxfiles
if (@rc != 0) 
Begin
    select ErrorCode=@rc    
    return
End

-- Set the RPC Completed and SQL Batch Completed events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 148, 11, @on
exec sp_trace_setevent @TraceID, 148, 12, @on
exec sp_trace_setevent @TraceID, 148, 14, @on
exec sp_trace_setevent @TraceID, 148, 1, @on
exec sp_trace_setevent @TraceID, 25, 15, @on
exec sp_trace_setevent @TraceID, 25, 8, @on
exec sp_trace_setevent @TraceID, 25, 32, @on
exec sp_trace_setevent @TraceID, 25, 56, @on
exec sp_trace_setevent @TraceID, 25, 1, @on
exec sp_trace_setevent @TraceID, 25, 9, @on
exec sp_trace_setevent @TraceID, 25, 57, @on
exec sp_trace_setevent @TraceID, 25, 2, @on
exec sp_trace_setevent @TraceID, 25, 10, @on
exec sp_trace_setevent @TraceID, 25, 3, @on
exec sp_trace_setevent @TraceID, 25, 11, @on
exec sp_trace_setevent @TraceID, 25, 12, @on
exec sp_trace_setevent @TraceID, 25, 13, @on
exec sp_trace_setevent @TraceID, 25, 6, @on
exec sp_trace_setevent @TraceID, 25, 14, @on
exec sp_trace_setevent @TraceID, 25, 22, @on
exec sp_trace_setevent @TraceID, 59, 32, @on
exec sp_trace_setevent @TraceID, 59, 56, @on
exec sp_trace_setevent @TraceID, 59, 1, @on
exec sp_trace_setevent @TraceID, 59, 57, @on
exec sp_trace_setevent @TraceID, 59, 2, @on
exec sp_trace_setevent @TraceID, 59, 14, @on
exec sp_trace_setevent @TraceID, 59, 22, @on
exec sp_trace_setevent @TraceID, 59, 3, @on
exec sp_trace_setevent @TraceID, 59, 12, @on

-- Set the Filters
--Default to exclude the trace from Tracing Itself
exec sp_trace_setfilter @TraceID, 10, 0, 7, N'SQL Server Profiler'

--Add other filters as specified by parameters
If @TraceDBID=1
    exec sp_trace_setfilter @TraceID, 3, 0, 0, @DBID

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
return @TraceID

GO

The basics of a deadlock is that process 1 gets a lock on data A while process 2 gets a lock on data B. Then process 1 says it needs a lock on data B to continue while process 2 needs a lock on data A to continue. It’s a simple basic concept, but it gets complex…too complex for me to cover it here and say I did it justice. Sometimes a process deadlocks on itself, doing a parallelism deadlock as demonstrated by Jason Strate. Solutions also vary from order of operations (lets make processes 1 and 2 both get data A first, then they can’t deadlock), tuning (if the locks are released quicker then there’s less time for deadlocks), reducing the number of transactions (the lock on data A was grabbed earlier in this transaction, and there’s no reason these need to be in a explicit transaction), scheduling (these are both processes that have to run between 10 PM and 6 AM, why do they both start at midnight?), and many other approaches. There isn’t one right answer, and anyone promising a single right answer in one short blog post is selling you short. That’s either a very long blog post or it’s an incomplete answer. I go with option 3, and here’s my post on the best way to view them along with a couple book chapters to get you started on fixing what you found.

Related Content:

SQL Server Concurrency: Locking, Blocking and Row Versioning By Kalen Delaney
FREE eBook or $22 physical book

Troubleshooting SQL Server: A Guide for the Accidental DBA By Jonathan Kehayias and Ted Krueger
FREE eBook or $25 physical book

Recently Recompiled Resource Hogs

It’s not too uncommon for a query to get a new execution plan that performs a lot worse than it could, and sometimes it’s bad enough to drag the whole server down to a halt.  When it’s something obvious such as a query going from 2 seconds duration to 30 seconds you’ll pick it up on a standard monitoring trace.  That’s not always the case, and the query that hits your server twice a second going from 100ms to 900ms can fly under the radar except for users saying the server isn’t as responsive as it was yesterday.

Before you go all out and throw up an intrusive trace that can slow down your server capturing an unfiltered sample of RPC:Completed and SQL:BatchCompleted to run aggregates on, see what the DMVs can tell you.  If you look in sys.dm_exec_query_stats you can get the total amount of resources used since the last recompilation, which is going to give you pretty much everything you need as long as you’re looking for a query that can stay in the cache.  The only trick is to look at everything in the same context.  Something that did 10,000,000 reads since it was compiled yesterday is nothing compared to something that did 1,000,000 reads since it was compiled 10 minutes ago.  Here’s my answer to this:

SELECT top 10 --WorkerTime_Sec = cast(round(total_worker_time / 1000 / 1000.0, 0) as Int)
    WorkerTimeSec_PerSec =  cast(cast(round(total_worker_time / 1000 / 1000.0, 0) as Int) / (Cast(DateDiff(second, Creation_Time, GetDate()) as DEC(20,2)) + .00000001) as DEC(20,2))
    --, LogicalReads_K = total_logical_reads / 1000
    , LogicalReads_K_PerSec = cast(total_logical_reads / 1000 / (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    --, ElapsedTime_Sec = cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)
    , ElapsedTimeSec_PerSec = cast(cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)/ (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    , Seconds_Since_Recompile = DateDiff(second, Creation_Time, GetDate())
    , Creation_Time
    , qs.execution_count
    , avg_reads = total_logical_reads / qs.execution_count
    , avg_elapsed_ms = total_elapsed_time / 1000 / qs.execution_count
    , avg_worker_ms = total_worker_time / 1000 / qs.execution_count
    , Offset = statement_start_offset
    , ObjectName = OBJECT_NAME(qt.objectid, qt.dbid)
    , qt.text
    , StatementText = SUBSTRING (qt.text, qs.statement_start_offset/2 + 1,
        abs(CASE WHEN qs.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2 
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1)
    --, qt.*
    --, qs.*
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE Creation_Time > DateAdd(Hour, -24, GetDate())
    AND (creation_time < DATEADD(SECOND, -60, GETDATE()) or execution_count > 5) --Has to be a minute old or executed 6 times before I consider the data stable
    --AND (total_worker_time > 20000 or execution_count > 15000)
ORDER BY 1 DESC 

I comment out what I don’t typically use, same as always with me.  Those pieces don’t usually change.  The only thing that does typically change is the date filter on the creation time; if someone says there weren’t any problems this time yesterday then it stays at the default value, if they say it worked fine before lunch then it drops down to 4 hours.

There are four things to keep in mind with this DMV.  Everything is in microseconds which I don’t use, so I “fixed” the problem in my query.  It will only hold data while the query is in cache, which means it’s never tracked here if it’s never in cache.  Each row is for a single plan for a single statement, although  you could easily aggregate it if you wanted to.  Also, it’s possible for the worker time or duration per second to be over 1 due to multiple simultaneous runs or, with worker time, parallelism can cause this.  I typically try not to restate what you should be reading on MSDN or BOL, but these aren’t things everyone expects.

For me to run this, I typically have to already have a complaint that a server is running slower than expected. Check out my Server Running Slow post to see everything else that runs along with this.

Reading Traces

Capturing information is useless unless you know how to use it. Here’s what I use to get the most out of my basic trace of anything taking over so many seconds. This is also my template for more invasive traces, and it gets tweaked on a case-by-case basis.

It’s my typical style with half of it commented out where less run portions can be highlighted and run and portions of my WHERE clause don’t have to be rewritten to be added in. The first line gets highlighted as soon as I open this script every time so I can run it, copy the path of the trace I’m looking for, and paste it into my main script. As for the WHERE clause, call me lazy or whatever it is you call people who can’t remember the right word, but I don’t always remember the exact wording of each column name that I want to filter by. I simply uncomment them as I need them.

The only thing I can say is really impressive here is finding the job names, which is a piece of this script I will never forget figuring out. Other than that, just little things to help readability when both running this script and when playing with the WHERE clause. I can’t tell you why duration is microseconds and CPU is milliseconds, but I can say the SELECT and WHERE clauses are set up to use the same units I would use when I talk to someone.

--SELECT * FROM sys.traces

/*
--sp_trace_setstatus id, status
exec sp_trace_setstatus 3,0 --stop
exec sp_trace_setstatus 3,2 --close
--0 stop
--1 start
--2 close
*/

--query running trace
SELECT TOP 10000 Database_Name = DB_Name(DatabaseID)
    , DatabaseName
    , starttime
    , endtime
    , textdata = cast(textdata as varchar(4000)) 
    , Duration_Sec = cast(duration/1000/1000.0 as Dec(10,1)) 
    , CPU_Sec = cast(cpu/1000.0 as Dec(10,1)) 
    , Reads_K = cast(reads/1000.0 as Dec(10,0)) 
    , Writes_K = Cast(writes/1000.0 as Dec(10,1)) 
    , hostname  
    , LoginName 
    , NTUserName 
    , ApplicationName = CASE LEFT(ApplicationName, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(ApplicationName,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(ApplicationName, 67, len(ApplicationName)-67)
                    ELSE ApplicationName
                    END 
    , SPID
    , TE.*
FROM fn_trace_gettable('D:\Traces\DurationOver10sec_110.trc',default) T
    LEFT JOIN sys.trace_events TE ON  T.EventClass = TE.trace_event_id
WHERE endtime > DATEADD(Hour, -3, GetDate()) 
    --AND Duration > 1000000 * 15 --in seconds
    --AND CPU > 1000 * 10 --in seconds
    --AND Writes > 10000
    --AND Reads > 1000000
    --AND te.category_id = 2 --Auto Grow/Shrink found in default trace
    --AND t.databaseid = DB_ID('master')
    --AND t.DatabaseName = 'tempdb'
    --AND t.LoginName <> 'shood'
    --AND t.textdata like '%%'
ORDER BY endTime DESC, starttime DESC

So, for those of you not too familiar with tracing, what does all of this mean and what am I looking for? It depends on what’s wrong. The day after an app’s upgrade I’m looking for procs that snuck through testing and are killing the server. When a user complains about performance you might see a new proc showing up because it’s getting a bad execution plan. If tempdb or the logs fill up I can tell you what had 100,000 writes and just happened to end at that exact moment. Sometimes I’ll glance to see if anything is hitting or approaching the app’s 30 second timeout.

A general trace such as the RPC:Completed and SQL:BatchCompleted that take over X seconds won’t answer all of your questions, but it will be a good start. The most important things are to always have something running (could be extended events instead of a trace) to tell you the big things that are running and to have your monitoring limited enough where you aren’t the one slowing down the server. Make sure it’s customized to each server since a 1 second duration in a web app is a big deal, but other servers you would get flooded with results unless you bump it up to 10 seconds. You can run more invasive traces when you need them.

There are two hits against extended events where I’m even using traces for this. First, I still have too many SQL 2005 servers around and like to do things as consistent as reasonably possible across all servers to make it simple. Second, I’m still not too happy about extended events skipping SQL:BatchCompleted in SQL 2008 as that misses out on finding large loops of small statements. Extended events are much more efficient, can do a lot that traces can’t do, but aren’t a 100% replacement for traces.

Comment on my posts, let me know what you think. Every post I have at Simple SQL Server is written to myself 5 years ago, and is exactly what I wish I had back then in my own mind. The key words being “in my own mind”, and I’d like to hear what’s on your mind. Be it constructive criticism on this post or a completely separate SQL Server subject you wish you knew more about either now or 5 years ago that you’d like to see thrown up for everyone to use, I’ll never know until you tell me.

Tracing Introduction

SQL Server tracing is essential for troubleshooting performance issues, yet it can put loads on your server that would cause noticeable slowness, capture so much information that it fills the drive, and I’ve even seen it cause a cluster failover more than once.  With a tool this powerful you just need to know how to use it before you use it.  It’s almost like the idea where companies want their entry-level people to have a year of experience.

There is one trace I’d like to see on every server, preferably as a startup proc or a job that starts automatically when SQL Agent starts.  It doesn’t matter how you do it, if you want me to be happy you’ll tell me this was running for at least 24 hours before you had a performance incident.  All I’m looking for is RPC:Completed and SQL:BatchCompleted with a reasonable filter of duration over 10 seconds.

If you’ve never used Profiler before, here’s a quick overview to make the server-side trace I’m using.  Leave the first screen at the defaults, with the possible exception of changing the template to blank.  Select RPC:Completed and SQL:BatchCompleted and set the Duration filter to greater than or equal to 10,000 ms which you can get to using the “Column Filters…” button.  You can collect all columns if you want to, it won’t hurt with these events although there are some columns that won’t help.

Create trace with duration filter

Run the trace, then you can export it to a script.  I’m not happy that you have to run it first, but in the end this is either something that doesn’t affect the server at all or something I run on a dev server that’s at the same patch level.

Export trace to file

There, you have the basic script to start the server-side trace.  You still need to tweak it a bit before you can use it, but the hard part is done.  Everything highlighted in yellow was changed by me.  The default max file size was changed from 5 MB to 25 MB because I view 25 to 50 MB to be the range that’s not annoyingly small or too big to move around or send to someone easily.  The second parameter of sp_trace_create is set to 2 to enable rollover files, the third parameter is set to my output file sans an extension, and the last parameter is added to say how many output files there are.

/****************************************************/
/* Created by: SQL Server 2008 R2 Profiler          */
/* Date: 06/16/2013  06:10:02 AM         */
/****************************************************/

-- Create a Queue
declare @rc int
declare @TraceID int
declare @maxfilesize bigint
set @maxfilesize = 25

-- Please replace the text InsertFileNameHere, with an appropriate
-- filename prefixed by a path, e.g., c:\MyFolder\MyTrace. The .trc extension
-- will be appended to the filename automatically. If you are writing from
-- remote server to local drive, please use UNC path and make sure server has
-- write access to your network share

exec @rc = sp_trace_create @TraceID output, 2, N'D:\Traces\DurationOver10Sec', @maxfilesize, NULL, 5
if (@rc != 0) goto error

-- Client side File and Table cannot be scripted

-- Set the events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 10, 15, @on
exec sp_trace_setevent @TraceID, 10, 31, @on
exec sp_trace_setevent @TraceID, 10, 8, @on
exec sp_trace_setevent @TraceID, 10, 16, @on
exec sp_trace_setevent @TraceID, 10, 48, @on
exec sp_trace_setevent @TraceID, 10, 1, @on
exec sp_trace_setevent @TraceID, 10, 9, @on
exec sp_trace_setevent @TraceID, 10, 17, @on
exec sp_trace_setevent @TraceID, 10, 10, @on
exec sp_trace_setevent @TraceID, 10, 18, @on
exec sp_trace_setevent @TraceID, 10, 34, @on
exec sp_trace_setevent @TraceID, 10, 3, @on
exec sp_trace_setevent @TraceID, 10, 11, @on
exec sp_trace_setevent @TraceID, 10, 51, @on
exec sp_trace_setevent @TraceID, 10, 12, @on
exec sp_trace_setevent @TraceID, 10, 13, @on
exec sp_trace_setevent @TraceID, 10, 14, @on
exec sp_trace_setevent @TraceID, 12, 15, @on
exec sp_trace_setevent @TraceID, 12, 31, @on
exec sp_trace_setevent @TraceID, 12, 8, @on
exec sp_trace_setevent @TraceID, 12, 16, @on
exec sp_trace_setevent @TraceID, 12, 48, @on
exec sp_trace_setevent @TraceID, 12, 1, @on
exec sp_trace_setevent @TraceID, 12, 9, @on
exec sp_trace_setevent @TraceID, 12, 17, @on
exec sp_trace_setevent @TraceID, 12, 10, @on
exec sp_trace_setevent @TraceID, 12, 14, @on
exec sp_trace_setevent @TraceID, 12, 18, @on
exec sp_trace_setevent @TraceID, 12, 3, @on
exec sp_trace_setevent @TraceID, 12, 11, @on
exec sp_trace_setevent @TraceID, 12, 51, @on
exec sp_trace_setevent @TraceID, 12, 12, @on
exec sp_trace_setevent @TraceID, 12, 13, @on

-- Set the Filters
declare @intfilter int
declare @bigintfilter bigint

set @bigintfilter = 10000000
exec sp_trace_setfilter @TraceID, 13, 0, 4, @bigintfilter

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
select TraceID=@TraceID
goto finish

error: 
select ErrorCode=@rc

finish: 
go

This is done as a server-side trace for a couple reasons. Profiler needs to stay open to keep collecting data, SQL Server has to send the data to Profiler which is usually across the network, then the trace data is held in memory. None of these are a problem for a properly filtered short-term trace. This is properly filtered, but I’m really hoping everyone will run this long-term.

I owe you another blog post on querying traces, then we can move in to putting it all together with where to start when a user makes their favorite statement to you…”The server is running slow”.

Wait Stats – Monitoring and Using

sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.

If, however, you knew what the values were 30 minutes ago and what they are now then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30 minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline so you can say if what you’re seeing now is just normal for the server or if you’re seeing a true incident. Remember, the only way to know if numbers on a server are a problem are not is to know what’s normal for that specific server.

A half-hour increment is a bit much for me, and I would prefer to know it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially on when looking into an incident on a Monday where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.

To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.

After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.

Here are the tables; I comment out stuff if I would cry if I accidently ran it at the wrong time.

/*
IF OBJECT_ID('WaitStats') IS NOT NULL BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats (
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast (
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO

Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.

Now to populate the tables. This script will need to be ran on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.

IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL BEGIN
    DROP TABLE #WaitStatsNew 
END

SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count 
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms 
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    , max_wait_time_ms = n.max_wait_time_ms --It's a max, not cumulative
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

TRUNCATE TABLE WaitStatsLast 

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew 

DROP TABLE #WaitStatsNew

Finally, querying it. Give it a descent time frame, I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.

DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 
--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger  
INSERT INTO #IgnoredWaits 
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK' 
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart 
    AND DateEnd <= @dEnd 
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post.  However, there is an Accidental DBA book out there that explains this amazingly well.  Here are links to the FREE PDF Version and the Physical Book.

Indexes – Unused and Duplicates

Indexes aren’t free, and many databases end up with unused indexes. Every time you make any update to a table you will be updating the clustered index (I assume no heaps, because I hate heaps), and every index that has uses one of the columns that were updated. Inserts and Deletes affect every column and will affect every index, with exceptions for non-typical indexes. These updates cost a lot, to the point that I have several processes that disable certain nonclustered indexes, do all of the work, then rebuild the indexes afterwards.

The problem is that you can’t tell very easily if an index is being used or not. You can look at sys.dm_db_index_usage_stats, but that information is very limited in that gets wiped out on index rebuilds, SQL service restarts (server reboot included), and anytime you ask it to. That’s not mentioning that it also ignores obscure usage of the index. However, the obscure usage is more of something to keep in mind than it is something to base all of your decisions on.

From my point of view, there are two answers to this problem.  There isn’t a right and wrong answer, it’s more of a step 1 and step 2.

The long and hard one is to capture the values in sys.dm_db_index_usage_stats on a normal basis (daily, before index maintenance) and give yourself long term statistics that you can determine if an index’s usage justifies its cost. Some indexes are updated 1,000,000 times for two uses. You obviously don’t need to keep this index around all the time, but you’re still paranoid about what those uses were. Were they ad-hoc queries run by a developer, or were they year-end reports run by the CEO? You can get some idea if you look at your history and check what day it was run. I would want to make a rollback script, drop the index, and make a searchable document of indexes I dropped in case anyone complained about year-end reports not working.

That was the long answer, which is right and is not replaced by the short answer. The script below is the short answer. If all of the key fields in an index match, or at least the first three in a larger index, then there’s a good chance that one of the indexes isn’t required or they could be consolidated.

For instance, if you have two indexes that have the same two key fields in the same order but the second index also includes two columns, then you’ll probably see usage on both indexes (see the long-hard solution above). If one index isn’t in use in this situation, your job is easy. If both are used, then look into dropping the index that doesn’t include any columns. Why? Because a slightly larger index rarely increases the overhead too much, and the queries that used the first index aren’t usually impacted too much by using a slightly larger index that has the same foundation.

This area gets shakey on how you should handle it, and is really one of the few areas in SQL Server I consider more of an art. For instance, the script below will show you indexes that have 5 key columns where the first three match. You look and find that both indexes are being used, but users are complaining that updates are a bit slow, too. I’d lean towards condensing by taking the the last two key columns of one index and including them on the other, but which index gets dropped and how do you test it?

Do my uses of “look into”, “rarely”, “usually”, and “too much” seem uncertain to you? Welcome to indexing.

Here are the scripts I use for both the short and long answers mentioned.  The first one is for the short answer, giving you indexes that are for the most part duplicated.

DECLARE @KeyFieldMatches INT

SET @KeyFieldMatches = 3 --Number of key fields to match in order

IF object_id('tempdb..#IndexList') IS NOT NULL BEGIN
    DROP TABLE #IndexList
END

IF object_id('tempdb..#IndexListShort') IS NOT NULL BEGIN
    DROP TABLE #IndexListShort
END

CREATE TABLE #IndexList (
      object_id    bigint not null
    , table_name varchar(150) not null
    , index_id int not null
    , index_name varchar(150) not null
    , index_column int not null
    , included bit not null
    , column_name varchar(150) not null
    , index_type int not null
)

CREATE CLUSTERED INDEX IndexList_Clu ON #IndexList (object_id, index_id, index_column, included)

CREATE TABLE #IndexListShort (
      object_id bigint not null
    , table_name varchar(150) not null
    , index_id int not null
    , index_name varchar(150) not null
    , column_names_first_n varchar(450) not null
    , column_names_key varchar(4000) not null
    , column_names_included varchar(4000) not null
)

DECLARE @object_id bigint
DECLARE @index_id int
DECLARE @List varchar(4000)
DECLARE @ListIncl varchar(4000)
DECLARE @ListShort varchar(450)

INSERT INTO #IndexList
SELECT o.object_id
    , table_name = o.name
    , i.index_id
    , index_name = i.name
    , index_column = ic.index_column_id
    , included = ic.is_included_column
    , column_name = c.name
    , index_type = i.type
FROM sys.objects o
    INNER JOIN sys.indexes i ON o.object_id = i.object_id
    INNER JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
    INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id

SET @object_id = (SELECT TOP 1 object_id FROM #IndexList)
SET @index_id = (SELECT TOP 1 index_id FROM #IndexList WHERE object_id = @object_id)

WHILE @object_id IS NOT NULL BEGIN
    SET @List = ''
    SET @ListIncl = ''
    SET @ListShort = ''

    SELECT @List = @List + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 0 ORDER BY i.index_column
    SELECT @List = substring(@List, 0, Len(@List))

    SELECT @ListIncl = @ListIncl + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 1 ORDER BY i.index_column
    SELECT @ListIncl = substring(@ListIncl, 0, Len(@ListIncl))

    SELECT @ListShort = @ListShort + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 0 and index_column <= @KeyFieldMatches ORDER BY i.index_column
    SELECT @ListShort = substring(@ListShort, 0, Len(@ListShort))

    INSERT INTO #IndexListShort 
    SELECT TOP 1 i.object_id
        , i.table_name
        , i.index_id
        , i.index_name
        , @ListShort
        , @List
        , CASE i.index_type --Clustered indexes include everything
            WHEN 1 THEN '*'
            ELSE @ListIncl
            END
    FROM #IndexList i
    WHERE i.object_id = @object_id and i.index_id = @index_id

    DELETE #IndexList WHERE object_id = @object_id and index_id = @index_id

    SET @object_id = (SELECT TOP 1 object_id FROM #IndexList)
    SET @index_id = (SELECT TOP 1 index_id FROM #IndexList WHERE object_id = @object_id)
END

SELECT table_name, index_name, column_names_key, column_names_included
FROM #IndexListShort i
WHERE EXISTS (SELECT * FROM #IndexListShort i2 WHERE i.object_id = i2.object_id AND i.column_names_first_n = i2.column_names_first_n GROUP BY object_id, column_names_first_n HAVING Count(*) > 1)
ORDER BY table_name, column_names_key

DROP TABLE #IndexList
DROP TABLE #IndexListShort

The long answer takes a bit more work to figure out. You need to store the information long-term, keep a snapshot of what it looked like most recently, and do all the calculations to make sure you have the right numbers. It’s really no different than what I do for other DMVs, so you’ll see this process duplicated a lot if you follow my work. The tables for storing this information:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[DMVIndex_Usage_Stats](
    [StartTime] [smalldatetime] NOT NULL,
    [EndTime] [smalldatetime] NOT NULL,
    [database_id] [smallint] NOT NULL,
    [object_id] [int] NOT NULL,
    [index_id] [int] NOT NULL,
    [user_seeks] [bigint] NOT NULL,
    [user_scans] [bigint] NOT NULL,
    [user_lookups] [bigint] NOT NULL,
    [user_updates] [bigint] NOT NULL,
    [system_seeks] [bigint] NOT NULL,
    [system_scans] [bigint] NOT NULL,
    [system_lookups] [bigint] NOT NULL,
    [system_updates] [bigint] NOT NULL,
 CONSTRAINT [PK_DMVIndex_Usage_Stats] PRIMARY KEY CLUSTERED 
(
    [StartTime] ASC,
    [database_id] ASC,
    [object_id] ASC,
    [index_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DMVIndex_Usage_Stats_Temp](
    [TS] [smalldatetime] NOT NULL,
    [database_id] [smallint] NOT NULL,
    [object_id] [int] NOT NULL,
    [index_id] [int] NOT NULL,
    [user_seeks] [bigint] NOT NULL,
    [user_scans] [bigint] NOT NULL,
    [user_lookups] [bigint] NOT NULL,
    [user_updates] [bigint] NOT NULL,
    [system_seeks] [bigint] NOT NULL,
    [system_scans] [bigint] NOT NULL,
    [system_lookups] [bigint] NOT NULL,
    [system_updates] [bigint] NOT NULL,
 CONSTRAINT [PK_DMVIndex_Usage_Stats_Temp] PRIMARY KEY CLUSTERED 
(
    [TS] ASC,
    [database_id] ASC,
    [object_id] ASC,
    [index_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

GO

And now the script that captures the information and does all of your calculations. You may have noticed that the tables themselves do NOT store the index or table names which was intentional. Name changes happen, tables and indexes get dropped and created, all of that is extra work for little benefit. Is the benefit worth it? I thought it wasn’t. I’m not expecting everyone to agree with me here because there are arguments going the other way. My advice is STEAL MY WORK!!! Don’t copy it, you need to steal it. Make it your own, learn it well enough that you can improve it and build upon it. At that point you’ll truly understand every bit of it and hopefully start up your own blog. I hear EvenSimplerSQLServer.com is available, but that’s subject to change. If you don’t want to go that far, then write your ideas in the comment box below. I change these posts from time to time, and this script was actually added months after the original post was made. Just don’t ask me why it wasn’t here in the first place, my answer is just….DOH!!!! Even saying I was only 6 days into blogging at the time I wrote the original doesn’t make that acceptable.

DECLARE @TS as SmallDateTime
DECLARE @Old_TS as SmallDateTime
DECLARE @Reboot_TS as SmallDateTime

SET @TS = GetDate()
SET @Reboot_TS = (SELECT create_date FROM sys.databases WHERE [name] = 'tempdb')

--If update wasn't already done this minute
  -- Delete all but the newest two updates - Only the newest will be used, so one spare is kept
  -- Add the current stats
  -- Calculate the usage between the last two and save in the permanent table
    -- If any of the values go down then the DMV reset for that index
      --It's a horrible join clause, but it makes everything null and uses the isnull(x,0)
      --AND N.user_seeks >= O.user_seeks will be hit if database is refreshed, detached, etc.
IF 0 = (SELECT Count(*) FROM Perf..DMVIndex_Usage_Stats_Temp WHERE TS = @TS) BEGIN
    DELETE Perf..DMVIndex_Usage_Stats_Temp 
    WHERE TS NOT IN 
        (SELECT DISTINCT TOP 2 TS 
         FROM Perf..DMVIndex_Usage_Stats_Temp 
         ORDER BY TS DESC)

    SET @Old_TS = --Newest TS in the temp table
        (SELECT TOP 1 TS
        FROM Perf..DMVIndex_Usage_Stats_Temp
        ORDER BY TS DESC)
        
    INSERT INTO Perf.dbo.DMVIndex_Usage_Stats_Temp
    SELECT @TS
        , database_id
        , object_id
        , index_id
        , user_seeks
        , user_scans
        , user_lookups
        , user_updates
        , system_seeks
        , system_scans
        , system_lookups
        , system_updates
    FROM sys.dm_db_index_usage_stats
    
    INSERT INTO Perf..DMVIndex_Usage_Stats
    SELECT isnull(@Old_TS, @Reboot_TS)
        , @TS
        , N.database_id
        , N.object_id
        , N.index_id
        , N.user_seeks - isnull(O.user_seeks, 0)
        , N.user_scans - isnull(O.user_scans, 0)
        , N.user_lookups - isnull(O.user_lookups, 0)
        , N.user_updates - isnull(O.user_updates, 0)
        , N.system_seeks - isnull(O.system_seeks, 0)
        , N.system_scans - isnull(O.system_scans, 0)
        , N.system_lookups - isnull(O.system_lookups, 0)
        , N.system_updates - isnull(O.system_updates, 0)
    FROM Perf..DMVIndex_Usage_Stats_Temp N --N = New, O = Old
        LEFT JOIN Perf..DMVIndex_Usage_Stats_Temp O ON O.TS = @Old_TS
            AND O.database_id = N.database_id
            AND O.object_id = N.object_id
            AND O.index_id = N.index_id
            AND N.user_seeks >= O.user_seeks
            AND N.user_scans >= O.user_scans
            AND N.user_lookups >= O.user_lookups
            AND N.user_updates >= O.user_updates
            AND N.system_seeks >= O.system_seeks
            AND N.system_scans >= O.system_scans
            AND N.system_lookups >= O.system_lookups
            AND N.system_updates >= O.system_updates
            AND @Reboot_TS < O.TS --They're reset on reboot, so don't compare if it was rebooted since last reading
    WHERE N.TS = @TS
END

Job Schedules – SQL Agent

This info should be easier to get than it is. Keep in mind that for something to run the subscription and schedule both have to be enabled.

To be fair, the majority of this code was grabbed from the internet, and this is not something I modified to the extent that I could call it my own. Michael Abair is the original author, and the original code can be found at the link below. My contribution is limited to very minor changes and a lot of hitting the tab key. I know it works just as well, maybe even 2ms faster, without the whitespace, but I had to do it.

http://www.sqlservercentral.com/scripts/Jobs/69088/

SELECT /*S.job_id,*/ S.job_name, S.is_job_enabled, S.is_schedule_enabled, S.schedule_name, S.Description
    ,avg_duration_in_seconds = avg(datediff(s, '1/1/2000', ('1/1/2000 ' + cast(stuff(stuff(right('000000' + cast(h.run_duration as varchar(6)), 6),5,0,':'),3,0,':') as datetime))))
    ,number_of_runs = count(1)
FROM (SELECT SJ.job_id 
        , SJ.name as job_name
        , SJ.enabled as is_job_enabled
        , SS.enabled as is_schedule_enabled
        , SS.name as schedule_name
        , CASE freq_type
            WHEN 1 THEN 'Occurs on ' + STUFF(RIGHT(active_start_date, 4), 3,0, '/') + '/' + LEFT(active_start_date, 4) + ' at '
                + REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime) /* hh:mm:ss 24H */, 9), 14), ':000', ' ') /* HH:mm:ss:000AM/PM then replace the :000 with space.*/
            WHEN 4 THEN 'Occurs every ' + CAST(freq_interval as varchar(10)) + ' day(s) '
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 8 THEN 'Occurs every ' + CAST(freq_recurrence_factor as varchar(10))
                + ' week(s) on '
                +
                REPLACE( CASE WHEN freq_interval&1 = 1 THEN 'Sunday, ' ELSE '' END
                + CASE WHEN freq_interval&2 = 2 THEN 'Monday, ' ELSE '' END
                + CASE WHEN freq_interval&4 = 4 THEN 'Tuesday, ' ELSE '' END
                + CASE WHEN freq_interval&8 = 8 THEN 'Wednesday, ' ELSE '' END
                + CASE WHEN freq_interval&16 = 16 THEN 'Thursday, ' ELSE '' END
                + CASE WHEN freq_interval&32 = 32 THEN 'Friday, ' ELSE '' END
                + CASE WHEN freq_interval&64 = 64 THEN 'Saturday, ' ELSE '' END
                + '|', ', |', ' ') /* get rid of trailing comma */

                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 16 THEN 'Occurs every ' + CAST(freq_recurrence_factor as varchar(10))
                + ' month(s) on '
                + 'day ' + CAST(freq_interval as varchar(10)) + ' of that month ' 
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 32 THEN 'Occurs ' 
                + CASE freq_relative_interval
                    WHEN 1 THEN 'every first '
                    WHEN 2 THEN 'every second '
                    WHEN 4 THEN 'every third '
                    WHEN 8 THEN 'every fourth '
                    WHEN 16 THEN 'on the last '
                    END
                + CASE freq_interval 
                    WHEN 1 THEN 'Sunday'
                    WHEN 2 THEN 'Monday'
                    WHEN 3 THEN 'Tuesday'
                    WHEN 4 THEN 'Wednesday'
                    WHEN 5 THEN 'Thursday'
                    WHEN 6 THEN 'Friday'
                    WHEN 7 THEN 'Saturday'
                    WHEN 8 THEN 'day'
                    WHEN 9 THEN 'weekday'
                    WHEN 10 THEN 'weekend'
                    END
                + ' of every ' + CAST(freq_recurrence_factor as varchar(10)) + ' month(s) '
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE 
                    WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 64 THEN 'Runs when the SQL Server Agent service starts'
            WHEN 128 THEN 'Runs when the computer is idle'
            END 
        AS [Description]
        FROM msdb.dbo.sysjobs SJ
            INNER JOIN msdb.dbo.sysjobschedules SJS ON SJ.job_id = SJS.job_id
            INNER JOIN msdb.dbo.sysschedules SS ON SJS.schedule_id = SS.schedule_id
            INNER JOIN msdb.dbo.syscategories SC ON SJ.category_id = SC.category_id
        --WHERE SC.name = 'Name from query below'
    ) S
    INNER JOIN msdb.dbo.sysjobhistory H ON s.job_id = H.job_id AND H.step_id = 0
WHERE H.run_date >= /* 7 days ago */cast(datepart(yyyy, dateadd(d, -7, getDate())) as VarChar(10)) + cast(datepart(mm, dateadd(d, -7, getDate())) as VarChar(10)) + cast(datepart(dd, dateadd(d, -7, getDate())) as VarChar(10)) --format getDate once to compare against multiple run_dates
GROUP BY /*S.job_id,*/ S.job_name, S.is_job_enabled, S.is_schedule_enabled, S.schedule_name, S.Description
ORDER BY S.job_name

--SELECT * FROM msdb..syscategories --If you want to uncomment the line "WHERE SC.name = ...", copy/paste from this

Roll Logs – Backup Recovery

Rolling log files for a day, especially with 15 or even 5 minute log backups is a pain at best. Here’s a slightly better way to do it. Set the two variables at the top to the directory where the log backups are and the database you’re looking to restore. It will, assuming you named your log backups as DatabaseName_Log*, display all log backups in that folder in chronological order into an output script which is best read when doing results to text (Query/Results To/Results To Text).

Here’s what this won’t do. It won’t require SQL Server to remember taking the backups, so you can run it from any server that has access to this folder. It won’t restore anything for you, it will just give you the text to copy/paste and run yourself. It won’t leave xp_cmdshell on, although it does require it to be turned on for a bit if it was off.

Since this does nothing more than display the code you’ll probably run next, feel free to run it even if you’re just playing around or only want to roll half the logs.

DECLARE @LogBackupDirectory VarChar(1000)
DECLARE @DatabaseName VarChar(256)

SELECT @LogBackupDirectory = '\\BUServer\BUShare\Folder\logs\'
    , @DatabaseName = 'model'

---Don't change anything below this point unless upgrading the script---
---Also, slight change if backup and restore database names are different---

SET NOCOUNT ON

DECLARE @XP_CmdShell_Enabled INT
    , @XP_CmdShell_Command VarChar(4000)

IF Object_ID('TempDB..#temp') IS NOT NULL BEGIN
    DROP TABLE #temp
END

create table #temp
(
      Dir VARCHAR(MAX)
)

SELECT @XP_CmdShell_Enabled = CONVERT(INT, ISNULL(value, value_in_use))
FROM master.sys.configurations
WHERE name = 'xp_cmdshell'

IF @XP_CmdShell_Enabled = 0 BEGIN
    EXEC sp_configure 'xp_cmdshell', 1
    RECONFIGURE WITH OVERRIDE 
END 

SELECT @XP_CmdShell_Command = 'dir  /od /b ' + @LogBackupDirectory + @DatabaseName + '_log*'

INSERT INTO #temp
EXECUTE xp_cmdshell @XP_CmdShell_Command

IF @XP_CmdShell_Enabled = 0 BEGIN
    EXEC sp_configure 'xp_cmdshell', 0
    RECONFIGURE WITH OVERRIDE 
END 

DELETE #temp
WHERE Dir IS NULL

SELECT commands = 'Restore Log [' + @DatabaseName + '] from DISK=N''' + @LogBackupDirectory + Dir + ''' with NoRecovery' 
FROM #temp

SELECT 'RESTORE DATABASE [' + @DatabaseName + '] WITH RECOVERY'

Table Sizes

There’s no simple way in SQL Server to see the sizes of all the tables and their indexes.  Even seeing a single table’s size through SSMS can take a while.  The solution below is grabbed from statistics, so there’s no guarantee you’re getting exact numbers.  However, exact numbers don’t tend to matter, as you care more that the table is about 1 GB, not that it’s exactly 998 MB.

There are variations of this code floating all over the place, and I tweaked the results to be in the easiest to read format possible.

The only issue I have with this script is that you have to scroll to the next-to-the-last line to filter which tables it’s pulling. Yes, I’m picky, and I don’t like to scroll to see things that change.

SELECT Database_ID = DB_ID()
    , Database_Name = DB_NAME()
    , Schema_Name = a3.name
    , TableName = a2.name
    , TableSize_MB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , RowCounts = a1.rows
    , DataSize_MB = a1.data / 128
    , IndexSize_MB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
    , Free_MB = (CASE WHEN (a1.reserved + ISNULL(a4.reserved,0)) > a1.used 
                        THEN (a1.reserved + ISNULL(a4.reserved,0)) - a1.used 
                        ELSE 0 
                    END) / 128
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> 'S' and a2.type <> 'IT'
    --AND a2.name IN ('spt_values')
ORDER BY a1.reserved desc

Index Columns and Names

Don’t trust the name of an index to tell you what it is. The name is wrong, or at least that’s what you have to assume.

I’m a huge fan of index naming conventions, with my favorite being starting out by saying it’s an index (IDX) followed by the table name, then the key columns, the letters INCL if there are included columns, each included column listed (if reasonable, just do all or nothing for each index), a U or N denoting if it’s unique or not, then a C or N denoting if it’s clustered or not. However, there are too many times I’ve seen an index naming convention get me in trouble where the index IDX_TableName_Key1_Key2_Key3_INCL_Incl1_Incl2_U_N actually not have the column Key1 in it, and it wasn’t unique either.

You can’t always trust the naming convention, even if it does exist. However, the system tables are always right. Here we have all of the key and included fields in an index, how much each index has been used (since restart of services or rebuilds in 2012+), and the size of each index. This typically takes less than a second to run in my experience, and you’ll get a feel for how it runs in your nonprod environment before it runs in prod, right?

DECLARE @TableName VarChar(100) = '%' --Accepts wildcards, % will give you all indexes in the database

SELECT TableName = i.SchemaName + '.' + i.TableName
	, IndexName = ISNULL(i.IndexName, '[' + Lower(i.IndexType) + ']')
	, i.User_Updates
	, i.User_Seeks
	, i.User_Scans
	, i.User_Lookups
	, KeyColumnList = substring((SELECT (', ' + c.name)
							FROM sys.index_columns ic
								INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
							WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
								AND ic.is_included_column = 0
							ORDER BY ic.key_ordinal
							FOR XML PATH ('')
							), 3, 2000)
	, IncludedColumnList = CASE WHEN i.IndexType IN ('Clustered', 'Heap') THEN '*' ELSE
							substring((SELECT (', ' + c.name)
							FROM sys.index_columns ic
								INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
							WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
								AND ic.is_included_column = 1
							ORDER BY c.Name
							FOR XML PATH ('')
							), 3, 2000)
						END
	, i.filter_definition
	, i.is_unique
	, i.Used_MB
	, i.Reserved_MB
	, Free_MB = i.Reserved_MB - i.Used_MB
	, i.Used_InRow_MB
	, i.Used_LOB_MB 
	, i.Row_Count
	, i.index_id
--	, DropStatement = 'DROP INDEX [' + i.IndexName + '] ON [' + i.SchemaName + '].[' + i.TableName + ']'
FROM 
	(
	SELECT SchemaName = s.name
		, TableName = t.name
		, IndexName = i.name
		, IndexType = i.type_desc 
		, i.object_id  
		, i.index_id
		, i.filter_definition
		, i.is_unique
		, User_Updates = SUM(ius.User_Updates)
		, User_Seeks = SUM(ius.user_seeks)
		, User_Scans = SUM(ius.user_scans)
		, User_Lookups = SUM(ius.user_lookups)
		, i.Used_MB 
		, i.Reserved_MB 
		, i.Used_InRow_MB
		, i.Used_LOB_MB 
		, i.row_count
	FROM (SELECT i.name
				, i.type_desc
				, i.object_id
				, i.index_id
				, i.is_unique
				, i.filter_definition
				, Used_MB = SUM(PS.used_page_count) / 128
				, Reserved_MB = SUM(PS.reserved_page_count) / 128 
				, Used_InRow_MB = SUM(PS.in_row_used_page_count) / 128
				, Used_LOB_MB = SUM(PS.lob_used_page_count) / 128
				, row_count = SUM(PS.row_count)
			FROM sys.indexes i 
				LEFT JOIN sys.dm_db_partition_stats PS ON i.object_id = PS.object_id AND i.index_id = PS.index_id
			GROUP BY i.name, i.type_desc, i.object_id, i.index_id, i.is_unique, i.filter_definition) i
		INNER JOIN sys.all_objects t ON i.object_id = t.object_id
		INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
		LEFT JOIN sys.dm_db_Index_Usage_Stats ius ON ius.object_id = i.object_id 
											AND ius.index_id = i.index_id 
											AND ius.database_id = DB_ID()
	WHERE t.name LIKE @TableName  	
	GROUP BY s.name, t.name, i.name, i.object_id, i.index_id, i.is_unique, i.type_desc, i.filter_definition, i.Used_MB, i.Reserved_MB, i.row_count, Used_InRow_MB, Used_LOB_MB
	) i
ORDER BY 1, KeyColumnList 

You may say that’s a temporary one-off inqury the doesn’t fix anything, and it is. However, the permanent fix is very invasive, will void your support contracts, can cause damage, would cause pieces of future upgrades to fail, may not work if it generates a name that’s too long, and other minor details. Assuming you have a home-grown database and absolutely no query hints specifying an index anywhere in your code, have a dev environment recently refreshed from prod, and have looked into every other issue that I never even considered, do I have some code for you!!!

SELECT Command = '--DON''T RUN THIS WITHOUT FIRST LOOKING INTO THE CONSEQUENCES AND UPDATING YOUR RESUME'

UNION 
SELECT Command = 'EXEC sp_rename ''' + TableName + '.' + IndexName + ''', ''' + 'IX_' + TableName + ColumnList + case when len(IncludeList) > 3 then '_INCL' else '' end + ISNULL(IncludeList, '') + '_' + case is_unique when 1 then 'U' else 'N' end + '_' + left(IndexType COLLATE SQL_Latin1_General_CP1_CS_AS, 1) + ''' , ''INDEX'''
FROM (
        SELECT TableName = t.name
            , IndexName = i.name
            , i.is_unique
            , IndexType = i.type_desc
            , FileGroupName = d.name
            , ColumnList = substring((SELECT ('_' + c.name)
                                    FROM sys.index_columns ic
                                        INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                                    WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                        AND ic.is_included_column = 0
                                    ORDER BY ic.key_ordinal
                                    FOR XML PATH ('')
                                    ), 1, 2000)
            , IncludeList = substring((SELECT ('_' + c.name)
                                    FROM sys.index_columns ic
                                        INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                                    WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                        AND ic.is_included_column = 1
                                    ORDER BY ic.key_ordinal
                                    FOR XML PATH ('')
                                    ), 1, 2000)
        FROM sys.tables t
            INNER JOIN sys.indexes i ON t.object_id = i.object_id
            INNER JOIN sys.data_spaces d ON i.data_space_id = d.data_space_id
        WHERE t.name IN (
                  'Table_1'
                , 'Table_2')
    )x
ORDER BY 1

Running Processes

Sp_who2 is not used by this DBA.  That info is amazingly good to have, but I want more.  Don’t get me wrong, sp_who2 is kinda hidden in this one as it’s really just a proc that does little more than “SELECT * FROM SysProcesses“, and I use SysProcesses where I couldn’t find the same info in the DMVs that were supposed to replace it.  Don’t go thinking that SysProcesses is just a depreciated dinosaur, even MVPs still use it: http://sqlblog.com/blogs/kalen_delaney/archive/2008/06/29/why-i-still-need-sysprocesses.aspx

Threads and RunningThreads are new for me here.  It’s good to see when a query is split into 100 threads and only 1 is doing anything.  It’s also easier on the eyes to see the number 100 instead of seeing 100 separate rows for one SPID.

The percent complete and estimated completion time on this script are typically blank, with the major exceptions being backups and restores where these columns are the best thing ever! 

BlockedBy is something overlooked by many people.  If your query seems to be running forever when it normally runs quick, there’s about a 75% chance you’re just waiting for another process to complete. It’s also why you’ll see the notorious (nolock) in almost everything I write for myself, but that’s another conversation.  The HeadBlocker attempts find the start of a blocking chain, although I can say that I know my logic here isn’t bulletproof.

My wait types used to be perfect where if there wait time was 0 then it was lowercase, otherwise it was all caps.  It seems I broke it slightly, but it’s almost perfect.  My OCD will eventually get the best of me here, but it has to contend with time with the rest of my life.

The rest is pretty self-explanitory with the exception of query plans.  They are the last line of the SELECT and FROM clauses that are commented out.  It’s somewhat rare to need the plan, and it’s not too uncommon to have this script run noticably slower when looking for plans.  Because of that, I uncomment that column and the the cross apply that makes it possible only when needed.

SELECT SPID = er.session_id
    , ot.Threads
    , RunningThreads = coalesce(rsp.RunningThreads,0)
    , Pct_Comp = er.percent_complete
    , Est_Comp_Time = CASE er.estimated_completion_time WHEN 0 THEN NULL ELSE dateadd(ms, er.estimated_completion_time, getdate()) END 
    , er.status
    , er.command
    , database_name = sd.name
    , BlockedBy = wt.blocking_session_id
    , HeadBlocker = coalesce(hb5.session_id, hb4.session_id, hb3.session_id, hb2.session_id, hb1.session_id)
    , wait_type = coalesce(CASE er.wait_type WHEN 'CXPACKET' THEN 'CXPACKET - ' + sp.lastwaittype1 ELSE sp.lastwaittype1 END, lower(er.last_wait_type)) --Lowercase denotes it's not currently waiting, also noted by a wait time of 0.
    , Wait_Time_Sec = Cast(er.wait_time/1000.0 as DEC(20,3))
    , er.wait_resource
    , Duration_Sec = Cast(DATEDIFF(s, er.start_time, GETDATE()) as DEC(20,0))
    , CPU_Sec = Cast(er.cpu_time/1000.0 as DEC(20,3))
    , Reads_K = Cast(er.reads/1000.0 as DEC(20,3))
    , Writes_K = Cast(er.writes/1000.0 as DEC(20,3))
    , [Statement] = SUBSTRING (st.text, er.statement_start_offset/2,
        abs(CASE WHEN er.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), st.text)) * 2 
            ELSE er.statement_end_offset END - er.statement_start_offset)/2)
    , st.text as Query
    , es.login_time
    , es.host_name
    , program_name = CASE LEFT(es.program_name, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(es.program_name,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(es.program_name, 67, len(es.program_name)-67)
                    ELSE es.program_name
                    END  
    , es.client_interface_name
    , es.login_name
    , es.status
    , es.total_scheduled_time
    , es.total_elapsed_time
    , er.start_time
    , es.last_request_start_time
    , es.last_request_end_time
    , er.database_id  
    --, qp.query_plan 
FROM sys.dm_exec_requests er
    INNER JOIN sys.dm_exec_Sessions es on er.session_id=es.session_id
    LEFT JOIN sys.databases sd on er.database_id=sd.database_id
    INNER JOIN (SELECT session_id, count(1) Threads FROM sys.dm_os_tasks GROUP BY session_id) ot on er.session_id=ot.session_id
    LEFT JOIN (SELECT spid, LastWaitType1 = MIN(lastwaittype), LastWaitType2 = MAX(lastwaittype) FROM sysprocesses sp WHERE waittime > 0 AND lastwaittype <> 'cxpacket' GROUP BY spid) sp ON er.session_id = sp.spid
    LEFT JOIN (SELECT spid, RunningThreads = COUNT(1) FROM sysprocesses sp WHERE waittime = 0 GROUP BY spid) rsp ON er.session_id = rsp.spid
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt WHERE wt.blocking_session_id <> wt.session_id GROUP BY session_id) wt ON er.session_id = wt.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb1 ON wt.blocking_session_id = hb1.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb2 ON hb1.blocking_session_id = hb2.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb3 ON hb2.blocking_session_id = hb3.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb4 ON hb3.blocking_session_id = hb4.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb5 ON hb4.blocking_session_id = hb5.session_id 
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st  
    --CROSS APPLY sys.dm_exec_query_plan(er.plan_handle) qp
WHERE er.session_id <> @@SPID
    --AND es.host_name like '%%'
    --AND er.session_id = 2702
ORDER BY er.percent_complete DESC, er.cpu_time DESC, er.session_id

--Use the below command to get the last input of an open session id
--dbcc inputbuffer(61)

Backup History

Can you tell me the drive we were backing up server #58 to exactly 8 months ago, and, if possible, it would really help if you could give me the exact file name for me to pull the backup from tape for you.  Or, last night’s backup didn’t finish until after the tapes started, did those backups always take that long and is it taking longer because the databases are actually growing that much?

The answer to those questions tends to be “Ok, no problem.”

SELECT TOP 1000 bs.database_name
	, bs.backup_start_date
	, bs.backup_finish_date
	, backup_duration_minutes = (DateDiff(second, backup_start_date, backup_finish_date)+30)/60
	, backup_duration_seconds = DateDiff(second, backup_start_date, backup_finish_date)
	, backup_size_mb = Cast(bs.backup_size / 1024 / 1024 as Int)
	--, compressed_backup_size_mb = Cast(bs.compressed_backup_size / 1024 / 1024 as Int) --SQL 2008+ only
	, backup_type = CASE bs.type WHEN 'D' THEN 'FULL' WHEN 'I' THEN 'Diff' WHEN 'L' THEN 'Log' ELSE 'UNKNOWN' END
	, bmf.physical_device_name
	, EntryType = CASE WHEN bs.server_name = @@ServerName THEN 'Backup' ELSE 'Refresh' END
FROM msdb..backupset bs
	INNER JOIN msdb..backupmediafamily bmf ON bs.media_set_id = bmf.media_set_id
WHERE bs.type <> 'L'
	bs.database_name like  'test_case'
ORDER BY 2 DESC

It’s common in my scripts for me to put 10 things in the where clause and comment them out. My problem is that I don’t want to look around for field names to filter my scripts, but I don’t always want to filter my scripts the same way. It works for me, and you’ll either get used to it or rewrite it for your own use.

Cluster Info

This script is so simple you’ll start off by asking why you’d ever use it.  Then you’ll use it and wonder why no one gave this to you as part of your new-hire orientation at your first DBA job.  The truth is that in larger environments there will always be times when you’re asked what the active node is, are two instances running on the same node, what nodes are in that cluster, what drives belong to what instance, etc…  Here’s your answer.

SELECT VirtualServerName = SERVERPROPERTY('ServerName')
    , ActiveNode = SERVERPROPERTY('ComputerNamePhysicalNetBIOS') 
    , PassiveNode = CASE (SELECT COUNT(*) FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS'))
        WHEN 0 THEN 'Stand-Alone'
        WHEN 1 THEN (SELECT NodeName FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS'))
        ELSE Cast((SELECT COUNT(*) FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS')) as VarChar(2)) + ' Passive Nodes'
        END

SELECT PassiveNodes = NodeName 
FROM sys.dm_os_cluster_nodes 
WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS') 

SELECT ClusterDrives = DriveName 
FROM sys.dm_io_cluster_shared_drives
ORDER BY DriveName

File Sizes

There are two scripts I like to use to look at the file sizes on servers.  The first one is everything I could hope for on servers where there is only a single file per filegroup.  All of the vital information is all in one place.  I know, DBAs tend to be good at math and there was no need for me to make all of these columns.  However, any room I leave to demonstrate my mathematical powers in a professional environment…. Ok, look, CPU time is cheap anymore, I’ll go with that for my reasoning.

IF Object_ID('TempDB..##FileSize') IS NOT NULL BEGIN
    DROP TABLE ##FileSize
END

CREATE TABLE ##FileSize
(      DB VarChar(128)
    , FileLogicalName VarChar(128)
    , FilePhysicalName VarChar(512)
    , FileGroup VarChar(128)
    , Used_MB VarChar(128)
    , Free_MB VarChar(128)
    , Size_MB VarChar(128)
    , GrowthRate VarChar(128)
    , MaxSize VarChar(128)
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO ##FileSize
SELECT DB = db_name()
    , FileLogicalName = f.name
    , FilePhysicalName = f.physical_name
    , FileGroup = ISNULL(g.name, f.Type_Desc) 
    , Used_MB = REPLACE(CONVERT(varchar(100), (CAST((FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
    , Free_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size - FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
    , Size_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size) / 128 AS money)), 1), ''.00'', '''')
    , GrowthRate = Case Is_Percent_Growth
                    WHEN 1 THEN '''' + Cast(Growth as VarChar(100)) + ''%''
                    ELSE REPLACE(CONVERT(varchar(100), (CAST(Growth/128 AS money)), 1), ''.00'', '''') + '' MB''
                    END 
    , MaxSize = Case Max_Size
                    WHEN -1 THEN ''---''
                    WHEN 268435456 THEN ''---''
                    ELSE REPLACE(CONVERT(varchar(100), (CAST(Max_Size/128 AS money)), 1), ''.00'', '''') + '' MB''
                    END
FROM sys.database_files f
    LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
'

SELECT *
    , PctUsed = Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
    --, PctFree = 100 - Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
FROM ##FileSize
--WHERE DB = 'msdb'
ORDER BY FilePhysicalName

DROP TABLE ##FileSize

The problem with that script is that there are times it makes sense to have multiple files in a filegroup.  Be it disk limitations, best practices on a SAN saying not to extend a LUN, or your predecessor tossed logic to the wind, there are many reasons you could have multiple files where the previous script leaves you doing math to figure out how big the filegroup is or how much room for growth you have.  You know what I said about doing math…CPU time is cheap.  Here’s what I use when I need to see the filegroup sizes.

IF Object_ID('TempDB..##FileGroupSize') IS NOT NULL BEGIN
    DROP TABLE ##FileGroupSize
END

CREATE TABLE ##FileGroupSize
(      DB VarChar(128)
    , FileGroup VarChar(128)
    , Used_MB Int
    , Free_MB Int
    , Size_MB Int
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO ##FileGroupSize
SELECT DB = db_name()
    , FileGroup = ISNULL(g.name, f.Type_Desc) 
    , Used_MB = SUM(FileProperty(f.name, ''SpaceUsed'')) / 128
    , Free_MB = SUM(f.size - FileProperty(f.name, ''SpaceUsed'')) / 128
    , Size_MB = SUM(f.size) / 128 
FROM sys.database_files f
    LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
GROUP BY f.Type_Desc, g.name
'

SELECT *
    , PctUsed = Cast(100 * Used_MB / (Cast(Size_MB as Dec(20,2)) + .01) as Dec(20,2))
    , PctFree = 100 - Cast(100 * Used_MB / (Cast(Size_MB as Dec(20,2)) + .01) as Dec(20,2))
FROM ##FileGroupSize
ORDER BY DB, FileGroup

DROP TABLE ##FileGroupSize

It should be noted that I have different rules for different situations.  If a script is part of a stored proc or view then * isn’t even a consideration.  If an external program, even an SSRS report, will consume the code then I would never try to format it using SQL.  However, these are saved in my scripts folder as FileSize.sql and FilegroupSizes.sql, and my personal .sql files get to break a rule or two.

Database Assessment

This script looks much more intimidating than the results it produces, so I recommend running it before reading through it.  I’ll run this when I first get on a unfamiliar server as a basic assessment of what’s there and find some hard to find issues with no effort.

The issues you’ll find include:

  • DBCC CheckDB – When is the last time this was successfully run.  You need to know this, but it’s not as easy to find as it should be.  Personally, I feel it should be in sys.databases, but I’ll settle for having this script saved to find it for me.
  • Backups – Where are they, when did they last occur, how big, etc.
    • I always look for databases (excluding model) that are in full or bulk logged recovery with no log backups.  This is a common issue, especially with vendor databases, that will fill your drives with data you didn’t care to have.
      • To know how critical it is when you find this, and you probably will, I include the log size and percent full.
    • There are variables in this script for a cutoff date which will ignore old backups. If you set @use_cutoffdate to 1 then any database not backed up since the value of @cutoffdate will show up as never being backed up. I love this feature because there’s no chance that you’ll misread a year-old backup as being taken last night.
    • Warning, the size of the backup is the size of the data, not the backup file.  To keep this compatible with SQL 2005, I’m not grabbing the compressed_backup_size field.
    • Differential backup results are commented out for a reason…most people don’t use them.  There is absolutely no reason to comment this out other than limiting the number of columns in my results.
  • Database Size – Both allocated and used space.  It’s best that you store this and trend the data somewhere, such as on my post Monitoring Database and Table Sizes, but having a static value is a start.
  • Compatibility Level – If you upgrade a server it’s easy to leave a database behind.  I’ve seen SQL 2000 compatibility on SQL 2008 R2 for no other reason than “oops”.
  • Collation name – If this isn’t consistent you could run into issues that are next to impossible to debug.  The exceptions, ironically, are Microsoft databases.  Report Server and SharePoint databases specifically will be a different collation than anything else, but they were written with that in mind and handle it gracefully and shouldn’t be changed.
  • Auto Close – Ok, I admit it, I actually have this one commented out in my personal copy of this code.  If this is turned on it will typically blow up your error logs and make them half-way unreadable, and I’ll see this when I attempt to read them.  However, it’s good to know if this is ever turned on as it will kill performance.  I should uncomment this in my version with the mentality that too much information will slightly annoy you, too little will kill you.
Especially with most of us running more vendor databases than home-grown, don’t trust that this stuff was done right.
DECLARE @cutoffdate datetime
DECLARE @use_cutoffdate bit
SET @cutoffdate = DateAdd(Month, -3, GetDate())
SET @use_cutoffdate = 0

IF Object_ID('TempDB..#FileGroupSize') IS NOT NULL BEGIN
	DROP TABLE #FileGroupSize
END

IF Object_ID('TempDB..#DBCC_Temp') IS NOT NULL BEGIN
	DROP TABLE #DBCC_Temp
END

IF Object_ID('TempDB..#DBCC') IS NOT NULL BEGIN
	DROP TABLE #DBCC
END

CREATE TABLE #FileGroupSize
(	  DB VarChar(128)
	, FileGroup VarChar(128)
	, Used_MB Int
	, Free_MB Int
	, Size_MB Int
)

CREATE TABLE #DBCC_Temp 
(
	ParentObject VARCHAR(255),
	[Object] VARCHAR(255),
	Field VARCHAR(255),
	[Value] VARCHAR(255)
)

CREATE TABLE #DBCC
(
	DBName SysName 
	, DBCC_LastGood SmallDateTime
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO #FileGroupSize
SELECT DB = db_name()
	, FileGroup = ISNULL(g.name, f.Type_Desc) 
	, Used_MB = SUM(FileProperty(f.name, ''SpaceUsed'')) / 128
	, Free_MB = SUM(f.size - FileProperty(f.name, ''SpaceUsed'')) / 128
	, Size_MB = SUM(f.size) / 128 
FROM sys.database_files f
	LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
--WHERE f.Type_Desc = ''LOG''
GROUP BY f.Type_Desc, g.name

TRUNCATE TABLE #DBCC_Temp

INSERT INTO #DBCC_Temp
EXECUTE(''DBCC PAGE ([?], 1, 9, 3)WITH TABLERESULTS'')

INSERT INTO #DBCC
SELECT ''?'', NULLIF([Value],''1900-01-01 00:00:00.000'')
FROM #DBCC_Temp
WHERE Field = ''dbi_dbccLastKnownGood''
'

SELECT database_name = d.name 
	, d.create_date
	, d.compatibility_level
	, #DBCC.DBCC_LastGood
	, LastFullBU = BUFull.last_backup  
	, LastDiffBU = BUDiff.last_backup  
	, LastLogBU = BULog.last_backup  
	, Recovery_Model = d.recovery_model_desc
	, Log_Size_MB = LogSize.Size_MB
	, Log_Pct_Used = Cast(100 * LogSize.Used_MB / (Cast(LogSize.Size_MB as Dec(20,2)) + .01) as Dec(20,2))
	, Data_Size_MB = DataSize.Size_MB
	, Data_Used_MB = DataSize.Used_MB
	, d.collation_name
	, LastFullLoc = BUFullLoc.physical_device_name
	, LastFullSize_MB = Cast(BUFullLoc.backup_size / 1024 / 1024 as bigint)
	--, LastDiffLoc = BUDiffLoc.physical_device_name
	--, LastDiffSize_MB = Cast(BUDiffLog.backup_size / 1024 / 1024 as bigint)
	, LastLogLoc = BULogLoc.physical_device_name
	, LastLogSize_MB = Cast(BULogLoc.backup_size / 1024 / 1024 as bigint)
	, d.is_auto_close_on
FROM master.sys.databases d with (nolock)
	LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
		FROM msdb..backupset b with (nolock)
		WHERE b.type = 'D' 
			AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
		GROUP BY b.database_name
		) BUFull ON d.name = BUFull.database_name 
	LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
		FROM msdb..backupset b with (nolock) 
		WHERE b.type = 'I' 
			AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
		GROUP BY b.database_name
		) BUDiff ON d.name = BUDiff.database_name 
	LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
		FROM msdb..backupset b with (nolock) 
		WHERE b.type = 'L' 
			AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
		GROUP BY b.database_name
		) BULog ON d.name = BULog.database_name 
	LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
		FROM msdb..backupset b with (nolock)
			INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id
		WHERE b.type = 'D') BUFullLoc ON BUFull.database_name = BUFullLoc.database_name AND BUFull.last_backup = BUFullLoc.backup_finish_date
	/*LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
		FROM msdb..backupset b with (nolock)
			INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id
		WHERE b.type = 'I') BUDiffLoc ON BUDiff.database_name = BUDiffLoc.database_name AND BUDiff.last_backup = BUDiffLoc.backup_finish_date
	*/
	LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
		FROM msdb..backupset b with (nolock)
			INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id 
		WHERE b.type = 'L') BULogLoc ON BULog.database_name = BULogLoc.database_name AND BULog.last_backup = BULogLoc.backup_finish_date
	LEFT JOIN (SELECT Size_MB, Used_MB, DB 
				FROM #FileGroupSize 
				WHERE FileGroup = 'Log') LogSize ON d.name = LogSize.DB
	LEFT JOIN (SELECT Size_MB = SUM(Size_MB)
					, Used_MB = SUM(Used_MB)
					, DB 
				FROM #FileGroupSize 
				WHERE FileGroup <> 'Log' 
				GROUP BY DB) DataSize ON d.name = DataSize.DB
	LEFT JOIN #DBCC ON d.name = #DBCC.DBName 
WHERE d.name <> 'tempdb'
	AND d.state = 0 --Online

DROP TABLE #FileGroupSize
DROP TABLE #DBCC_Temp
DROP TABLE #DBCC