Powershell, when were my servers rebooted?

One of my servers wasn’t rebooted when it should have been and I never realized it until after the outage was over.  The big deal wasn’t that it didn’t reboot, it was that I never realized it.  These are the things that bug me, so I used it as an excuse to learn a bit of PowerShell.

A little backstory to give the details a better chance of making sense to everyone…  Instead of having a couple overworked sys admins doing windows updates for every server, we each take care of our own with a sys admin checking that all updates were applied.  It’s the first I’ve heard of it being done this way, but I like it.

To keep things consistent, the database team has a policy of rebooting all database servers regardless of WSUS requiring it.  One month most of the servers required a reboot and bounced automatically, except for one that didn’t require a reboot.  After the first 10 or 15 rebooted, I got lazy and started just verifying the servers were up, SQL was responding, all updates were applied, and there was no pending reboot.

It wasn’t until I was reading the logs the next business day I realized that that server was missing all of the chatter from SQL Services starting up.  Like I said, the biggest issue was me being frustrated that it snuck by me, and I didn’t want that to happen again.  I wanted my solution to be independent of SQL Server because not all of the computers I do updates on have a SQL engine.

Wanting an excuse to learn PowerShell better, I talked to our Exchange guy (they know their PowerShell) and got a basic script to determine a computer’s boot time given its name.  I expanded that to be an entire AD group, then a list of AD groups, and finally using wildcards to find AD groups.  Wildcards work best for me because I’m reorganizing my SUS groups and all of mine have “SQL” in the groups name, even for the app servers I’m responsible for.

That resulted in the following code.  Now I’m absolutely certain that a PowerShell or AD guru can poke a ton of holes in my logic, but that’s what we have gurus for!  To be fair, I’d do the same to their T-SQL.  If any of those gurus are out there reading this DBA’s blog, the comment section below is how you fix my code and stop me from spreading bad code like a virus.

Import-Module ActiveDirectory

#either method works for getting a list of groups.  You can type in all of your groups or make a query to find them all.  I prefer the query if possible.

#$groups = "Test SQL Servers", "Prod SQL Servers 1", "Prod SQL Servers 2", "SQL Cluster Servers 1", "SQL Cluster Servers 2"

$groups = Get-ADGroup -Filter {name -like "*SQL *"} | where-object {$_.distinguishedname -like "*OU=SUS Group*"}

ForEach ($group in $groups) {
    $computerlist = Get-ADGroupMember $group -Recursive | SELECT name 

    #If there were any computers in the list, do this.  It skips empty groups.
    if ($computerlist) {
        $computerlistcount = $computerlist.Count

        #It returns an object instead of an array if there was only one, so count would be null
        if (!$computerlistcount) {
            $computerlistcount = 1
        }

        Write-Host ""
        Write-Host "$computerlistcount server(s) in" $group.name
    }

    ForEach ($computer in $computerlist) {
        try {
            Get-WmiObject win32_operatingsystem -ComputerName $computer.name -ErrorAction Stop | select csname, @{LABEL='LastBootUpTime';EXPRESSION={$_.ConverttoDateTime($_.lastbootuptime)}}
        }
        catch {
            #This logic is lacking.  I don't care what error you got, just say you couldn't connect and we'll move on.
            #For this script, it probably means the server is in the middle of a reboot
            Write-Host "Could not connect to" $computer.name
            Continue
        }
    }
}
 

There are relatively equivalent things you can do in SQL Server, but that’s assuming you’re running a SQL Server engine on every server you care about.  Even from the viewpoint of a DBA running the SUS updates on my own servers, this doesn’t hold up.  I still care about my monitoring application server and several others.

That being said, here’s what SQL can do for you:

SELECT create_date FROM sys.databases WHERE database_id = 2

exec xp_cmdshell 'wmic os get lastbootuptime'

There’s also some more bad news with these beyond requiring the SQL Server engine.

The create date of TempDB is useful, but it’s created on startup of SQL Services, not the computer.  This means if you restart SQL Services then you get a new date, even if you specifically state that you want to restart every computer during your SUS outage whether it’s required or not.

The WMIC call gets around that and tells you when the server came up, but it’s relying on xp_cmdshell.  This extended procedure has a bad rep for being a security loophole, even though by default to run the thing you need enough permissions to be able to turn it on.  Because of that, it’s turned off in many environments and flipping it on for a second can be frowned upon.

File Growths

I hinted at file growths in my previous posts about shrinking data and log files.  Then I talked about growing log files in my post Database Log VLFs.  However, there are still some unanswered questions about growing data files.

Should I use autogrowth?  What should I set it to grow by?

Why do manual growths, and how do I know how much to grow by or when a growth is needed?

Can I and should I use Instant File Initialization?  If I can and should, how do I even know if it’s turned on?

Leave Autogrowth On?

Yes, absolutely.  This isn’t a question.

This could be worded as “Do you want a slight pause or do you want your application to fail?”  The answer is a little more obvious when you word it that way.  The only time I’ve seen an appropriate answer of “No” is when you have multiple data files in a filegroup and only want one to grow.

To answer how long this pause will be you have to ask these two questions.  Are you able to use Instant File Initialization?  How much are you growing by if you can’t use Instant File Initialization?  I’ll discuss Instant File Initialization later in this post, but it’s basically the difference between “this space is yours immediately” instead of writing zeros SQL Server doesn’t care about to be paranoid about security.

There are competing factors as to how much you want to autogrowth to be.  The only real consensus is that the defaults are horrible.  You need to balance out how long of a pause it will take for a growth, how long of a pause your users will tolerate, how fast is your database growing, how much fragmentation smaller growths will cause, and how much fragmentation you will tolerate.

By default, data files grow at 1 MB and log files grow at 10%.  I covered log file growths rather well in my post Database Log VLFs, so we’ll focus on the data files here.  Think about it, if you have a 100 GB database that got there by the defaults, it got there in 102,400 pieces.  Not good.

If the defaults are horrible, what’s good?  Tiny is bad, huge is bad, and percentage-based growth just lets you automatically drift from tiny to huge.  Do your balancing act I talked about above.  If growing by 64 MB will keep you good for several weeks, set it to 64 MB.  If 1 GB goes by rather quickly, you can’t use Instant File Initialization, and the users can tolerate the pause it takes to grow by 1 GB, then this might be the best growth rate for you.  No single answer is right, but at least you have a better idea of what the variables are.

There are also cases where there are no good answers.  Say you have an app wouldn’t be uncommon to grow by a couple GB a day, you can’t use Instant File Initialization, and pauses of more than a second wouldn’t go over very well.  You don’t want the fragmentation of setting the growth to 64 MB, but you don’t want the pause of setting it to 1 GB.  Maybe you’ll settle for 256 MB, but you’re not happy about it.  Manual growths are the real answer, but you want a safety net that won’t kill you with pauses or fragmentation.  Not everyone is in that situation, but some of us are.

If I’m going to suggest leaving Autogrowth on and suggest that you be aware of the size of these growths, then the least I can do is give you a script for that.  This will tell you the size of every database file and the autogrowth rate.  I’ll often use it to tweak the growth rates, so I script that change out in the results as well.  If you read the post Database Log VLFs then it will make more sense why I use 1,000 MB here instead of 1,024 MB; it wasn’t a typo.

DECLARE @FilterByDB VarChar(100)
	, @Cmd NVarChar(4000)

SELECT @FilterByDB = ''
--SELECT @FilterByDB = 'tempdb'

IF Object_ID('TempDB..##FileSize') IS NOT NULL BEGIN
	DROP TABLE ##FileSize
END

CREATE TABLE ##FileSize
(	  DB VarChar(128)
	, FileLogicalName VarChar(128)
	, FilePhysicalName VarChar(512)
	, FileGroup VarChar(128)
	, Used_MB VarChar(128)
	, Free_MB VarChar(128)
	, Size_MB VarChar(128)
	, GrowthRate VarChar(128)
	, MaxSize VarChar(128)
)


SELECT @Cmd =  N'use [?]; 

IF ''?'' like ''' + @FilterByDB + '%'' OR ''' + @FilterByDB + ''' = '''' BEGIN 
INSERT INTO ##FileSize
SELECT DB = db_name()
	, FileLogicalName = f.name
	, FilePhysicalName = f.physical_name
	, FileGroup = ISNULL(g.name, f.Type_Desc) 
	, Used_MB = REPLACE(CONVERT(varchar(100), (CAST((FileProperty(replace(f.name, '''', ''''''''), ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
	, Free_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size - FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
	, Size_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size) / 128 AS money)), 1), ''.00'', '''')
	, GrowthRate = Case Is_Percent_Growth
					WHEN 1 THEN '''' + Cast(Growth as VarChar(100)) + ''%''
					ELSE REPLACE(CONVERT(varchar(100), (CAST(Growth/128 AS money)), 1), ''.00'', '''') + '' MB''
					END 
	, MaxSize = Case Max_Size
					WHEN -1 THEN ''---''
					WHEN 268435456 THEN ''---''
					ELSE REPLACE(CONVERT(varchar(100), (CAST(Max_Size/128 AS money)), 1), ''.00'', '''') + '' MB''
					END
FROM sys.database_files f
	LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id

END
' 
IF @FilterByDB <> '' AND 100 < (SELECT COUNT(1) FROM sys.databases) BEGIN
	SELECT @Cmd = REPLACE(@Cmd, '?', @FilterByDB)

	EXEC (@Cmd)
END ELSE BEGIN
	exec sp_MSforeachdb @Cmd 
END

SELECT *
	, PctUsed = Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
	, SetGrowth64MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 64MB )'
	, SetGrowthTo256MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 256MB )'
	, SetGrowthTo1000MB = 'ALTER DATABASE [' + db + '] MODIFY FILE ( NAME = N''' + FileLogicalName + ''', /*Size = MB,*/ FILEGROWTH = 1000MB )'
FROM ##FileSize
ORDER BY 1, 4, /*cast(replace(size_mb, ',', '') as int) DESC,*/ FilePhysicalName

DROP TABLE ##FileSize

Manual Growths

If you want to be extreme about it, autogrowth is like asking someone about wearing a seatbelt.  Do you want it to hurt a little or a lot if you “crash” by running out of allocated space?  Well, you want it to hurt…wait…why do I have to crash?

This may seem over-the-top, but some of the applications may have databases like the last one I talked about in the autogrowth section.  Typically we’d prefer not to have autogrowth handling everything to avoid the pause while it grows, but there are cases like that where it’s more than a slight preference to avoid the pause.

So the autogrowths are our seatbelts we’re putting on just in case while catching everything with manual growths is our real goal of avoiding that “crash”.  However, how are you supposed to know when to grow a file or how much to grow it?  Seeing how large your files are now doesn’t tell you how much they’ll grow in the next couple months, they could be static or growing exponentially.

The partial answer is to say how much data was used before, how much data is being used now, and predict how much data will be used in the future using that trend.  A better answer is to say what the peak usage was over a week and compare that to past weeks, which would account for things that cause spikes in file usage such as index rebuilds or data loads.

If you have monitoring software, it may have that answer just waiting for you to query it.  That’s the situation I find myself in right now, and I wrote Database Growth Trends – Idera Diagnostic Manager talking about how I handle it right now.  However, I used to work with monitoring software that pretty much only told me OS-level info, so we watched this data manually as was discussed in my post Monitoring Database and Table Sizes.

To be clear, many databases are fine with autogrowth handling everything.  Do what’s best for you for each individual database.  I personally have a mix of databases where some I’d be upset if autogrowth kicked in because it would cause timeouts that the app handles extremely poorly, and others, such as my DBA database, where a pause would only affect my collection jobs or a query I’m personally running.  Autogrowth is not inherently evil, so only be paranoid where paranoia is called for.

Performing Manual Growths

Manually growing the file is just setting a new initial size.  To do that in SSMS, right-click on a database, properties, files, then change the initial size to how large you want it to be.

I’m not a big fan of having SSMS do things for me, so I’ll use “Script Action to New Query Window” and run it form there.  This will give me a script like this:

USE [master]
GO
ALTER DATABASE [DBA] MODIFY FILE ( NAME = N'DBA_Data', SIZE = 256000KB )
GO

One thing I wish I knew about years before I did is that the size in KB also accepts sizes in MB.  I write just about everything in MB, so it’s more natural for me to write that script with Size = 250MB instead.  I know I’m being petty, but I’m petty and comfortable.

This script could take a couple milliseconds or several seconds, even for 250MB.  It all depends on if you have Instant File Initialization turned on and if you’re able to use it for this growth.

Instant File Initialization – What is it?

Short answer, this makes growths exponentially quicker by opening an often-acceptable small security hole.

Of course, I always have a long answer…

By default, all space used by SQL Server is zeroed out before control is handed over.  This makes sure that it’s clean space for two basic reasons.  First, the database requires that space to be clean because of how it plans to use it.  Second, we don’t want to risk the possibility of special commands being able to pull back unencrypted sensitive data that used to be there.

Now the first reason we don’t have much control or say over at this point in the conversation.  If you’re growing a log file then it needs to be zeroed out, and SQL Server will zero out each VLF when it’s marked for reuse, too.  For data files, if you’re using Transparent Data Encryption (TDE) then the way SQL Server stores the data also requires the space to be zeroed out.  These things are outside of the conversation on file growths, it’s just what we’ll have to deal with at this point and SQL Server won’t use IFI even if it’s turned on.

The one thing that will be pulled into this discussion is the security aspect of using uninitialized space.  Kimberly Tripp demonstrates this the best in her post Instant Initialization – Why, Why, and How? in the “Why isn’t this on by default?” section.

I’ll try to make it short and sweet if you don’t want to read yet another post (I recommend you read it). You can ask SQL Server to read specific pages off the disk.  If SQL Server never actually used that page then it contains what was there when it was given to SQL Server.  Zero initialized disks are very boring info, just “0000000000….”.  Not much fun.

Uninitialized space is more fun (depending on your meaning of the word) and could contain old data.  If you had unencrypted sensitive data sitting there before, it could contain enough to steal someone’s identity.  You never know.  These commands will only work if you have the permissions to run them, but who has those permissions and how much do you trust them?  Most organizations say it’s fine, but some won’t.

Instant File Initialization – How do I use it?

This is an odd feature that you can’t just see in a table in SQL Server.  The easiest way is to run an xp_cmdshell command and look for it, but you can optionally turn on trace flags, grow or create a database, turn off the trace flags, then look in your error log.

Paul Randal gets into both methods in his post How to tell if you have instant file initialization enabled?.  I don’t have any issues with xp_cmdshell, so I’ll let you go to Paul’s blog if your company insists on the more difficult method.

As for me, I’ll run the query below as a multiserver query in SSMS and look at the results.  If that sounds foreign to you, check out my Multiserver Queries post, it’s a very simple and useful technique for specific tasks such as this one.  In fact, I wrote that post so I could link to it in this post.

However, even if IFI is turned on, it’s only going to work for data files that aren’t encrypted with TDE.  Logs and TDE data files need the zeros.  Also, if any database on your instance is using TDE then so is TempDB, which means its data files can’t use IFI either.

CREATE TABLE #Output
(
	DataOut VarChar(2000)
)

DECLARE @XP_CmdShell_Enabled int

SELECT @XP_CmdShell_Enabled = CONVERT(INT, ISNULL(value, value_in_use))
FROM master.sys.configurations
WHERE name = 'xp_cmdshell'

IF @XP_CmdShell_Enabled = 0 BEGIN
	EXEC sp_configure 'show advanced options', 1
	RECONFIGURE WITH OVERRIDE 
	EXEC sp_configure 'xp_cmdshell', 1
	RECONFIGURE WITH OVERRIDE 
END 

INSERT INTO #Output 
EXEC xp_cmdshell 'whoami /priv'

IF @XP_CmdShell_Enabled = 0 BEGIN
	EXEC sp_configure 'xp_cmdshell', 0
	RECONFIGURE WITH OVERRIDE 
END 

SELECT IFI_Enabled = COUNT(1) 
FROM #Output 
WHERE DataOut LIKE '%SeManageVolumePrivilege%Enabled%'

DROP TABLE #Output

If IFI is turned off and it’s proper to turn it on for this server, here are your steps to turn it on:

  1. Open Administrative Tools / Local Security Policy
  2. Local Policies
  3. User Rights Assignment
  4. Open “Perform volume maintenance tasks”
  5. Add the user for the SQL Server service account

If IFI can’t be used, here are your steps (which are also good even if IFI is being used):

  1. Monitor database sizes for manual growths
  2. Grow files manually

The SQL Server service account, if you don’t know which account it is, can be found in Administrative Tools / Services / SQL Server (InstanceName) under “Log On As”.  There are easier ways find the service account name in bulk if you’re doing this a lot, such as a slight variation of the query above.

xp_cmdshell Rant

Now, to be clear on xp_cmdshell which I used in the script above…. I do believe in security, but I don’t believe disabling this feature is the security you’re looking for.  When it’s set up properly, disabling xp_cmdshell is like locking the door to your data center’s supply closet with a keyless lock instead of limiting access to the data center.

I do have issue with granting execute permissions on this to non-sysadmin accounts.  I also have issue with granting sysadmin to anyone who asks, including vendors who insist that’s the only way their app will work.  This means the only users that can use it are the very, very few with sysadmin.  Also, any sysadmin can turn it on anyways, so the only extra security you get is a message in the error log saying it was turned on, but not even who did it.

My recommendation is that if you need it, leave it on.  If you don’t, leave it off.  If you have a one-time script that needs it, such as checking for IFI, turn it on just long enough to do what you need to do then automate turning it off.  The script follows this recommendation; it’s left on if it was already on, but turns it off if it was off.

It’s time for me to admit I’m going off on a tangent, so I’ll pass you off to Sean McCown’s (b|tSecurity Theater post if you’re too fired up to stop now.

In Summary

Autogrowth – Yes.  Just as a fail-safe if performance is important to you.  Limit the size to limit the performance impact, but don’t cause undue fragmentation.

Manual Growth – Yes.  Trend how fast you’re growing so you know how to stay ahead of it.

Instant File Initialization – Usually.  Understand the limitations and security risks, and turn it on if it’s the right thing to do.

Entry-Level Content Challenge

This is my fourth post in Tim Ford’s Entry Level Content Challenge.  I went through much of my career with people saying to manually grow files, but it was left at that.  I had no clue how to see how much space was used without clicking through the GUI, and passed it off as being ridiculous.  Now I found ways to make it easy, and I’m betting there’s a lot of people out there doing too much work or skipping this simply for not knowing how.

Read over Tim’s challenge and consider joining in.  It’s a great way to help people get past barriers you had in the past while learning how to clear them a little more efficiently yourself.  Anyone who’s reading this post already knows how to clear some barriers that others are viewing as an impassible brick wall, so you can make a difference for them.

 

 

 

 

Database Growth Trends – Idera Diagnostic Manager

How fast are your databases growing?  Should I panic about a 1 TB database that has 100 GB free on disk?  Am I safe with a 100 GB database that also has 100 GB free?  Based on those sizes…I have no idea.

It’s possible that the 1 TB database is purging data as fast as it takes it in, so it’s been at 1 TB for a long time and may never use the 100 GB it has free.  Looking at the 100 GB database, it may have been a new 1 MB database created two months ago and grew 1 MB at a time to get to where it’s at now. (Bonus points if you panicked reading that last sentence.)

The thing to look for is how big that database was before.  What did you write down for how much space it used last week, last month, and several months ago?  Do you remember all those stock broker commercials saying “Past performance doesn’t predict future results” which is supposed to lead you on to think “Yeah, but it’s the best indicator I have.”  The same goes for database growths.  So, you did write down what it was before, right?

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll make a write up on them and give you credit for being awesome supporting your product like that.  I can only write against what I can test against, it’s nothing personal.

What I’m Watching

I want to know when my files will hit an autogrowth and when they’ll run out of space.

Running out of space is pretty obvious.  If you’re out of space, every command that needs more space will fail, and users will run to the supply closet for torches and pitchforks.  It’s a bad day for the DBA.

Autogrowth is less of an issue and less obvious.  That’s not saying it’s not an issues, especially for larger databases that can’t take advantage of IFI.  The command that needs more space, and any that pile up behind it, will wait patiently (barring any timeouts) for the growth to take place.  However, SQL Server can be more patient than your users or applications.

That being said, if I knew I’d need more space than is allocated in the database using a simple script then I could do it when no one was waiting for the space.  No complaints, just make the files larger and move on with my life.  Oh, happy days.

To get all of this information I need to know a couple things:

  • How much space is used in each filegroup
  • How much space is allocated to each filegroup
  • How much can each filegroup grow according to disk free space and growth rates

Idera DM Captures That

Well, Idera does a pretty good job.  It captures the data size (used), data expansion (growth by growth rate and disk free space), and data file size (allocated).  The problem is that it captures this stuff on the database level, not the filegroup level.  This means that it’s possible for a filegroup to run out of space while the database as a whole has tons of free space.

I’m currently working in an environment where all data files for a single database are on the same drive, so it’s less concerning to me.  It is something to keep in mind, and may be very important to your environment.  If this is critical to you, I touch on how to capture this data yourself below.

If separating filegroups isn’t an issue for you (it often isn’t) then all you have to do to trend growths with Idera is to write a very long script that hits SQLdmRepository..DatabaseSize a couple times.  My script goes a little like this:

/*
MaxUsed is always from a week long range.  The reason is that this captures any spikes in usage such as off-hours index maintenance.
*/

;WITH MostRecent AS
(
	SELECT X.DatabaseID
		, X.UTCCollectionDateTime
		, S.InstanceName
		, D.DatabaseName
		, DataFileSizeMB = DataFileSizeInKilobytes/1024
		, DataFilePotentialSizeMB = (X.DataFileSizeInKilobytes + X.DataExpansionInKilobytes)/1024
	FROM (
			SELECT DS.DatabaseID
				, RowNum = DENSE_RANK() OVER (PARTITION BY DS.DatabaseID ORDER BY UTCCollectionDateTime DESC)
				, UTCCollectionDateTime
				, DataFileSizeInKilobytes
				, DataExpansionInKilobytes
			FROM SQLdmRepository..DatabaseSize DS
			WHERE UTCCollectionDateTime > GETUTCDATE()-7 
		) X
		INNER JOIN SQLdmrepository..SQLServerDatabaseNames D ON X.DatabaseID = D.DatabaseID
		INNER JOIN SQLdmrepository..MonitoredSQLServers S ON S.SQLServerID = D.SQLServerID 
	WHERE RowNum = 1
)
, ThisWeek AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime > GETUTCDATE()-7 
	GROUP BY DS.DatabaseID 
)
, OneMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-37 AND GETUTCDATE()-30 
	GROUP BY DS.DatabaseID 
)
, ThreeMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-97 AND GETUTCDATE()-90 
	GROUP BY DS.DatabaseID 
)
, SixMonth AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-187 AND GETUTCDATE()-180 
	GROUP BY DS.DatabaseID 
)
, OneYear AS
(
	SELECT DS.DatabaseID
		, MaxUsedMB = Max(DataSizeInKilobytes + TextSizeInKilobytes + IndexSizeInKilobytes)/1024
	FROM SQLdmRepository..DatabaseSize DS
	WHERE UTCCollectionDateTime BETWEEN GETUTCDATE()-365 AND GETUTCDATE()-358 
	GROUP BY DS.DatabaseID 
)
, PreResults AS 
(
	SELECT MR.InstanceName
		, MR.DatabaseName 
		, MR.DataFileSizeMB
		, ThisWeek_MaxUsedMB = CAST(W.MaxUsedMB AS INT)
		, OneMonth_MaxUsedMB = CAST(M1.MaxUsedMB AS INT)
		, ThreeMonth_MaxUsedMB = CAST(M3.MaxUsedMB AS INT)
		, SixMonth_MaxUsedMB = CAST(M6.MaxUsedMB AS INT)
		, OneYear_MaxUsedMB = CAST(M12.MaxUsedMB AS INT)
		, NextGrowthInDays_OneMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M1.MaxUsedMB + .0001)*30 AS BIGINT)
		, NextGrowthInDays_ThreeMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M3.MaxUsedMB + .0001)*90 AS BIGINT)  
		, NextGrowthInDays_SixMonth = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M6.MaxUsedMB + .0001)*180 AS BIGINT)  
		, NextGrowthInDays_OneYear = CAST((MR.DataFileSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M12.MaxUsedMB + .0001)*360 AS BIGINT)  
		, OutOfSpaceInDays_OneMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M1.MaxUsedMB + .0001)*30 AS BIGINT)  
		, OutOfSpaceInDays_ThreeMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M3.MaxUsedMB + .0001)*90 AS BIGINT)  
		, OutOfSpaceInDays_SixMonth = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M6.MaxUsedMB + .0001)*180 AS BIGINT)  
		, OutOfSpaceInDays_OneYear = CAST((MR.DataFilePotentialSizeMB-W.MaxUsedMB)/(W.MaxUsedMB - M12.MaxUsedMB + .0001)*360 AS BIGINT) 
	FROM MostRecent MR
		INNER JOIN ThisWeek W ON MR.DatabaseID = W.DatabaseID
		LEFT JOIN OneMonth M1 ON MR.DatabaseID = M1.DatabaseID AND W.MaxUsedMB > M1.MaxUsedMB
		LEFT JOIN ThreeMonth M3 ON MR.DatabaseID = M3.DatabaseID AND W.MaxUsedMB > M3.MaxUsedMB
		LEFT JOIN SixMonth M6 ON MR.DatabaseID = M6.DatabaseID AND W.MaxUsedMB > M6.MaxUsedMB
		LEFT JOIN OneYear M12 ON MR.DatabaseID = M12.DatabaseID AND W.MaxUsedMB > M12.MaxUsedMB
)
SELECT InstanceName
	, DatabaseName 
	, NextGrowthInDays_Min = CASE WHEN NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_ThreeMonth , 1000000)
										AND NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_SixMonth, 1000000) 
										AND NextGrowthInDays_OneMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_OneMonth 
								WHEN NextGrowthInDays_ThreeMonth <= ISNULL(NextGrowthInDays_SixMonth, 1000000)
										AND NextGrowthInDays_ThreeMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_ThreeMonth
								WHEN NextGrowthInDays_SixMonth <= ISNULL(NextGrowthInDays_OneYear, 1000000)
									THEN NextGrowthInDays_SixMonth
								ELSE NextGrowthInDays_OneYear
							END
	, OutOfSpaceInDays_Min = CASE WHEN OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_ThreeMonth, 1000000)
										AND OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_SixMonth, 1000000)
										AND OutOfSpaceInDays_OneMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_OneMonth 
								WHEN OutOfSpaceInDays_ThreeMonth <= ISNULL(OutOfSpaceInDays_SixMonth, 1000000)
										AND OutOfSpaceInDays_ThreeMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_ThreeMonth
								WHEN OutOfSpaceInDays_SixMonth <= ISNULL(OutOfSpaceInDays_OneYear, 1000000)
									THEN OutOfSpaceInDays_SixMonth
								ELSE OutOfSpaceInDays_OneYear
							END
	, DataFileSizeMB
	, ThisWeek_MaxUsedMB
	, OneMonth_MaxUsedMB 
	, ThreeMonth_MaxUsedMB 
	, SixMonth_MaxUsedMB
	, OneYear_MaxUsedMB
	, NextGrowthInDays_OneMonth 
	, NextGrowthInDays_ThreeMonth 
	, NextGrowthInDays_SixMonth 
	, NextGrowthInDays_OneYear 
	, OutOfSpaceInDays_OneMonth
	, OutOfSpaceInDays_ThreeMonth
	, OutOfSpaceInDays_SixMonth
	, OutOfSpaceInDays_OneYear 
FROM PreResults
WHERE OutOfSpaceInDays_OneMonth < 100
	OR OutOfSpaceInDays_ThreeMonth < 100
	OR OutOfSpaceInDays_SixMonth < 100
	OR OutOfSpaceInDays_OneYear < 100
	OR NextGrowthInDays_OneMonth < 15
	OR NextGrowthInDays_ThreeMonth < 15
	OR NextGrowthInDays_SixMonth < 15
	OR NextGrowthInDays_OneYear < 15
ORDER BY DataFileSizeMB DESC

What this does is look at the trends based on what the database used 30, 90, 180, and 360 days ago to try to predict the future.  By “predict the future” I mean it’s not perfect, and can’t be perfect.  However, it gives you very good data with a few false positives and very few false negatives.

Speaking of this not being perfect, it double-dips.  If there are two databases growing on the same drive with 10 GB free, it says that each one has 10 GB of growth available.  I was not able to rewrite this to take that into account while also accounting for databases with multiple files in a single filegroup spread across multiple drives.

Now I’m two weeks ahead of projected file growths and over three months ahead of projected out-of-space errors.

This is very basic information, and I’d have trouble imagining any SQL monitoring software not capturing this in a table you can query.

If I Didn’t Have Idera…

If I didn’t have Idera, or any other monitoring software I could query, I’d be going out at least once a day and pulling the allocated and used space from every file on every server.  I may want to do it more often to find peak usage, such as when online index rebuilds are using twice their typical space over the weekends; I take advantage of Idera capturing this hourly.  I’ll capture it using something like this:

SELECT @SQL = 
'USE [?]

INSERT INTO DBA..FileSizes (DateAdded, Database_ID, File_ID, Data_Space_ID, SizeMB, UsedMB)
	select GETDATE() 
		, Database_ID = DB_ID()
		, f.File_ID
		, f.data_space_id
		, SizeMB = ((f.size*8/1024)) 
		, UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
	from sys.database_files f
	'

exec sp_MSforeachdb @SQL

Then I’d need to see how much free space I have on each drive, so I’d go to Ed Wagner’s article Time and Space: How to Monitor Drive Space in SQL Server.  Knowing that we use MountPoints in places, I’d have to change his script to use “volume” instead of “logicaldisk” (also mentioned in his comments).  This minor change is about the best case scenario you’ll run across.

Remember, the scripts you find on the internet are typically designed to work perfect on the AUTHOR’s system, not yours.  It’s not Ed’s fault they don’t use MountPoints, in fact he should probably be congratulated for not making his environment more complex than it needs to be.  He should also be congratulated on not making a change to his article that he can’t properly test.  At least you know you have to be extra careful testing this change instead of being led to believe the author did something they couldn’t do properly.

After getting the drive’s free space, I’d have to look at the file growth rates to see how large my files could grow.  Keep in mind that growth in sys.database_files could be either a percentage or the number of 8kb pages the file will grow by, depending on the value in is_percent_growth.  Static growths are easy, just divide by 128 and that’s how many MBs it will grow by.  It’s easier (and better for your server) to change percentage growths to MB than it is to look up the logarithmic functions used to calculate compound interest that also apply here.

Now that you have the free space on the drive and the growth rates on the files, you can calculate the maximum size the file can grow to.  Use this with trends on the used space to calculate when that size won’t be enough for you.

Make sure you test your results, you’ll have to support it.

So Your Database Is Growing…

Databases get larger, it’s what they do.  A normal DBA then goes through and makes sure that there’s room for that growth.  A better DBA will do that, too, but then follow it up by figuring out why the database is growing and try to justify it.

I have seen way too many databases out there that don’t purge old data that’s useless to the company.  Sometimes the purge was never written, sometimes the code to do that is never being called for several somewhat humorous reasons.  Any way you look at it, the data is growing and it’s easy to fall into the trap of just adding space every month, forever.

The way I handle this is by watching my tables and seeing how they’re growing.  These are treated just like databases where the largest one isn’t necessarily the growth issue, and looking at historical data to see the growth trend is a better solution.

In the past I wrote a post called Monitoring Database and Table Sizes to get this info yourself.  In the future, I’ll write Table Growth Trends – Idera Diagnostic Manager.  At the time of this writing, I’m planning on the Table Growth post coming out three weeks after this one.  Until then, I put this script up on my Idera Diagnostic Manager Scripts page, which is where I keep these posts organized and where I put my scripts to wait for me to blog about them.

 

Multiserver Queries

You can run queries against multiple servers at once, and it’s quite useful for a number of reasons.  I use it to check settings, verify backups and DBCC ran recently on unmonitored servers, make sure all servers rebooted during the maintenance window, and many other reasons.

This is all done through registering servers on SQL Server and opening a new query for the group.  I’ll walk you through that then run a couple queries that I find useful.

Unfortunately, this method is so easy that you’ll only ever have to see it once and won’t be back to visit this blog post again.  Hopefully the scripts below will inspire a return visit.

Setting It Up

Registered servers are local to your SSMS install.  You’re not changing server settings by setting this up, and this will be lost if you reinstall SSMS without exporting your settings.

First, open the Registered Servers pane by going to View / Registered Servers.

View Registered Servers

Add a new server group and call it anything you want.  I’ll call mine blog because that’s what I’m using it for.  I love the idea of having Prod and Non-Prod groups, especially since I can run a query against all my Non-Prod servers that aren’t monitored to verify backups and DBCC checks are being done.

It’s important to note at this point that you can have a server in more than one group and groups can be nested.  So in my prod group I have groups for each data center, then the servers are at that level.  I could also have the same servers in functional groups, such as Finance, IT, and Why_Me.

Create Registered Servers Group

 

Right-click and do a New Server Registration, and the options should be pretty natural to you.  For this example, I used aliases for “Registered Server Name”, but I stick to the default when doing this for myself.

At the most basic level, it should look like this.

Registered Server Group

That’s it, you’re set up.

Running Queries

This is easier than the setup.

Right-click on a group and click on New Query.

Multiserver New Query

It opens a new query window with the only oddity being instead of a yellowish bar saying “Connected. (1/1)”, now you have a pink bar saying “Connected. (2/2)” along with the group name.

Multiserver Connected

This will be connected to all servers directly in the group and in groups nested within that group.  There will be a long delay if one of the servers isn’t available, so it’s worth while to keep these groups cleaned up.

Now you run a query.  Here’s how the results are grouped by default.  Although they showed up in order for me, that was a coincidence.  They’ll actually be in the order they came back, so a server that returned results in 1 second will come before a server that returned results in 2 seconds.

Multiserver Results

You can go to Tools / Options and change a couple things, but the defaults tend to work great.

Multiserver Query Options

Now that it’s set up, right-click on your group and go to Tasks / Export… to save this off somewhere.  The only place this is saved right now is on your workstation, and those words will make any DBA cringe.

Practical Uses

Doing “SELECT 1” like I did above is always a blast, but it’s not all that productive.  As always, I encourage you to play around with it yourself, but here’s some things to get you started.

Server (services) last started

When I do Windows Updates I reboot all the database servers every month, even if the updates don’t say the needed the reboot at that time.  However, can be difficult to tell which servers I rebooted and which ones just aren’t begging for a reboot.  Since TempDB is recreated each time services start, I check to make sure that happened during the WSUS window.

SELECT create_date FROM sys.databases WHERE database_id = 2

If I was doing this for hundreds of servers, I’d simply change the query to say AND create_date < GetDate()-1 then only my problem servers would return results.

DBCC and Backups up-to-date

My non-prod servers aren’t monitored, so things go missed sometimes.  Backup jobs may be failing and I don’t know until I have time to check everything on all my non-prod servers.  Unfortunately, that doesn’t happen very often.  As with everything else in life, I cheat as much as possible to make it easier.  This time by running this as a multiserver query:

--Leaving some overlap to avoid false alarms due to timing issues
--We want to know if there hasn't been:
--	Full backup in the last 8 days (should happen every 7)
--	Full or diff backup in the last 2 days (should happen every 1)
--  Log backup in the last 4 hours (should happen every 1 or more often)
--  DBCC CheckDB run successfully in the last 8 days (should happen every 7)

SET NOCOUNT ON

DECLARE @FullDays Int
SELECT @FullDays = 8

DECLARE @FullDiffDays Int
SELECT @FullDiffDays = 2

DECLARE @LogHours Int
SELECT @LogHours = 4

DECLARE @DBCCDays Int
SELECT @DBCCDays = 8

DECLARE @SQLText NVarChar(4000)

IF OBJECT_ID('tempdb..#Results') IS NOT NULL BEGIN
	DROP TABLE #Results
END

CREATE TABLE #Results 
(
	ResultText NVarChar(2000)
)

IF Object_ID('TempDB..#DBCC_Temp') IS NOT NULL BEGIN
	DROP TABLE #DBCC_Temp
END

CREATE TABLE #DBCC_Temp 
(
	ParentObject VARCHAR(255),
	[Object] VARCHAR(255),
	Field VARCHAR(255),
	[Value] VARCHAR(255)
)

--When was the last full backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a full backup in over ' 
	+ Cast(@FullDays as VarChar(2)) 
	+ ' day(s).  Last full backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d 
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type = 'D'
WHERE d.name <> 'tempdb'
	AND d.state = 0
GROUP BY d.name 
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < GetDate()-@FullDays 

--When was the last full or differential backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a full or diff backup in over ' 
	+ Cast(@FullDiffDays as VarChar(2)) 
	+ ' day(s).  Last full or diff backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type IN ('D', 'I')
WHERE d.name <> 'tempdb'
	AND d.state = 0
GROUP BY d.name  
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < GetDate()-@FullDiffDays 

--When was the last log backup?
INSERT INTO #Results (ResultText)
SELECT CAST(serverproperty('ComputerNamePhysicalNetBIOS') as sysname) 
	+ ISNULL(NULLIF('\' + CAST(serverproperty('InstanceName') as sysname),'\MSSQLSERVER'),'') 
	+ '.' 
	+ d.name 
	+ ' hasn''t had a log backup in over ' 
	+ Cast(@LogHours as VarChar(3)) 
	+ ' hour(s).  Last log backup was ' 
	+ ISNULL('on ' + Convert(VarChar(16), max(b.backup_finish_date), 120), 'NEVER') + '.'
FROM sys.databases d
	LEFT JOIN msdb..backupset b ON d.name = b.database_name AND b.type = 'L'
WHERE d.name NOT IN ('tempdb', 'model')--Model is by default in full, rarely backed up, and not modified enough to complain about.
	AND d.recovery_model_desc <> 'SIMPLE'
	AND d.state = 0
GROUP BY d.name  
HAVING ISNULL(max(b.backup_finish_date), '2000-01-01') < DateAdd(Hour, -1 * @LogHours, GetDate()) 

SELECT @SQLText = N'use [?]; 

DECLARE @DBName sysname
DECLARE @HasRetried bit 

SELECT @DBName = CAST(''?'' as VarChar(256)) 
SELECT @HasRetried = 0

TRUNCATE TABLE #DBCC_Temp

IF EXISTS (SELECT d.state FROM sys.databases d WHERE d.name = @DBName AND d.State = 0 /*Online*/ AND d.is_read_only = 0 AND d.create_date < GetDate()-7) AND ''tempdb'' <> @DBName BEGIN
	INSERT INTO #DBCC_Temp
	EXECUTE(''DBCC PAGE (['' + @DBName + ''], 1, 9, 3)WITH TABLERESULTS, NO_INFOMSGS'')

	INSERT INTO #Results (ResultText)
	SELECT CAST(serverproperty(''ComputerNamePhysicalNetBIOS'') as sysname) 
		+ ISNULL(NULLIF(''\'' + CAST(serverproperty(''InstanceName'') as sysname),''\MSSQLSERVER''),'''') 
		+ ''.''
		+ @DBName
		+ '' hasn''''t had a successful DBCC check in the last ' + CAST(@DBCCDays as VARCHAR(2)) + ' day(s).  Last successful check was ''
		+ ISNULL(NULLIF(MAX([Value]),''1900-01-01 00:00:00.000''), ''NEVER'')
	FROM #DBCC_Temp
	WHERE Field = ''dbi_dbccLastKnownGood''
	HAVING MAX([Value]) < GetDate()-' + CAST(@DBCCDays as VARCHAR(2)) + '
END
'

exec sp_MSforeachdb @SQLText 

SELECT ResultText 
FROM #Results

DROP TABLE #Results
DROP TABLE #DBCC_Temp

If I get any results from this, I have something to fix.  This also works for daily full backups because it does diff or full, but you may want to change it around to fit your maintenance schedules.

Version Check

If you’re evaluating what version all of your servers are on so you can make plans to patch them, just run this:

SELECT @@VERSION

You have everything you need to compare it to SQLServerBuilds.BlogSpot.com.  The only thing is that I tend to copy/paste the results into Excel so I can sort them there.

More Uses

I use registered servers and multiserver queries for more than this.  In fact, I wrote this post so I could link to it in a future post on Instant File Initialization.  Once I have that one here, I’ll have four examples of uses.

Other readers and I would benefit from hearing more uses of this feature.  Let us hear the best uses you have, preferably with a script.

Entry-Level Content Challenge

This is my third post in Tim Ford’s Entry Level Content Challenge.  I was tempted to write a post and just tell people to run a script as a multiserver query.  After all, it’s very simple after you see it for the first time.  However, you have to see it for the first time before you get to that point.

This is the exact purpose of his challenge.  Read over it and consider joining in.

 

Recompile Monitoring using XEvents

Recompiles can be a hidden bottleneck on your server, and it may not be too obvious.  I should know, it happened to me.  Unfortunately for me, that was before I was a proponent of tracing or using XEvents to watch for recompiles.  Fortunately for you, the experience changed me, and I enjoy sharing my experiences.

What happened to me was a proc that was called WAY too often was taking too long cumulatively.  I knew the true root cause of my issue was in the application which was calling the proc once per row, often for about hundred rows, when loading a pretty important page on a web application.  However, I just found out the service desk has been getting complaints for a while, and asking the development team to help me rewrite how the page works ended in an answer that didn’t even sound like now.  Well, I guess “no” does kinda sound like “now”…

The proc was pretty simple, taking on average about 0.150 seconds to run.  Sounds great until you remember that it’s being called about a hundred times for a page load on a web application.  100 * 0.15 = 15 seconds for the page to load, not counting anything else running.  Users were threatening to not be users anymore even though they were happy with most other aspects.

I got into it, reworded the query a little, created a perfect index for it, and it got it down to 0.100 seconds to run.  This was a large concession on my part because I HATE perfect indexes for a single query, and feel strongly that indexes should focus more on being reusable by many queries.  However, even that only got the web page to load in 10 seconds.

Traces of the proc confused me.  The statements were running in 0.010 seconds, and there was a 0.090 second gap between runs.  That gap was 9 of my 10 seconds.  At 10 seconds users avoid a website, but at 1 second users have to be prompted to say it’s not as fast as they’d like.

It took me longer than I’d like to admit to figure out that every run was kicking off a recompile that took 0.090 seconds.  This is the point where I started tracing for the recompile reason, and found out it was because statistics were being updated.  However, there were no stats in my database that were updated in the last 15 minutes, so I would have never guessed this was the reason without tracing on this event.

This proc loaded data into a temp table, then did some processing on that temp table, and I figured out that there were enough rows being loaded into the temp table to update the stats on it.  Removing the temp table was a disaster, it raised my execution time by 0.010 seconds.  Yeah, “disaster” can be relative…

Then I came up with a very odd solution that I’ve never seen or heard of before that point, and I haven’t used much since that point.  The temp table was swapped out for a table variable.  No stats to be updated, so no recompile.  Although it assumed there would only be one record in the table variable (it always does), the execution plan was the same.  The only difference was the total 0.100 second duration was dropped to 0.010 after losing the 0.090 recompile.

User experience was now a web page that loaded up in just over 1 second, with most of it still being this query.  There wasn’t anything more I could do except wait until the page could be rewritten by development where I could rewrite it to get all the results at once.  That process took time, and customers were not impatiently watching to see when it would happen.

I wrote this talking about the users, and users care about duration.  However, that’s not the whole story.  The 0.090 recompile was pure CPU time, it happening 100 times per page call added up to 9 seconds of pure CPU time, and the users having to use this page hundreds of times a day added up to a noticeable impact on the server performance when it went away.  Now I admit that the users only used this when they HAD to instead of when they should have, so it was balanced out a little by the users opening the page about twice as often.  Even then, it was still obvious that it was gone when comparing long-term CPU usage.

The Management Response

Side note for all of you managers out there…  My manager pulled me aside the next week saying “step into my office for a second.”.  I admit, that statement always makes you nervous, even when you KNOW everything is going great.  He proceeded to tell me that before I started with the company they pulled developers and a DBA into a task force that spent tons of money to improve performance, upgrading servers, upgrading the SAN, upping the WAN speed, and looking into the database.  Then he said that I did more in one day than they did in 6 months without spending any money, and that I should take my wife out for a nice dinner and bring him the receipt.

Now I know that the hardware upgrades helped out, that I only helped a single process that day, and many other things that could belittle what happened.  I also know that spending $100 was very minor in terms of the effect that change made for the company.  That’s not how I saw things.

What I saw was a response that stuck with me, a lot.  Management stepped back to recognize a difference someone made, and made sure they knew they were appreciated.  Then took it a step further bringing the family into it, having my wife feeling proud and appreciated at the same time.

I know the constraints management is under and all the things they can’t do.  That being said, the only thing that frustrates me about this situation is that I don’t see others in management taking the same approach.  I swore from that point that if I ever became a manager my one odd request would be that I would want a very small budget in writing to use at my digression.  Small things to show appreciation go a long way.

My Challenge

The challenging part for me, and where I felt I didn’t do as well as I should have done, is that this flew under my radar.  I was tracing for anything that took over 1 second duration and was looking at the most expensive queries all the time.  However, this one took 0.150 seconds and never accumulated time in the plan cache due to the recompiles.

Here’s the thing, I pride myself on being specialized in database performance, so I feel I should have seen this before having someone in customer service casually mention it to me in conversation.  It wasn’t even a request to look into it, just mentioning he just got off a call with the customer about the page, and that the page never worked right with no one being able to fix it.

After finding and resolving the root cause of the proc, I also wanted to find and resolve how this flew under my radar.  This came up to two primary responses.  First, the page was taking too long as a whole, so I requested stats on a per-page basis – how long it took to load each time and how long cumulatively.  Second, I started watching for recompiles from time to time.  Either one of these would have led me to this issue well before that point.

This post is not about monitoring performance on a website, so, although it’s a very important thing to do, I won’t be mentioning that part again.

Monitoring Recompiles

This post is about recompiles, how they can drag you down, and how you can find out about them.  I covered how they drug me down, and in past posts I casually mentioned you should use server-side traces or XEvents to monitor them.  Then I left it up to you to figure out how.  Lets try a different approach today.

Recompiles aren’t something I’m watching for constantly, so I’m not going to run any kind of monitoring constantly for them.  What I’ll do is run an XEvent session for 24 hours and analyze what it picked up.  If it didn’t recompile often that day it almost definitely does not belong on my priority list, so this works for me.

Here’s my session.  It starts immediately, writes to five 10MB rollover files, and will not start the next time your services restart.

IF EXISTS (SELECT 1 FROM sys.server_event_sessions WHERE name = 'Recompiles')
	DROP EVENT SESSION [Recompiles] ON SERVER;
GO

CREATE EVENT SESSION [Recompiles] ON SERVER 
ADD EVENT sqlserver.sql_statement_recompile
(	SET collect_object_name=(1)
		, collect_statement=(1)
	ACTION
		(sqlserver.database_id
			, sqlserver.database_name
			, sqlserver.session_id
			, sqlserver.sql_text
			, sqlserver.username))
	ADD TARGET package0.event_file
		(SET filename=N'Recompiles'
			, max_file_size=(10)
)
WITH 
	(MAX_MEMORY=4096 KB
		, EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS
		, MAX_DISPATCH_LATENCY=30 SECONDS
		, MAX_EVENT_SIZE=0 KB
		, MEMORY_PARTITION_MODE=NONE
		, TRACK_CAUSALITY=OFF
		, STARTUP_STATE=OFF)
GO

ALTER EVENT SESSION Recompiles
ON SERVER
STATE = START;

Querying the Recompiles XEvent Session

Then I’ll use this query to look at my session.  It’s a little awkward in two ways.

First, I only leave the session running for a day, but this script needs it to be running to read from it.  The only reason for that is it pulls the file path from the running sessions.  You can get around this by putting the path into the script manually.  As for me, I usually query it and figure out my next steps while it’s running, then I stop the session.

Second, and this is quite humorous to me, it recompiles due to my temp tables every time it runs.  It’s a great example of what to ignore because it’s a low-cost recompile that happens on a seldom-used ad-hoc query where performance is not a major priority.  However, because temp table definitions are cached for proc plans, this goes away if you turn this script into a proc.  I laugh every time…

When you get the results in the #Queries temp table in the end they’re what you’re used to with it just being a table and all the XML out of the way.  The query at the end of the script views them all, but I’ll play around with it doing aggregations and more.  I encourage you to do the same.  This is also why it drops the table if it exists in the beginning instead of dropping them at the end.

DECLARE 
	@SessionName SysName 
	, @TopCount Int = 1000
	
SELECT @SessionName = 'Recompile'

--SELECT @SessionName = 'system_health'
/* 
SELECT * FROM sys.traces

SELECT  Session_Name = s.name, s.blocked_event_fire_time, s.dropped_buffer_count, s.dropped_event_count, s.pending_buffers
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE target_name = 'event_file'
--*/

IF OBJECT_ID('tempdb..#Events') IS NOT NULL BEGIN
	DROP TABLE #Events
END

IF OBJECT_ID('tempdb..#Queries') IS NOT NULL BEGIN
	DROP TABLE #Queries 
END

DECLARE @Target_File NVarChar(1000)
	, @Target_Dir NVarChar(1000)
	, @Target_File_WildCard NVarChar(1000)

SELECT @Target_File = CAST(t.target_data as XML).value('EventFileTarget[1]/File[1]/@name', 'NVARCHAR(256)')
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE s.name = @SessionName
	AND t.target_name = 'event_file'

SELECT @Target_Dir = LEFT(@Target_File, Len(@Target_File) - CHARINDEX('\', REVERSE(@Target_File))) 

SELECT @Target_File_WildCard = @Target_Dir + '\'  + @SessionName + '_*.xel'

--SELECT @Target_File_WildCard
CREATE TABLE #Events 
(
	event_data_XML XML
)

INSERT INTO #Events 
SELECT TOP (@TopCount) CAST(event_data AS XML) AS event_data_XML
FROM sys.fn_xe_file_target_read_file(@Target_File_WildCard, null, null, null) AS F
ORDER BY File_name DESC
	, file_offset DESC 

SELECT  EventType = event_data_XML.value('(event/@name)[1]', 'varchar(50)')
	, ObjectName = event_data_XML.value ('(/event/data  [@name=''object_name'']/value)[1]', 'sysname')
	, ObjectType = event_data_XML.value ('(/event/data  [@name=''object_type'']/text)[1]', 'sysname')
	, UserName = event_data_XML.value ('(/event/action  [@name=''username'']/value)[1]', 'sysname')
	, Statement_Text = ISNULL(ISNULL(event_data_XML.value ('(/event/data  [@name=''statement'']/value)[1]', 'NVARCHAR(4000)'), event_data_XML.value ('(/event/data  [@name=''batch_text''     ]/value)[1]', 'NVARCHAR(4000)')), event_data_XML.value ('(/event/data[@name=''wait_type'']/text)[1]', 'NVARCHAR(60)'))
	, Recompile_Cause = event_data_XML.value ('(/event/data  [@name=''recompile_cause'']/text)[1]', 'sysname')
	, TimeStamp = DateAdd(Hour, DateDiff(Hour, GetUTCDate(), GetDate()) , CAST(event_data_XML.value('(event/@timestamp)[1]', 'varchar(50)') as DateTime2))
	, SPID = event_data_XML.value ('(/event/action  [@name=''session_id'']/value)[1]', 'BIGINT')
	, Database_Name = DB_Name(event_data_XML.value ('(/event/action  [@name=''database_id'']/value)[1]', 'BIGINT'))
	, EventDetails = event_data_XML 
INTO #Queries
FROM #Events

SELECT q.EventType
	, q.ObjectType
	, q.ObjectName
	, q.Statement_Text
	, q.Recompile_Cause
	, q.TimeStamp
	, q.SPID
	, q.UserName
	, q.Database_Name
	, q.EventDetails
FROM #Queries q
ORDER BY TimeStamp DESC 

Hope It Helps

Let me know how this helps.  If anything can be done to make my approach or my scripts better, I’m always open to improvement.

 

I/O Baseline Comparison – Idera Diagnostic Manager

What servers need the most attention when it comes to I/O?  When I increase the memory on a server what effect does it have on I/O?  What was it before the change and what was it afterwards?

These are tough questions, and I have a script that I hope will help you out.

Most monitoring tools, including Idera, are great at looking at what a single server is doing right now, but not so much with comparing performance and loads between multiple larger windows of time and seeing which servers are doing the most work.  So I wrote a script to see how much I/O load I’m doing on each server and compare that to how much I did in that same time window in previous weeks.

Your servers will vary, a lot.  They’ll vary from my servers, and they’ll vary from each other as well.  Depending on the workload, specifically the difference between OLTP and OLAP, higher I/O may be expected and additional memory will have less of an impact.  For OLAP, if you have 32 GB of memory and read a 64 GB table followed by a different 92 GB table then it’s all physical reads.  If you up your memory to 48 GB, they’re still all physical reads.  The cost to up this server (or VM) is the same as the cost to upgrade an OLTP server from 32 GB to 48 GB, but the story will end quite a bit differently there.

The problem is that this change, like any other change, should be measured before, measured afterwards, then have the costs of the change analyzed and justified.  The results you’ll see below coupled with my Wait Stats Baseline Comparison using Idera post will help you figure out where to make a change like that and how to measure it.

There are other reasons you’d want to look at this data as well.  Even a new index that dramatically dropped the physical reads for one query could have the opposite effect on your server as a whole.  The point is, even something very straightforward needs to be verified that it’s doing what you expect it to.  Looking at the details you see in the canned reports is great, but so is the ability to stand back and look at the big picture that I’ll get into here.

The solutions I’ve put into place using this type of query to varied from query tuning, timing of larger processes, adding memory like I mentioned above, buying faster disks, and many other options including acknowledging we were doing a ton of reads and leaving it be.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it:

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, but pulling this data will just make in more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll tack them on here and give you credit for writing them.  I can only write against what I can test against, it’s nothing personal.

I/O Differences Script

What I look for are the true apples-to-apples comparisons, so I’m typically looking at a longer time range (I like 24 hours) during peak hours (12:00 to 21:00 UTC on weekdays for me, yours will differ, even server-to-server) comparing this week to the last couple weeks.

DECLARE @StartTime DateTime
	, @EndTime DateTime
	, @InstanceName sysname
	, @Weekdays bit
	, @BusinessHours bit
		
SET @EndTime = GetUTCDate()
SET @StartTime = DateAdd(Hour, -24, @EndTime)
SET @InstanceName = NULL --Do 'Server\Instance' for individual server
SET @Weekdays = 1
SET @BusinessHours = 1

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC


SELECT @StartTime = @StartTime - 7 
	, @EndTime = @EndTime - 7

SELECT S.InstanceName
	, StartTime = @StartTime 
	, EndTime = @EndTime
	, Reads_GB = CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) 
	, ReadAhead_GB = CAST(SUM(St.ReadAheadPages)/128/1024.0 AS DEC(20,1))
	, Writes_GB = CAST(SUM(St.PageWrites)/128/1024.0 AS DEC(20,1)) 
	, Lookups_GB = CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) 
	, PctPhysical = CAST(CAST(SUM(St.PageReads)/128/1024.0 AS DEC(20,1)) / CAST(SUM(St.PageLookups)/128/1024.0 AS DEC(20,1)) * 100 as DEC(20,1))
	, AvgPLE = Avg(St.PageLifeExpectancy)
	, AvgCache_MB = AVG(St.BufferCacheSizeInKilobytes)/1024
FROM SQLdmRepository..MonitoredSQLServers S
	INNER JOIN SQLdmRepository..ServerStatistics St ON S.SQLServerID = St.SQLServerID
WHERE UTCCollectionDateTime BETWEEN @StartTime AND @EndTime 
	AND (DATEPART(WEEKDAY, UTCCollectionDateTime) BETWEEN 2 and 6 or @Weekdays = 0)
	AND (DATEPART(HOUR, UTCCollectionDateTime) BETWEEN 12 and 21 OR @BusinessHours = 0)
	AND (UPPER(S.InstanceName) = UPPER(@InstanceName) OR @InstanceName IS NULL)
GROUP BY S.InstanceName
ORDER BY 4 DESC

The Results

This is, by default, going to show you your biggest I/O users that are being monitored.  It’s great to know, but it’s not the whole picture.  Some servers are going to be almost pure OLTP, others are going to be almost pure OLAP, and most will be some kind of mix.  The more OLAP you do, the more physical reads (Reads_GB and ReadAhead_GB) should be expected.  It’s really not a bad thing, it’s how your databases are used.

Then there’s the realistic version of the users’ expectation (not always the version they tell you) to consider.  Are the queries coming back as fast as they should?  Large annual reports can take a bit, but a website that takes more than a couple seconds isn’t going to be used.

This report shows the load you’re throwing at the disks, but user expectations are measured in seconds, not gigabytes.  Yeah, more gigabytes means more seconds, but it’s not like there’s a solid ratio.

That being said, the true value of this information is when you combine it with the type of server it is, the reasonable expectations of the users, the amount of time the server is waiting on I/O, and so much more.  To start you out, look at the Wait Stats Baseline Comparison using Idera, too.

If there are OLTP servers that stand out as doing a ton of physical reads and the wait stats show that it’s an issue as well, look into more memory and more time for tuning.  Take it a step further and use all of the trending info (not just this report) to split out I/O load and plan for faster disks.  There’s a lot of potential to use this info to do a lot of good.

My Idera Scripts

I keep all the scripts that query Idera Diagnostic Manager in one place, on my Idera Diagnostic Manager Scripts page.  Before this script had it’s own post, it was posted there.  As I make posts for each script, that page will live on as a directory of what scripts I have to get more use out of this product.

This is the first time I’m doing something like this, but I’ll eventually make other pages that pull together topics like this as well.  It feels good to find a great way to take blogging to the next level like this.

Speaking of taking it to the next level, let me know your thoughts on this script, others you’d like to see, and anything else in the comments below.

Wait Stats Baseline Comparison – Idera Diagnostic Manager

Database servers have to wait on different resources, and these waits are huge to the performance of SQL Server.  Sometimes something changes without our knowledge and is running differently than normal, this is the first place you look to see what’s going on.  On the other side, you may have made a change that could have been anything from increasing memory, tuning a query, moving to different hardware, changing indexes, etc..  All of these things should reduce waits, but how effective were they?

The trick is to know what things looked like before so you have something to compare it to.  If you know what it looked like before then you’ll know how much of a difference you made.

My Default Monitoring Message

I’ll copy/paste this section on multiple posts, so feel free to skip it if you’ve seen it before.

The biggest issue is the word “before”.  If you weren’t watching your servers before then you have nothing to compare the current state of your server to.  Also, if you’re watching it consistently then you’ll be able to make an apples to apples comparison.

Your production servers should have some kind of monitoring software in place, and the monitoring software is already capturing this type of information to help make all the pretty charts and graphs you see through the GUI.  You need this software to run so you can get alerted when things go wrong, and pulling this data will just make it more valuable.

Monitoring software doesn’t perform magic, so you can pull this data on your own if you don’t have any.  However, that’s taking on a lot of responsibility, time, and effort.  If something is collecting the data already, take advantage of that.

For me, I use Idera Diagnostic Manager, so my queries are going to look at the data collected using this tool.  Other monitoring tools are going to collect the same basic data in a database you can read.

If anyone from one of these companies wants to provide me with scripts that fill the same need, I’ll make a write up on them and give you credit for being awesome supporting your product like that.  I can only write against what I can test against, it’s nothing personal.

The Script

There are 3 parameters at the start, and that’s typically all I ever change.  A couple things stick out with these parameters.  The times are in UTC, so I default to GetUTCDate().  I want the time range to be at least an hour so I’m not letting a single query that ran skew the results too much.  Then I’m also converting my instance name to upper case because Idera DM may very well be your only database that’s case sensitive, so I cheat by putting everything in upper case.

My upper case trick isn’t a recommended trick.  It causes scans and possible bad joins.  In this case I know it’s against small tables and that I don’t have two servers with the same name using different cases.  It’s safe here, but not everywhere.

DECLARE @StartTime DateTime
	, @EndTime DateTime
	, @InstanceName sysname
		
SET @EndTime = GetUTCDate()
SET @StartTime = DateAdd(Hour, -24, @EndTime)
SET @InstanceName = UPPER('Server\Instance')

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

IF OBJECT_ID('tempdb..#Now') IS NOT NULL BEGIN
    DROP TABLE #Now
END

IF OBJECT_ID('tempdb..#Yesterday') IS NOT NULL BEGIN
    DROP TABLE #Yesterday
END

IF OBJECT_ID('tempdb..#LastWeek') IS NOT NULL BEGIN
    DROP TABLE #LastWeek
END

IF OBJECT_ID('tempdb..#TwoWeeks') IS NOT NULL BEGIN
    DROP TABLE #TwoWeeks
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 

--Values taken from Paul Randal of SQLskills
--http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
INSERT INTO #IgnoredWaits 
SELECT N'BROKER_EVENTHANDLER'            
UNION SELECT N'BROKER_RECEIVE_WAITFOR'
UNION SELECT N'BROKER_TASK_STOP'               
UNION SELECT N'BROKER_TO_FLUSH'
UNION SELECT N'BROKER_TRANSMITTER'             
UNION SELECT N'CHECKPOINT_QUEUE'
UNION SELECT N'CHKPT'                          
UNION SELECT N'CLR_AUTO_EVENT'
UNION SELECT N'CLR_MANUAL_EVENT'               
UNION SELECT N'CLR_SEMAPHORE'
UNION SELECT N'DBMIRROR_DBM_EVENT'             
UNION SELECT N'DBMIRROR_EVENTS_QUEUE'
UNION SELECT N'DBMIRROR_WORKER_QUEUE'          
UNION SELECT N'DBMIRRORING_CMD'
UNION SELECT N'DIRTY_PAGE_POLL'                
UNION SELECT N'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT N'EXECSYNC'                       
UNION SELECT N'FSAGENT'
UNION SELECT N'FT_IFTS_SCHEDULER_IDLE_WAIT'    
UNION SELECT N'FT_IFTSHC_MUTEX'
UNION SELECT N'HADR_CLUSAPI_CALL'              
UNION SELECT N'HADR_FILESTREAM_IOMGR_IOCOMPLETION'
UNION SELECT N'HADR_LOGCAPTURE_WAIT'           
UNION SELECT N'HADR_NOTIFICATION_DEQUEUE'
UNION SELECT N'HADR_TIMER_TASK'                
UNION SELECT N'HADR_WORK_QUEUE'
UNION SELECT N'KSOURCE_WAKEUP'                 
UNION SELECT N'LAZYWRITER_SLEEP'
UNION SELECT N'LOGMGR_QUEUE'                   
UNION SELECT N'ONDEMAND_TASK_QUEUE'
UNION SELECT N'PWAIT_ALL_COMPONENTS_INITIALIZED'
UNION SELECT N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP'
UNION SELECT N'QDS_SHUTDOWN_QUEUE'
UNION SELECT N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP'
UNION SELECT N'REQUEST_FOR_DEADLOCK_SEARCH'    
UNION SELECT N'RESOURCE_QUEUE'
UNION SELECT N'SERVER_IDLE_CHECK'              
UNION SELECT N'SLEEP_BPOOL_FLUSH'
UNION SELECT N'SLEEP_DBSTARTUP'                
UNION SELECT N'SLEEP_DCOMSTARTUP'
UNION SELECT N'SLEEP_MASTERDBREADY'            
UNION SELECT N'SLEEP_MASTERMDREADY'
UNION SELECT N'SLEEP_MASTERUPGRADED'           
UNION SELECT N'SLEEP_MSDBSTARTUP'
UNION SELECT N'SLEEP_SYSTEMTASK'               
UNION SELECT N'SLEEP_TASK'
UNION SELECT N'SLEEP_TEMPDBSTARTUP'            
UNION SELECT N'SNI_HTTP_ACCEPT'
UNION SELECT N'SP_SERVER_DIAGNOSTICS_SLEEP'    
UNION SELECT N'SQLTRACE_BUFFER_FLUSH'
UNION SELECT N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'
UNION SELECT N'SQLTRACE_WAIT_ENTRIES'          
UNION SELECT N'WAIT_FOR_RESULTS'
UNION SELECT N'WAITFOR'                        
UNION SELECT N'WAITFOR_TASKSHUTDOWN'
UNION SELECT N'WAIT_XTP_HOST_WAIT'             
UNION SELECT N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG'
UNION SELECT N'WAIT_XTP_CKPT_CLOSE'            
UNION SELECT N'XE_DISPATCHER_JOIN'
UNION SELECT N'XE_DISPATCHER_WAIT'             
UNION SELECT N'XE_TIMER_EVENT'

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #Now
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 1
	, @EndTime = @EndTime - 1

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #Yesterday
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 6
	, @EndTime = @EndTime - 6

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #LastWeek
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT @StartTime = @StartTime - 7
	, @EndTime = @EndTime - 7

SELECT s.InstanceName
	, t.WaitType
	, WaitTime_Minutes = SUM(WaitTimeInMilliseconds)/1000/60
	, dStart = Min(UTCCollectionDateTime)
	, dEnd = Max(UTCCollectionDateTime)
INTO #TwoWeeks
FROM SQLdmrepository..WaitStatisticsDetails wsd
	INNER JOIN SQLdmrepository..WaitTypes t ON wsd.WaitTypeID = t.WaitTypeID
	INNER JOIN SQLdmrepository..WaitStatistics ws ON ws.WaitStatisticsID = wsd.WaitStatisticsID
	INNER JOIN SQLdmrepository..MonitoredSQLServers s ON ws.SQLServerID = s.SQLServerID
WHERE ws.UTCCollectionDateTime BETWEEN @StartTime and @EndTime
	AND	UPPER(s.InstanceName) = @InstanceName	
	AND t.WaitType NOT IN (SELECT Wait_Type COLLATE SQL_Latin1_General_CP1_CI_AS FROM #IgnoredWaits)
	--AND t.WaitType LIKE 'PAGEIO%'
GROUP BY s.InstanceName, t.WaitType

SELECT TOP 10 n.InstanceName
	, n.WaitType
	, Now_Minutes = n.WaitTime_Minutes
	, Yesterday_Minutes = d1.WaitTime_Minutes
	, OneWeek_Minutes = d7.WaitTime_Minutes
	, TwoWeeks_Minutes = d14.WaitTime_Minutes  
FROM #Now n
	LEFT JOIN #Yesterday d1 ON n.WaitType = d1.WaitType
	LEFT JOIN #LastWeek d7 ON n.WaitType = d7.WaitType
	LEFT JOIN #TwoWeeks d14 ON n.WaitType = d14.WaitType 
ORDER BY n.WaitTime_Minutes DESC

The Results

The results pretty much speak for themselves.  What were the top 10 things you waited on recently and how much did you wait on them in the past.  It’s great to have a view of it where you’re comparing totals side-by-side.  Seeing the same thing in charts just doesn’t have the same effect for me.

These numbers are going to have natural variances to them.  Did a user run a couple large queries during one interval but not another?  Was the shared storage under different loads?  Really anything can come up.  The longer the period the more you smooth out these numbers.

So, the real question is what do these waits mean?  I’m not going to pretend like I can cover all the wait stats, their meanings, and how to affect them in this post.  What I can say is that there are good resources out there for you.

Paul Randal (b|t) has the best source I know of on his post Wait Statistics, or please tell me where it hurts.

CXPACKET

One wait type I like to talk about, and discredit a bit, is CXPACKET.  This wait type is just saying you used parallelism, nothing more.  I first covered this topic in my post Why worry about CXPACKET.

Think of it in terms of a manager who delegates a large project to multiple employees.  As the employees are hard at work accomplishing the task, the manager goes about their day and records their time as CXPACKET.  The task is typically completed in less overall duration because it was a group effort, but the total amount of work done goes up some because there’s overhead and small duplicated tasks when you delegate to a group.

That less overall duration could make or break your application, so the idea of having CXPACKET isn’t necessarily bad (setting MAXDOP = 1 is rarely a good idea).  However, a task being so large that it needed to be delegated to multiple employees may mean it’s not efficient, making query and index tuning a good approach.  Also, your manager may be too eager to delegate their tasks, so possibly adjust your Cost Threshold for Parallelism as suggested in Jeremiah Peschka’s (b|t) post Five SQL Server Settings to Change.

As I said before, I can’t cover all the wait types here, so go to Paul Randal’s post I suggested above for a better list.

What I Skipped

This is looking specifically at WaitTimeInMilliseconds, which is one of two technically correct things to do.  It does not look at ResourceWaitTimeInMilliseconds, which is the other technically correct thing to do.  I wrote it one way and was happy with the results so I didn’t change it or clutter it with returning both.

Here’s the difference so you can decide if you want to change it or talk me into changing mine.  It’s a conversation I’d enjoy…I’m strange like that.

ResourceWaitTimeInMilliseconds is how long you’re waiting on your actual resource.  If you’re waiting for a lock, this is how long you waited for that lock to be made so you could get back in line for CPU time.

WaitTimeInMilliseconds also includes the time from when you get in line for CPU time until you’re using the CPU.  This additional time is called your signal wait time.

The reason I like using WaitTimeInMilliseconds is that the reason you gave up the CPU and had to get back in line is due to the fact you weren’t ready due to what you were waiting on.  In that way, it’s the whole picture.

The argument going the other way is that you were only waiting for the resource.  It’s not the resource’s fault there was a line all queued up at the CPU when it got there.  Why should I tack on extra time on a non-CPU wait for that?

I’m undecided on this part, but having the info both ways allows you to figure out the signal wait time on your entire instance and see if you have too much CPU contention.  There’s no arguing about that being a good thing to know.

Steal my query, make it your own, and write that part the way that works best for you.

When Idera Finds Out What I’m Doing….

Someone asked me for queries pulling from Diagnostic Manager, and I posted on Twitter when I made my Idera Diagnostic Manager Scripts page.  The outcry and rage from this was…very positive.

Vicky Harp (t), who is now Idera’s Corporate Strategist wrote pieces of Diagnostic Manager and its schema, took time out of her day to meet with me and talk about the details of querying the data they collect.

Scott Stone (t) who is Diagnostic Manager’s Product Manager also jumped in very quickly to see if I’d like to be a beta tester for them, bringing real power users into the loop to make their products better.

This is the level of support you want from your monitoring software.

While I don’t officially endorse or recommend any specific software at this time, I do recommend you look for responses like this.

 

Shrinking Database Log Files

Files in SQL Server need to grow as the database grows, and in very specific circumstances need to be shrunk as well with additional maintenance.  There is quite a bit different between log files and data files, so I created a separate post for Shrinking Database Data Files.

When should you shrink log files

First, if a large process that will never happen again caused your log file to grow to a ridiculous size then it’s reasonable to shrink the file down to about 150% of the most space you expect to use in the next year.  Cleaning this up will also cut down on restore times.

Second, if your log file grew in small increments then you’re left with a large number of VLFs like I talked about in file growths.  The only way to get rid of these is to shrink your log files down then have them grow in larger increments.  When I do this, I’m also growing them manually immediately afterwards in the same script.

What happens when you shrink log files

The log file will find a group of empty VLFs and delete them.  The end, your file is shrunk.

Of course there’s a little more to it than that.  The parts I’m most worried about are why did you need that space in the first place, why do you need the space freed up, and what happens when you need it again?

To understand why you needed it we need to talk about what a log file really is for a second.  It is every change made to your database since the last point-in-time where it didn’t need all the changes saved off.  This is typically either the oldest open transaction (so it can roll back any change if the transaction fails) or, if your database is in full or bulk-logged recovery, the last time you ran a log backup   It can also be due to other functionality such as database mirroring, but you typically have an experienced DBA on hand for that.

Why is this common to do?

There is typically at least once in every large database’s life where the log needs to be shrunk for one of a couple reasons.

Going off of the two reasons above, the first two are because a large process ran that won’t run again.  The third reason I go over here is because of small VLFs.

1. Log backups weren’t set up

The first large process is when a database is set up in full or bulk-logged recovery, but the log backups weren’t set up until the drive filled.  This is because the log needs to be backed up before it can be flushed, which means it will grow until it runs out of space.  It’s common to miss this when the backup plan needs you to specify which databases to back up, as opposed to doing all databases.  Once you take the first log backup it leaves you with a huge log file where you don’t expect to hit that size again, and it’s ok to shrink it.

2. Runaway transaction

The second large process is when someone runs something that didn’t go as planned.  A transaction, even an implied transaction for a single statement, stayed open forever, or close enough to forever to cause us a headache.  This caused a ton of excessive growth in the log files, and that space will only ever be used when someone says “oops” again.  In this case, evaluate if you’ll need that space for another process before you shrink because someone will say “oops” again someday.  If the drive is dedicated to log files and you only have one large database, consider shrinking it just enough to clear the warnings in your monitoring software.

3. VLF cleanup

The final reason is because you have too many VLFs.  When a log file is created or grows it splits up the new space into smaller chunks called Virtual Log Files that SQL Server will cycle through when writing logs.  If the database is growing in small increments it creates tons of these, slowing SQL Server down.  The point here is that you resolve the issue by shrinking the log (delete smaller VLFs), then grow the log to the desired size in larger chunks (create bigger VLFs).  The details on finding what size your VLFs are, what size new VLFs will be, and just about everything else VLF related is in my post Database Log VLFs.

What happens when log files grow?

If you’re going to shrink a log file then you’re increasing the chances that it will have to grow again, so we need to talk about growth for a second.

When a data file grows it can often take advantage of instant file initialization where it just reserves unallocated space on the disk, making it a quick process.  Without this being turned on or if you’re using Transparent Data Encryption, SQL Server would have to zero out the disks to get rid of data that used to be there.  Kimberly Tripp (b|t) goes into a lot more details than this in her post Instant Initializations – What, Why and How?.

Log files can’t use this feature.  They write those 0’s every time the log grows, and SQL Server is patient enough to wait for that to happen.  Here’s how that went over the last time I overheard one of my users “discussing” this with my biggest server.

SQL Server: My log’s full, hold on while I get some more space.

User: I don’t care, I just want my query to finish.

SQL Server: I’m told to grab 8,000 MB at a time, please be patient.

User: Not a problem, there’s 8,000 MB right there, please hurry.

SQL Server: I only use clean space, hold on while I ask the OS to write about 8,388,608,000 zeroes for me.  This isn’t a problem, I’m patient.

User: What?  NO!  Well, at least that gives me enough time to open a ticket for the DBA.

Ticketing System: Hello again, I haven’t seen you for several minutes.

The moral of this story is that you don’t want autogrowth to happen.  You want it turned on for emergencies and one-time growths, but it’s better to grow manually in a maintenance window.  Here’s how that maintenance window conversation goes:

DBA: I see you’ve hit 90% utilization on your log a couple times, how about we add 8,000 MB.

SQL Server: Oh, I didn’t notice, but I can add that space now.  I only use clean space, hold on while I ask the OS to write about 8,388,608,000 zeroes for me.  This isn’t a problem, I’m patient.

User: Zzzzzzzz….

DBA: There’s no rush, everyone can use the existing log space while you work on that.

Ticketing System: Zzzzzzz…..

The 8,000 MB I used here isn’t a made-up number, it’s a best practice due to VLF sizes on larger databases.  Most DBAs set autogrowth to less than 8,000 because they’d rather deal with smaller VLFs than irate users, and I’d recommend 1,000 MB for autogrowth on larger databases.  To be fair, that number drops to 500 MB starting in SQL 2012, but that’s still a lot of waiting during peak hours.

On smaller databases you aren’t going to want to grow log files by 8,000 MB off hours, or even 1,000 MB for autogrowth.  The growth should be something rather significant for the database yet size appropriate, and I typically like to do close to 50% of the current file size.  So, if I have a 500 MB log, I’ll set it to 250 MB autogrowth and do manual growths of the same size.  For smaller databases where 50% log growth would be less than 100 MB, I’m not always worried about manually growing the log files since there are diminishing returns on my time and effort.

My post Database Log VLFs gets into the details of why those are best practices and what it means to you.

Entry-Level Posts

Tim Ford (b|t) noticed how we moved to advanced topics and asked us to remember where we came from in his Entry-Level Content Challenge.  This is my second post to take on his challenge, and I hope others will take him up on this!

 

 

 

Database Log VLFs

Virtual Log Files (VLFs) split a physical database log file into smaller segments, which are required for how log files work in the background.  These are created automatically, but automatically doesn’t always mean perfect.  Here’s a practical view of what you need to know and how you can set them up properly.

What VLFs Do

My initial view of how a transaction log worked was that each change was written to a log file, and, as the documentation of those changes was no longer required, it was deleted from the log file.  I think this is a common way to view it, and it’s close to the truth.  However, it’s not close enough for a conversation on VLFs.

What really happens is that changes are written to the first VLF, which is just a segment of the log file.  When that VLF fills up, it moves on to the next VLF in a systematic order.

Each change made to the database is assigned a Log Sequence Number (LSN), and SQL Server keeps track of the oldest LSN it still needs for any purpose.  This purpose can be many things, but is typically the oldest LSN of the following:

  • Oldest LSN on a log backup (full or bulk-logged recovery)
  • LSN at the start of the oldest active transaction
  • LSN last replicated when using certain types of replication

Here’s my extremely complicated script to tell you what that reason is for your database:

SELECT name
	, log_reuse_wait_desc 
FROM sys.databases

If the last LSN in a VLF is older than what you need to keep, that VLF can be reused.  This makes is very efficient because this becomes something like a set-based operation for SQL Server, the entire VLF is cleared and marked for reuse at once instead of line-by-line.

To help you picture how the VLFs work, someone at Microsoft drew you a picture on Transaction Log Physical Architecture.  Then you can look at the results of DBCC LogInfo, and it will make a lot more sense when you see a VLF on each line there along with its status.

Only One Log File, Dedicated Drive

Adding more than one log file on a single database isn’t going to help your performance.  SQL Server is writing to a single VLF at a time, and that single VLF is part of a single log file.  If you have multiple log files then you’re writing to one while the other sits idle.  Once you fill up the last VLF on the first file it starts writing to the second file while the first sits idle.  If these are on separate drives that means each drive has to be able to handle the I/O, but it’s hit or miss if you’re using the I/O which leads to wasted resources and inconsistent performance.

As for how you’re writing to the log, most of the work done is writing to the tail end of it so spinning disks do really well just keeping the head in one place and writing to the file.  That is unless you have other types of files on this disk so the head writes a little to the log, jumps over to write a little to a data file, writes a bit more to the log, then jumps over to read a couple indexes.  This is a lot of where the recommendation to keep your data and log files on separate disks come from, but there is more to it than I’ll get into here.

However, we’re going to use RAID for redundancy making the drives too large for just a log, then we’ll put it on a SAN with a write cache, and do so many other things to make it more complicated.  If a server is extremely stressed and highly critical, look into dedicated spindles, SSDs, or other options.  On general shared disk arrays, it’s nice if you can have an array dedicated to just log files, even if it’s log files for multiple servers.

VLF Size Matters

When you’re writing to VLFs there can be two issues.  First, the VLFs are very small and SQL Server is jumping all over the place maintaining small files and figuring out where it should write the next entry – this is common and a big performance hit.  Second, the VLFs are too large and SQL Server is reserving a ton of space when only a small part of the tail end of the VLF is in use, then trying to clear it all at once – this is less common and less of an issue.

In addition to this, VLFs each need to be processed when restoring or recovering a database.  Recovering is part of restarting SQL services, so you should be hitting this issue at least once a month with your Windows Updates.  Every VLF adds some overhead to this process, and a huge number adds a lot of overhead to lengthen this process.

Details of this are on the Microsoft Customer Service and Support blog post How a log file structure can affect database recovery time, and it includes this eye-opening quote, “The first phase of recovering a database is called discovery where all the VLFs are scanned (in serial and single threaded fashion) before actual recovery starts.”

Are your log files pre-sized to take up the entire drive even though they only ever use 5% of that space?  Is that hurting you on recovery?

How Big Are My VLFs?

I have a script for that.  This is set to filter only logs that I want to look at, but you can comment out there WHERE clause on the final statement to see it all.  Also, it would be extremely rare for me to look at individual VLFs, so this is only looking at the sums and averages for each file.

I used to manually comment out a line on this depending on the version of SQL Server until I read a post on VLFs by Andy Galbraith (b|t) Counting Your VLFs, or, Temp Tables Inside IF…ELSE Blocks.  Thank you, Andy.  The timing of your post was perfect to help me make this easier for everyone.

CREATE TABLE #VLF_temp 
(
	RecoveryUnitID int
	, FileID varchar(3) 
	, FileSize numeric(20,0)
	, StartOffset bigint 
	, FSeqNo bigint 
	, Status char(1)
	, Parity varchar(4) 
	, CreateLSN numeric(25,0)
)

CREATE TABLE #VLF_db_total_temp
(
	DatabaseName sysname 
	, LogiFilename sysname
	, PhysFileName sysname
	, AVG_VLF_Size_MB DECIMAL(12,2)
	, vlf_count int
	, log_size_mb FLOAT
	, log_growth_mb FLOAT
)

IF (SELECT LEFT(cast(serverproperty('ProductVersion') as varchar),2)) in ('8.','9.','10') BEGIN
	ALTER TABLE #VLF_temp DROP COLUMN RecoveryUnitID
END

DECLARE db_cursor CURSOR READ_ONLY FOR 
SELECT name FROM sys.databases
WHERE State = 0

DECLARE @name sysname, @stmt varchar(40)

OPEN db_cursor

FETCH NEXT FROM db_cursor INTO @name

WHILE (@@fetch_status <> -1) BEGIN
	IF (@@fetch_status <> -2) BEGIN
		INSERT INTO #VLF_temp
		EXEC ('DBCC LOGINFO ([' + @name + ']) WITH NO_INFOMSGS')

		INSERT INTO #VLF_db_total_temp (DatabaseName, LogiFilename, PhysFileName, vlf_count, AVG_VLF_Size_MB, log_size_mb, log_growth_mb)
		SELECT DatabaseName = @name
			, LogiFilename = mf.name
			, PhysFileName = mf.physical_name
			, vlf_count = COUNT(*)
			, AVG_VLF_Size_MB = (AVG(FileSize)/1024)/1024
			, log_size_mb = (mf.size * 8)/1024 
			, log_growth_mb = CASE mf.is_percent_growth
									WHEN 1 THEN (mf.size * 8)/1024 * mf.growth/100
									WHEN 0 THEN  (mf.growth * 8)/1024 
									END
		FROM #VLF_temp vt
			INNER JOIN sys.master_files mf ON mf.database_id = db_id(@name) AND mf.file_id = vt.fileid
		GROUP BY mf.name, mf.physical_name,mf.size,mf.growth,mf.is_percent_growth;           

		TRUNCATE TABLE #VLF_temp           
	END

	FETCH NEXT FROM db_cursor INTO @name
END

CLOSE db_cursor
DEALLOCATE db_cursor

SELECT * 
	, USEDB = 'USE [' + DatabaseName + ']' 
	, CP = 'CHECKPOINT'
	, ShrinkZero = 'DBCC SHRINKFILE (N''' + LogiFileName + ''' , 0)'
	, GrowBack = 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', SIZE = ' + CAST(log_size_mb AS VARCHAR(20)) + 'MB ) --I split this up when over 10,000 MB'
	, ChangeGrowth = CASE --Note, never gets up to the 8000 MB I recommend for manual growths, but still 250 MB VLFs
						WHEN DatabaseName = 'tempdb' THEN ''
						WHEN log_size_mb <= 128 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 32MB )'
						WHEN log_size_mb <= 512 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 128MB )'
						WHEN log_size_mb <= 4000 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						WHEN log_size_mb <= 16000 THEN 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						ELSE 'ALTER DATABASE [' + DatabaseName + '] MODIFY FILE ( NAME = N''' + LogiFileName + ''', FILEGROWTH = 1000MB )'
						END
FROM #VLF_db_total_temp
WHERE (vlf_count > 50 
		AND Avg_VLF_Size_MB < vlf_count / 3) 
	OR (vlf_count > 250
		AND AVG_VLF_Size_MB < vlf_count / 2)
ORDER BY vlf_count DESC 

DROP TABLE #VLF_temp 
DROP TABLE #VLF_db_total_temp

What’s the Right Size?

The most common thing you’ll see on the internet is people talking about having too many with the limit being 1,000 VLFs and ideally at 50 VLFs.  I’m a little different and focus more on size than number.  The concept I’m going for is that excessive context switching is what causes me grief, and a reasonable size VLF will keep that under control.

Now there’s no magic number that will work noticeably better in all cases, so I won’t give you static numbers and say they’re magic.  What I can say is that too small is basically fragmentation with the overhead of context switching, too many increases recovery time, too big is huge chucks, and too few could cause extra growths as unused space is still being reserved.

It’s easy to visualize how smaller and excessive numbers of VLFs can cause issues, but that doesn’t mean huge VLFs are the way to go.  Tammy Richter Jones (b) gets into details with her post If > 50 VLFs are bad, then 16 is good, right?.  I’d love to see more on the affects of huge VLFs to have a more complete understanding if a 1 TB log file is better off with 50 20GB VLFs or 2,000 500MB VLFs.  I’m not convinced that >50 is bad.

Personally, I like formulas a lot more than static numbers, especially since they allow me to avoid extremes without stressing out.  The formulas I used broke it down into three categories:

  1. Below 50 VLFs, this is a reasonable number, they get a free ride no matter how big they are.
  2. Between 50 and 250 VLFs, the average size in MB has to be at least 1/3 of the count of VLFs.
    • For a file with 100 VLFs, the average VLF has to be over 33 MB.
  3. Over 250 VLFs, the average size in MB has to be at least 1/2 of the count of VLFs.
    • For a file with 500 VLFs, the average VLF has to be over 250 MB.

This is good enough for almost any database.  I don’t care about there being a couple small VLFs mixed in there because I’ll spend most of my time in the large ones.  I don’t care about the numbers getting larger because the average VLF is large enough to avoid too much context switching.  You’re not out to micromanage, so take a relaxed approach to this unless you have a reason to not be relaxed.

What’s Default?

Remember how I said there were no magic numbers?  Well, that doesn’t mean there aren’t wrong numbers.

By default, every database is based off of model, which has a 1 MB log files growing in 10% increments.  So 1/10th of a MB for the first growth, which is just a tiny fragment.  If you leave it this way on large databases you could end up with a 1 TB log trying to grow 100 GB at a time, which your users will notice.

Be default, you get the worst of both worlds!  The goal here is to avoid extremes, so you’ll want to change the growth rates away from the default.

How Do I Change the Size?

There’s only one way to change the size of your VLFs.  Delete them and recreate them.

For every database EXCEPT TempDB, this means shrinking the logs (it deletes VLFs to get to the size you want), then growing the logs again.  Both manual and automatic growths will split the new physical space into VLFs, but that depends on your version of SQL Server.

Here are the growth rates I pulled from Paul Randal’s post Important change to VLF creation algorithm in SQL Server 2014.

SQL 2012 and earlier uses this formula:

<= 64 MB growth is 4 VLFs

64 MB and <= 1 GB = 8 VLFs

1 GB = 16 VLFs

SQL 2014+ uses this formula:

Is the growth size less than 1/8 the size of the current log size?

Yes: create 1 new VLF equal to the growth size

No: use the formula above

So if you have too many VLFs you can shrink your log file.  It will delete VLFs that are not in use to make it smaller.  Then grow the file in increments that give you VLFs sized how you want them.  If you’re going for 500 MB VLFs then you grow your file 8,000 MB at a time.

The VLFs that weren’t deleted in this process, because they were in use or you didn’t shrink the file as far as you could have, will not be affected.  This means you’ll have different sized VLFs throughout your file, but that doesn’t really matter.  What does matter is that you don’t have any ridiculously large VLFs and you’re spending most of your time in properly sized VLFs.

Here’s how I do it.  Shrink it as much as possible.  Shrinking is never as easy as it should be, so verify it shrunk it a significant amount.  If it didn’t, take log backups and try again, repeating this process a handful of times if needed.  Then, if I want the log to be 32,000 MB, I just grow it by 8,000 MB 4 times.

--Run a log backup before this if database is not in simple and log backups run less frequently than every 5 minutes.
USE [DatabaseName]
GO
CHECKPOINT
GO
DBCC SHRINKFILE (N'LogicalFileName_Log' , 0)
GO
USE [master]
GO
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 8000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 16000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 24000MB )
ALTER DATABASE [DatabaseName] MODIFY FILE ( NAME = N'LogicalFileName_Log', SIZE = 32000MB )
GO

Assuming you were able to shrink this down to 10 VLFs remaining and you want it to be 32,000 MB, this would give you 74 VLFs with all new ones being 500 MB.  It goes over the 50 VLF limit I’ve seen elsewhere, but I’m very happy with these sizes and numbers.

With 8,000 MB growths the new 2014 calculation won’t kick in until the log was already 64,000 MB, and at that point an 8,000 MB VLF probably isn’t a bad thing.

Autogrowth Rates

I recommend manually growing larger log files between 8,000 and 16,000 MB at a time, but not autogrowths.  The process that causes a file to grow, and any other process that needs that space, will sit around and wait for an autogrowth to complete.  That could take a minute on descent disks, which is longer than many timeouts I’ve worked with.

Manually growing files is best, but almost every database will rely on autogrowth.  Know your timeouts, know how long it takes to grow a file, and size it accordingly.  I like to do 1,000 MB here, but will do as less if I have to.  If you can’t make a descent sized growth based on this, be more paranoid about manually growing files while leaving autogrowth set to smaller sizes.

The reason for 1,000 MB is that this is the largest size that will still give you 8 VLFs instead of being split into 16 VLFs, with the actual limit being at 1,024 MB.  If you did 2,000 MB, you’d still end up with 125 MB VLFs.  You can get that up to 250 MB each at 4,000 MB, but you’re telling your users to wait while 4,000 MB is zeroed out before they can move on.  Although the calculation changes for larger logs on SQL 2014, I’d still stick with 1,000 MB for a good autogrowth size.

Note, there is a resolved bug that would cause issues when log files grew in 4 GB increments, so you’ll see a lot of people using 8,000 MB instead of 8,192 MB.  We aren’t using an effected patch level anymore, but DBAs are made to be paranoid.  We’re not just lazy at math, although this doesn’t rule out that possibility.

TempDB

TempDB is different.  The database is recreated every time the SQL service restarts, so the log is recreated as well, following the formulas above.  If the log is over 1 GB, it starts out with 16 VLFs.  Jonathan Kehayias got into the details in his post TSQL Tuesday #11 – Misconceptions – The TempDB Log File and VLF Counts.

So TempDB never has too many on startup, but is this too few?  What if TempDB’s log is 40 GB, do you want a 2.5 GB VLF?  Is setting up TempDB to have 5 log files that are 8,000 MB each so it starts up with 500 MB VLFs a better idea?

Seriously, someone answer this one, I’m curious myself.

Changes Have Risk

The script above creates scripts to make changes.  It’s not perfectly safe to make those changes.  Test it, understand it, and be careful running it.

One of the most important pieces is that the scripts it generates will try to shrink the log as much as possible before growing it again.  In terms of cleaning up VLFs, this is perfect.  In terms of trying to run this during the day, it risks transactions failing because they’re trying to grab the next VLF as you’re running a shrink job to delete all unused VLFs.

You might be able to get away with this during the day most of the time, but it’s best to wait for a maintenance window to be safe.

What Others Did

Linchi Shea showed performance degradation from having too many small VLFs on Performance impact: a large number of virtual log files – Part I.  Updates can take almost 10x as long, and I would assume its due to the size and not number of VLFs.

SQLskills.com has blog posts that read like a series on this.  Kimberly Tripp wrote 8 Steps to better Transaction Log throughput, then followed that up with Transaction Log VLFs too many or too few?, which was updated to link to Paul Randal’s post Important change to VLF creation algorithm in SQL Server 2014.  All are must-reads if you want to really understand this topic.

Tony Rogerson (b|t) at the time of this writing has only put one post on his new blog, Transaction Log Concepts: Part 1.  If this is his typical quality of work, we have a lot to look forward to.  I’m personally looking forward to the other 3 parts of this 4 part series.

Grahm Kent (b|t) ran tests on the performance differences between two log files in his post Slow recovery times and slow performance due to large numbers of Virtual Log Files.  Both were 5 GB, but one was 16 VLFs (320 MB each) while the other was 20,480 VLFs (0.25 MB each).  It was good to see transaction performance as well as recovery performance covered.

Test It

If your database is absolutely critical to the point that squeezing an extra couple milliseconds of performance throughout the day would be noticed or if a server coming back online 15 seconds faster would save thousands of dollars, test it.  See how it performs in load tests and service restarts using your test servers and your database.

If you have a 1 TB log file and verified it needs to be about that big, what happens when you size it so you have 2,000 VLFs that are 500 MB each?  What happens when you have 100 VLFs that are 10 GB each?  Did it change recovery, any replication you’re doing, backup times, etc.?

Know that this can make a difference and that this is just another configuration change you can test out.  Crazy numbers on either side can be bad, and perfect numbers in the middle aren’t going to come from some stranger on the internet that never saw your server.

TempDB Excessive Memory Usage Example

I see TempDB using more memory than I feel it should and found a way to resolve it.  Previously I dove into technical details and questioning what I saw in my post TempDB Memory Leak? which is great for proving and debugging an issue.  Now I want to step back and talk about it from a practical perspective.

To keep it practical, I’m going to walk through a single server, step-by-step.  What I saw that caught my eye and how I applied my workaround.  This is a real, production server with screen shots from Idera Diagnostic Manager for both before and after comparisons.

What I’m Seeing

A large portion of my buffer pool is being utilized by TempDB pages which are unallocated on disk.  The easiest way to see this is this quick query:

SELECT Unallocated_MB = COUNT(1)/128
FROM sys.dm_os_buffer_descriptors bd
	LEFT JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
WHERE bd.database_id = 2
	AND au.allocation_unit_id IS NULL

The details of everything in memory can be seen on my post Query the Buffer Pool.

Since these are all pointing to real, but unallocated TempDB pages on disk, I was able to develop a workaround of simply having fewer TempDB pages on disk.  AKA, I made TempDB data files smaller.

Sizing TempDB

Here we’re talking about being hurt by TempDB being too large, but we can’t lose sight of TempDB being too small hurting you as well.  Specifically, when your server is the busiest with a small TempDB it will have to stop and grow the files, then deal with any fragmentation that process caused.  So, while I’m not going to recommend filling your dedicated drive, I’m not going to recommend making TempDB tiny, either.

My recommendation is to monitor your server, find out how much space TempDB uses, and make your TempDB significantly larger than that.  The longer you have watched your server, the better.  The goal is still to avoid almost any growth possible, but I’m adding in a “don’t go crazy” clause.

You can do this yourself as I talked about in Monitoring Database and Table Sizes using the function call FileProperty(<filename>, ‘space used’).  If you do it this way, make sure you have this running for a minimum of several weeks at least once an hour, day and night.  It’s possible, not difficult to do, and not recommended for servers that have a reasonable monitoring system already in place.

If you’re doing this to your production servers you should already have monitoring software that’s watching file sizes for you, even if it’s only tracking this in the background with no obvious way to find the information in the GUI.  This is the situation I find myself in using Idera Diagnostic Manager.  The data’s there, but the GUI is set up for more immediate concerns.  I need this query to get what I want out of it.

SELECT TOP 100 S.InstanceName
	, DS.UTCCollectionDateTime
	, D.DatabaseName 
	, DataUsed_GB = CAST((DataSizeInKilobytes + IndexSizeInKilobytes) / 1024 / 1024.0 AS DEC(20,1)) 
	, LogUsed_GB = CAST(LogSizeInKilobytes / 1024 / 1024.0 AS DEC(20,1))
FROM SQLdmRepository.dbo.DatabaseSize DS
	INNER JOIN SQLdmRepository.dbo.SQLServerDatabaseNames D ON DS.DatabaseID = D.DatabaseID
	INNER JOIN SQLdmRepository.dbo.MonitoredSQLServers S ON S.SQLServerID = D.SQLServerID 
WHERE UPPER(S.InstanceName) = 'SERVER\INSTANCE'  
	AND D.DatabaseName = 'tempdb'
	AND UTCCollectionDateTime > GetUTCDate()-180
ORDER BY 4 DESC

My recommendation is to take the peak size outside of a one-time process and make the total TempDB data files at least 150% of that size.  If you have 4 TempDB data files and the peak size used is 4 GB, 150% of that is 6 GB, so make each file at least 1.5 GB.  Then set autogrowth to something reasonable because one-time processes will happen.

My Before Observations

All of the screen shots and trending information are from Idera Diagnostic Manger, but you should be able to get the same basic information out of any monitoring software.

I had a server with rather low PLE and a lot of physical I/O.

Server_Oversized_TempDB

The physical I/O is difficult to see here because this server has a lot of writes.  When you look closer you can see that it rather regularly went to 400 page reads per second.

PhysicalIO_Oversize_TempDB

All of those reads caused lots of waits on the server.  In my 12-hour sample period I saw a total of 34,000.  The units shown for this field are fine for comparison reasons, but they appear to be a total of the ms/s on the chart for values collected every 6 minutes, not the total waits.  You have to query the tables behind Idera Diagnostic Manager to get the real values.  In this case, the total waits were 138 minutes for shared and 49 for exclusive latches, for a total of 187 minutes of waiting for the 12 hours.

Waits_Oversize_TempDB

Just seeing this I wanted to request more memory on the VM, after all it only had 16 GB.  However, I made it a rule to never add memory without justifying what’s in memory first.  To do that, I used my query on Query the Buffer Pool, and this was the first line in the results:

Cache_Oversized_TempDB

Of the 8.5 GB of memory used by the buffer pool, about 6 GB of that was used by unallocated space in TempDB.  I can accept some because of how TempDB caches temp tables and other stuff, but to have 70% of the buffer pool used up while I’m seeing low PLE and high I/O is not acceptable.

My Fix

I had the files set to practically fill the dedicated drive according to best practices, so each of the 4 files was 7,000 MB adding up to almost 28 GB on a 30 GB drive.

Using the query in the Sizing TempDB section above, this server has never used more than 1 GB of data in TempDB at one time.  While that makes for some interesting questions on how it’s using 6 GB of cache, it does give me a lot of room to reduce the size.

There were no complaints about performance due to what this server is being used for, so I didn’t try to shrink TempDB.  Instead I took the patient approach an resized TempDB, which took effect the next time SQL Services restarted.  Then I waited for our scheduled server reboots for Windows Updates.

ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'tempdev',SIZE = 512MB , FILEGROWTH = 128MB )
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'tempdev2',SIZE = 512MB , FILEGROWTH = 128MB )
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'tempdev3',SIZE = 512MB , FILEGROWTH = 128MB )
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'tempdev4',SIZE = 512MB , FILEGROWTH = 128MB )

Note that my recommendation above would have been to make the files at least 1.5 GB and I made them 2 GB here.  I’m not worried about exact numbers, I’m worried about being reasonable sizes that also avoid growths.  With peak observed usage at 1 GB and TempDB sized at 2 GB, any growths should be very rare.

Overall, the data files will grow at 512 MB at a time, which, when it happens, is large enough to avoid extreme fragmentation yet small enough to avoid extreme waits.  This should only happen during “I didn’t mean to do that” or “how’d it get that execution plan” moments, but it will happen.

My After Observations

After the change I waited a couple days to let the cache stabilize after a reboot.  We rebooted on Saturday, and all of these numbers are from the Tuesday after.

When I looked again, TempDB was using 1/10 of the space it was before for unallocated pages in memory.  It’s not perfect, but it’s much better.

Cache_RightSize_TempDB

TempDB using less cache allowed PLE be a lot more stable, regularly staying well above 2,000.  Using Jonathan Kehayias’s formula, PLE should be above 1,050 on a server with 14 GB of cache, so this is looking good.

Server_RightSize_TempDB

With data staying in cache more, physical I/O also dropped.  Before it was normal to see spikes to 400 page reads per second, now you’re seeing about half of that.

PhysicalIO_RightSize_TempDB

Less activity also meant less waits.  The couple spikes still went almost as high, but those are things you’ll fix more through tuning than memory management.  The totals at the bottom are what I’m focused on right now, the PageIOLatch_xx waits combined added up to about 11,000 for the day, so about 1/3 of what it was before.

Again, I don’t like how Idera Diagnostic Manager calculates the totals.  When I queried the tables I saw that there was 60 minutes of waits on shared latches and 13 minutes on exclusive latches, for a total of 73 minutes.  This is in comparison to 138 + 49 = 187 minutes before.

Waits_Rightsize_TempDB

The Downside

Even with the new, smaller size, TempDB is a lot larger than it needs to be so it would be very rare to see any growths in the near future.  However, I have to keep in mind that the load on my servers will always be increasing, so I may end up hitting autogrowth down the road.  Then I’ll reboot the servers each month with Windows Updates, it’ll shrink TempDB back down for me, and it will have to grow again the next month.

Because of this, I’ll need to visit every one of my servers from time to time (semi-annually?) to see if this is occurring and increase the size of TempDB if it is.  I’ll use the same script for determining the peak sizes because I have software that’s capturing that already, but I could find other methods to get that information if I needed to.

The only other downside I see is with servers that hit a high peak TempDB usage overnight then never come close to it again during the day.  This workaround won’t do much for those servers.  You need that space overnight, so trying to make the size smaller than what it needs at 2 AM will just make everything worse.  However, knowing about this will drive you nuts enough to look into the overnight processes to see if  you can lower peak usage.  Perhaps you can spread out some concurrent tasks or tune some bad queries that never mattered due to their timing.  It’s not a bad thing to do this work, it’s just that this probably wasn’t a good time for something to climb up your priority list.

Take a Look

Take a look at your servers.  How much cache is TempDB using?  Does TempDB ever even use half of the space allocated to it?  Is it causing issues on your servers?

Never make a change just because you saw someone say it’s a good idea, especially if they’ve never seen your servers before.  Look at the numbers on your servers for yourself, ask questions to understand it, and make the best decision for those specific servers.

If you do make this change, check TempDB before your next reboot.  Did it have to grow?  If it did, you need to change the size manually so it doesn’t have to hit an autogrowth every month.  If you shrunk it down to 2 GB total then it grew to 2.5 GB, consider making it 3.5 GB to avoid future growths while keeping the files reasonably sized.

If this was an issue on your servers and you feel SQL Server shouldn’t have functioned this way, please visit the Connect Item on this.  An up-vote is great, a comment or anything else is even better.

 

 

 

 

Tim Ford’s Entry-Level Content Challenge

All bloggers started as beginners.  We picked things up, learned them, and got to the point where we wanted to start sharing our knowledge.  The problem is that most of us moved on to talking about advanced topics and left people to fend for themselves on the difficult entry-level learning curve.  My blog, Simple SQL Server, was actually created with me saying I was going to fill that gap, right up until the time I figured out the best way to learn advanced topics was to write things down as I learned.

Entry-Level Content Challenge

Tim Ford (b|t) noticed how we moved to advanced topics and asked us to remember where we came from in his Entry-Level Content Challenge.

It’s very simple.  Create one blog post a month that someone just learning your job could understand and benefit from, use the #EntryLevel tag on your blog and notification on twitter, and link to his original post.

This is for everyone, I don’t care what job you have.  I’m a SQL Server DBA, Tim’s a SQL Server DBA, but every profession has beginners that could use some help.

We’re Outnumbered

One of the things, if not THE biggest thing, you hope to do with a blog is to make a difference.  Then you talk about advanced topics to help other people near your level keep moving forward.  You also realize that teaching the details on these topics accelerates your learning and confidence faster than you thought possible.  It’s great; I recommend everyone does it.

However, there are a couple problems with this logic.  By the time people feel confident enough to blog, they aren’t talking about the fundamentals anymore.  Also, for every advanced person in any field, there are many, many more entry-level people who could benefit from learning the fundamentals properly from the start.

Bloggers aren’t talking to the largest portion of their potential audience, who just happen to be going through the most difficult part of the learning curve.

It’s the Right Thing To Do

In no specific order, here’s why it’s the right thing to do.  This list contains selfish reasons, moral reasons, and everything in-between.

  • You want to reach a larger audience, and there’s more people who need this content.
  • You were a beginner once and learned off of others.
  • You wish you learned the fundamentals right from the start, but you didn’t always.
  • You learn everything you teach better, even if you consider it simple.
  • Tim challenged you, and it’s hard to say no to a challenge.
  • You always wanted to make a difference.

New Bloggers

This is also a great opportunity for new bloggers.  Talk about subjects you know well, even if you’re not at the level you feel comfortable teaching advanced topics.

The content is the only difficult part of blogging.  Setting up a blog and creating a post is practically identical to setting up a new email account and sending an email.  If you want to get into blogging and need help, let me know.

If Einstein Can Do It…

Einstein said “If you can’t explain it simply, you don’t understand it well enough.”  I took this quote to heart enough to name my blog after it.  Try explaining the fundamentals simply, you’ll see how much your understanding of the topic increases to accomplish that feat.

 

 

Querying Deadlocks From System_Health XEvent

It’s easy to query XEvents to see some of the basic info in deadlocks, including the system_health session which is already capturing this information by default.

This script will, by default, read from system_health.  However, change the name of one parameter at the start to have it read from any active session.  I pull the file name automatically, so the session has to be active to avoid many changes to the script.

The Script

DECLARE @SessionName SysName 

SELECT @SessionName = 'system_health'

/* 
SELECT  Session_Name = s.name, s.blocked_event_fire_time, s.dropped_buffer_count, s.dropped_event_count, s.pending_buffers
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE target_name = 'event_file'
--*/

IF OBJECT_ID('tempdb..#Events') IS NOT NULL BEGIN
	DROP TABLE #Events
END

DECLARE @Target_File NVarChar(1000)
	, @Target_Dir NVarChar(1000)
	, @Target_File_WildCard NVarChar(1000)

SELECT @Target_File = CAST(t.target_data as XML).value('EventFileTarget[1]/File[1]/@name', 'NVARCHAR(256)')
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE s.name = @SessionName
	AND t.target_name = 'event_file'

SELECT @Target_Dir = LEFT(@Target_File, Len(@Target_File) - CHARINDEX('\', REVERSE(@Target_File))) 

SELECT @Target_File_WildCard = @Target_Dir + '\'  + @SessionName + '_*.xel'

--Keep this as a separate table because it's called twice in the next query.  You don't want this running twice.
SELECT DeadlockGraph = CAST(event_data AS XML)
	, DeadlockID = Row_Number() OVER(ORDER BY file_name, file_offset)
INTO #Events
FROM sys.fn_xe_file_target_read_file(@Target_File_WildCard, null, null, null) AS F
WHERE event_data like '<event name="xml_deadlock_report%'

;WITH Victims AS
(
	SELECT VictimID = Deadlock.Victims.value('@id', 'varchar(50)')
		, e.DeadlockID 
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/victim-list/victimProcess') as Deadlock(Victims)
)
, DeadlockObjects AS
(
	SELECT DISTINCT e.DeadlockID
		, ObjectName = Deadlock.Resources.value('@objectname', 'nvarchar(256)')
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/resource-list/*') as Deadlock(Resources)
)
SELECT *
FROM
(
	SELECT e.DeadlockID
		, TransactionTime = Deadlock.Process.value('@lasttranstarted', 'datetime')
		, DeadlockGraph
		, DeadlockObjects = substring((SELECT (', ' + o.ObjectName)
							FROM DeadlockObjects o
							WHERE o.DeadlockID = e.DeadlockID
							ORDER BY o.ObjectName
							FOR XML PATH ('')
							), 3, 4000)
		, Victim = CASE WHEN v.VictimID IS NOT NULL 
							THEN 1 
						ELSE 0 
						END
		, SPID = Deadlock.Process.value('@spid', 'int')
		, ProcedureName = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)')
		, LockMode = Deadlock.Process.value('@lockMode', 'char(1)')
		, Code = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)')
		, ClientApp = CASE LEFT(Deadlock.Process.value('@clientapp', 'varchar(100)'), 29)
						WHEN 'SQLAgent - TSQL JobStep (Job '
							THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(Deadlock.Process.value('@clientapp', 'varchar(100)'),32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(Deadlock.Process.value('@clientapp', 'varchar(100)'), 67, len(Deadlock.Process.value('@clientapp', 'varchar(100)'))-67)
						ELSE Deadlock.Process.value('@clientapp', 'varchar(100)')
						END 
		, HostName = Deadlock.Process.value('@hostname', 'varchar(20)')
		, LoginName = Deadlock.Process.value('@loginname', 'varchar(20)')
		, InputBuffer = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
	FROM #Events e
		CROSS APPLY e.DeadlockGraph.nodes('/event/data/value/deadlock/process-list/process') as Deadlock(Process)
		LEFT JOIN Victims v ON v.DeadlockID = e.DeadlockID AND v.VictimID = Deadlock.Process.value('@id', 'varchar(50)')
) X --In a subquery to make filtering easier (use column names, not XML parsing), no other reason
ORDER BY DeadlockID DESC

System_health session

The system_health XEvent by default is turned on, captures deadlocks, and is limited to four 5MB files.  You can change all three of those, but it’s not common for people to do that.

Yes, you’re limited to more recent events, but the information is there.  If you find the limitations are too much, you have a couple options that should be considered in this order.

  1. Is anything else capturing deadlocks?  For me, I have Idera Diagnostic Manager capturing deadlocks as well for my production servers.  This is my main source of looking at deadlocks, so I actually rarely use system_health for this purpose in production.  The first goal is to avoid doing anything additional.
  2. Consider setting system_health to use bigger files (not too big) or have more rollover files. The more you do the more resources it will take to read it all at once, so don’t go crazy.  However, if you have 5 days of data when you wish you had 7, it’s not horrible to double the size of data you retain.  The second goal is to increase disk usage before you add more overhead.
  3. Consider a separate XEvent session.  Ugh…I wish I didn’t have to throw this out there as an option.  Love your servers, don’t add more overhead if you don’t have to.  The last resort is to not end up here at your last resort.

Idera Diagnostic Manager

At the time of me writing this, I have a script to look at all of the deadlocks in Idera DM that I use quite regularly for my production servers.  However, they just came out with a new major version and I want to make sure my query runs there before posting my script publically.  If you don’t care and just want the script, email me.

Viewing the Deadlock

The details of this script give you a great overview of the deadlock.  You know the code involved on each side, the tables and applications involved, etc.   It’s enough to look for trends, know if a deadlock was one you expected, and other basics.

However, this is just a summary list of your recent deadlocks, it’s not the details.  For me, I’m very happy just looking at the XML output to see the details.  This is how I look at it from XEvents and my monitoring software.  That makes me odd, and I’m ok with that.

Jonathan Kehayias (b|t) talks about viewing them in SQL Sentry Plan Explorer Pro in his post Graphically Viewing Extended Events Deadlock Graphs.  There’s no single right answer on how to view deadlocks, so do what works best for you.

Troubleshooting Deadlocks

This topic goes well beyond what I’m going for in this post.  Jonathan Kehayias gets into it some in his post The Accidental DBA (Day 29 of 30): Troubleshooting Deadlocks, which is just a start.  The book he co-authored with Ted Krueger (b|t), Troubleshooting SQL Server – A Guide for the Accidental DBA, is available as a free PDF and has an entire chapter on this subject.

 

Shrinking Database Data Files

Normal DBAs will sum up shrinking a database data file in one word: NO!

I’ll talk to you about this subject because I’m not normal.  Still, my goal is that you’ll go from “why can’t I shrink my files” to “I don’t want to shrink my files.”

Truth be told, even when I HAVE to shrink files, I don’t want to.

Basically, you needed the space at one point in time, so you’ll probably need it again.  Unless you changed a process or something extremely rare happened, you’ll use the space again and there’s no reason to take on the issues involved with shrinking just to have a file grow again.  While this advice is mostly the same for log files, it’s different enough where I’m making a separate post for shrinking database log files

When should you shrink data files

First, a process changed where you’ll never need that amount of space again, such as the database used to keep 10 years of history but now an automated process keeps it pruned to 6 months.  This change also left you with a very significant amount of free space that you need for another purpose.

Really, that’s about it.

What happens when you shrink data files

The most common way to shrink a file is to have it reorganize pages before releasing free space, so I’ll cover that.  It basically has two steps:

  1. Fragment the file as much as possible (killing performance)
  2. Truncate the end of the file

By “reorganize” it means take the tail end of the file after the size you said you’d like the file to be shrunk to, find every populated page of data after that point, then move those pages to the first available spot you find in the file.  When the end of the file is empty, truncate it.

It’s an expensive process that leaves a big mess behind, but there are times you’ll need to go through with it (kicking and screaming).  If you do, schedule an index maintenance job immediately after this to undo most of the damage you just did and make sure the server still loves you.

Think about this, you’re considering kicking off a process where you have to tell the server you’re sorry afterwards.  You better have a good reason for this.  A 100 GB database with only 2 GB used may be a good reason to shrink.  A 100 GB database with 75 GB used is normal and healthy.

If you are going to shrink, make sure the database won’t be using that space again.  If you have a 100 GB database with only 2 GB used, does it populate, process, and purge 90 GB in an overnight process?  Find out before you shrink.

Manual Shrinking

If I haven’t scared you off yet (and I hope I did) then here’s how you shrink.

Here’s the link how to shrink a file using T-SQL: DBCC ShrinkFile

Here’s the link how to shrink a file using the GUI: SSMS Shrink File

There are Shrink Database options as well.  I don’t use it and don’t recommend it.  You know what files you want to shrink, don’t shrink them all.  When this is used it’s typically because people want to shrink their data files, but they mess up their log files while their at it.  Be nice to yourself, don’t use shrink database.

What about the Auto-Shrink option

No.

Just no.

It’s not a feature, it’s a threat.

If you have a certain amount of free space it will automatically shrink your database causing all the problems I just discussed, but it will probably kick in during your busiest time of day.  Then it won’t follow that up with index maintenance, so it just left you hopping all over your disk to find data.  The best part is that it didn’t check to see if you would need that space again, so it’ll probably grow tomorrow to get that space back.

In the example above of a 100 GB database that only uses 2 GB during the day but populates 90 GB for overnight processing, those 2 GB will be horribly fragmented, you’ll spend the resources shrinking, then you’ll take the performance hit growing the database every night as it puts the growth anywhere it can which typically means physical file fragmentation as well.  In the mean time your monitoring software will show that you have tons of free space, right up until you get the disk full error.  You can auto-shrink that database and get all those issues, or you can just let it do its job as you do yours without a maintenance nightmare.

The Connect Item on this is marked as fixed, stating that they’ll consider removing it in future versions.  This wasn’t a joke opened by some wannabe DBA.  This was serious and opened by someone who went on to become the president of the Professional Association for SQL Server.

To be fair, we all word our feelings on this differently.  I say no, Tom LaRock says never, Paul Randal says Turn It Off!, and Brent Ozar is colorful enough to come up with the term Hamster Wheel of Death.

As of the time of this writing, no one has used the term “Magical Unicorn” to describe this feature.

Sum it up

  1. Don’t shrink data files.
  2. If you didn’t listen to #1, why?
  3. Follow it up with index maintenance.
  4. Every time you try to automate this process a unicorn dies.

Entry-Level Content Challenge

I’m taking Tim Ford’s (b|tEntry-Level Content Challenge, and this is my first post with the #EntryLevel tag.  The challenge is to have one post that earns that tag each month, but I hope to exceed that.

 

SQL Saturday Cleveland #SQLSat473

On February 6, 2016, Cleveland is hosting a free training event for SQL Server.  This has a lot of the great stuff from the big, paid events, and skips some of the negatives.

There’s a great team from the North Ohio SQL Server Users Group that took on a very difficult task to put all of this together.  They hand selected every presentation, being forced to turn away over half of the abstracts and many speakers.  What they came up with was a variety of topics that span the breadth of SQL Server and the experience levels of the professionals working with databases.

What is this SQL Saturday?

For those of you who have never been to a SQL Saturday, here’s how this one works.  Other SQL Saturdays will vary from this slightly, but not too much.

There are 6 training rooms with hour-long presentations running simultaneously in each.  You move from room to room between each session choosing the ones that apply to you the most.  The schedule for the sessions is posted in advance, and it’s impressive.

An optional $10 lunch, which is typically very obvious that no one is making a profit off of, is the closest thing the event has to an entrance fee. The stuff you’d expect to cost you is provided by volunteers and sponsors who pull together the venue, presenters, and community for you.

There’s time to network with vendors, attendees, and speakers before the event, during breaks and lunch, and often end up in conversations that last throughout the next session.  This differs from the big events in that it’s much smaller and personable with most attendees being relatively local.  The people you meet at a regional event like this are going to be more relevant to your life because they’re part of your local community.

The event at Hyland Software concludes with prizes ranging from $100 gift cards to new laptops being handed out to random attendees.  Yes, you can basically get paid to go to free training, which I still don’t quite understand.

Then there’s the after-party, also provided by the SQL Saturday volunteers and sponsors.  There’s food, fun, and more time to get to know some great people.

The venue can’t hold everyone who should come, and often can’t hold everyone who wants to come.  Register now to make sure there’s room for you, and make sure to cancel if your plans fall through.

What is SQL Saturday?

Not just this event, what is SQL Saturday as a whole?  It’s a program specifically set up to help everyone in the community to develop.

The obvious people benefiting are the attendees who get a chance to see all the sessions and meet the vendors.  You can see how they would benefit from here, but not quite how much until you’re at the event.

The less obvious are the speakers, some speaking publically for the first time.  As one of the speakers, I can personally say that I’ve grown more as a person and as a professional than I thought possible.  It’s a step I highly recommend, and one I’m very grateful to have with SQL Saturday.

The even less obvious are the vendors.  They’re speaking to the public, not just their established customers, getting candid feedback on what works, what doesn’t work, and how they can make their offerings better.  Even if they don’t make a single sale from attending the event, they can always come out ahead.

SQL Saturday didn’t just pop out of nowhere, and someone told me we should give thanks to the people who started it all.

Thank you:

Steve Jones (b|t)

Andy Warren (b|t)

Brian Knight (b|t)

These event are around the world with one or more occurring almost every weekend of the year as an annual event in most major cities.  See SQL Saturday’s website for a full list of what’s coming up.  If your city isn’t coming up soon, look in past events because most cities hold it about the same time every year.

Who’s speaking?

Where do I start?  Well, by listing EVERYONE on the current schedule, along with the blogs and twitter handles I’m aware of.

  • Deji Akomolafe (b|t)
  • Jim Arko (b|t)
  • Adam Belebczuk (b|t) <– Organizer
  • Chris Bell (b|t)
  • Delora Bradish (b|t)
  • Mindy Curnutt (b|t)
  • Aaron Cutshall (t)
  • Joey D’Antoni (b|t)
  • David Eldersveld (b|t)
  • Cassandra Faris (t)
  • Kevin Feasel (b|t)
  • Frank Gill (b|t)
  • Amy Herold (b|t)
  • Paul Hiles (t) <– Organizer
  • Steve Hood (b|t) <– What am I doing on a list this awesome?
  • Michael John
  • Jonathan Kehayias (b|t)
  • Dave Mattingly (t)
  • David Maxwell (b|t)
  • Evelyn Maxwell <– bright future ahead for her
  • Eugene Meidinger (b|t)
  • Kon Melamud (t)
  • Ben Miller (b|t)
  • Jeff Moden (b)
  • Colleen Morrow (b|t) <– Organizer
  • Matt Nelson (b|t)
  • Wendy Pastrick (b|t)
  • Pat Phelan (t)
  • Justin Randall (b|t)
  • Wayne Sheffield (b|t)
  • Peter Shore (b|t)
  • Warren Sifre (b|t)
  • Julie Smith (b|t)
  • Erin Stellato (b|t) <– Organizer
  • Michael J. Swart (b|t)
  • Allen White (b|t) <– Organizer
  • Andy Yun (b|t)

Note that the schedule may change some before the event, but it never changes too much.

What do we ask of you?

Show up.

Despite the amazing list of speakers, we’re all volunteers.  That means we do it because we want to do it, and truly enjoy it.  If you want to help us enjoy it even more, be involved.  Talk to us and the other attendees, ask questions, and give feedback.

If you want to absolutely blow us away, get in touch afterwards and let us know the difference it made for you.  Yes, we want this, it will absolutely make our day!  Let the organizers know what you think of the event.  Tell the presenters how you were able to use their advice at work, and even ask questions down the road.  Get in touch with the vendors to discuss their products, even if you aren’t planning a purchase, to give them real feedback and see what’s out there.  I used to think this was imposing, but now I realize this is the one of the most awesome compliments you can get.

The most important part….

Show up!

Hope to see you there!

UPDATE – 2016-01-13 – On the waiting list

If you registered before today, you’re in.  If you’re registering now, you’re on the waiting list.

If you can’t make it, please cancel your registration so someone on the waiting list can get in.  We’ll miss you, but at least there will be someone else we won’t be missing.

TempDB memory leak?

I found a bug where I’m seeing TempDB use more memory than it should on multiple versions of SQL Server, especially on servers set up with common best practices. There’s a workaround that has a profound affect on server performance, adding to my belief that this is a legitimate bug and leading me to open a Connect Item on the issue.

Querying the Buffer Pool

I have a query to show me what’s in the buffer pool for the entire server, showing the expected results as well as excessive space being used by TempDB.  The newest version showing these details is on my post Querying the Buffer Pool.

It shows number of pages in the buffer pool grouped by the database, table, and index.  The query makes use of left joins so it can see space in memory that’s not currently allocated to a specific object.

The results are surprising in many ways.

The good surprises are seeing what indexes are hogging up your buffer pool so you have an idea of where to start tuning.  I’m a huge fan of this and have blogged about it in Cleaning Up the Buffer Pool to Increase PLE, although the name of my older post is misleading because it does more than just help memory management in SQL Server.

The Bug

The bad surprise was a bug which has been harassing me for quite some time now.  As I mentioned, the query will return all the space in the buffer pool, specifically the contents of sys.dm_os_buffer_descriptors, and does a left join to the tables leading up to and including sys.indexes so space not currently allocated to a table will show up.  The problem is that the space that shows up as unallocated for TempDB is much larger than expected, in this case taking up 1/3 of my buffer pool.

QueryBufferPool_TempDB

On this post I’m talking about a single server, but the problem wasn’t limited to a single server.  It showed up at the same time, caused by the same change (implementing a common best practice), partially resolved by the same partial rollback (undoing the best practice) on SQL 2008 R2, SQL 2012, and SQL 2014.

Details About the Bug

So the query I have on yesterday’s post, Querying the Buffer Pool, showed I had unallocated space in TempDB in memory, and a lot of it.  However, it doesn’t show details.

To start looking at the details, what kind of pages are these that exist in sys.dm_os_buffer_descriptors, but not in sys.allocation_units?

SELECT bd.page_type
	, MB = count(1) / 128
FROM sys.dm_os_buffer_descriptors bd
	LEFT JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
WHERE bd.database_id = 2 --TempDB
	AND bd.is_modified = 0 --Let's not play dirty, only clean pages
	AND au.allocation_unit_id IS NULL --It's not even allocated
GROUP BY bd.page_type 
ORDER BY 2 DESC

TempDB_BufferPool_Unallocated

Ok, so we’re dealing with typical data in TempDB.  Well, other than it not being allocated, of course.

So I run another query to get more details.  This time I want to look inside the pages to see if they tell a different story.

SELECT TOP 100 bd.*
FROM sys.dm_os_buffer_descriptors bd
	LEFT JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
WHERE bd.database_id = 2 --TempDB
	AND bd.is_modified = 0 --Let's not play dirty, only clean pages
	AND au.allocation_unit_id IS NULL --It's not even allocated

TempDB_PageLevel_Unallocated

Then I follow that up with Paul Randal’s How to use DBCC PAGE, which comes with all the disclaimers about using an undocumented and unsupported trace flag and command.  This one isn’t horrible in my mind or Paul’s comments, but remember the undocumented and unsupported parts.

DBCC TRACEON (3604);

DBCC PAGE (2, 5, 502219	, 0)
DBCC PAGE (2, 5, 374929	, 0)
DBCC PAGE (2, 5, 69868	, 0)
DBCC PAGE (2, 5, 453687	, 0)
DBCC PAGE (2, 5, 214988	, 0)
DBCC PAGE (2, 5, 440966	, 0)

DBCC TRACEOFF (3604);

The results all looked about the same to me.

DBCC_Page_Results

There are several important parts to me.  The m_objId is a negative value I can’t find in TempDB.sys.objects, so it WAS a temporary object that no longer exists.  Across the board, these are “NOT ALLOCATED”, “NOT CHANGED”, “NOT MIN_LOGGED”, “0_PCT_FULL”, so there’s nothing there.

To me it looks like temp objects made it into memory and remained in memory after the temporary objects were dropped.  I have no idea what objects these were or how they were dropped, but I’m imagining these were temp tables automatically dropped when the session was either closed or reset.

A Recent Change (A CLUE)

I found this by noticing that PLE for several servers was lower now than it has been in the past, so I was peeking in the buffer pool to see who was playing nice.  Going off of “when did PLE start to be lower” I noticed that I implemented a change around that time to use a common best practice.

That change was presizing TempDB data files to take up a vast majority of the dedicated LUN instead of letting them grow as needed.  It avoids waiting for file growth, especially if you’re using TDE (I’m not) and can’t use IFI (I can), but for several other reasons as well, including file fragmentation and the slight pause even IFI causes.  So at the start of all these festivities, I took the 4 TempDB data files from 100 MB each to 12 GB each, using up 48 GB of the 50 GB available.

A Workaround

Seeing this, I wanted to partially roll back the change the next opportunity I had.  100 MB was too small and I was aware that it invoked file growths every month (we reboot monthly for OS updates).  48 GB wasn’t right though, we just have that much space on the drive due to server build standards and paranoia (I’m a DBA).  So I went through our Idera Diagnostic Manager monitoring software and found the most space TempDB used, which is captured once an hour.  I found that 4.8 GB was the peak usage with several incidents of usage going over 4.5 GB.

With that information available and still not wanting an autogrowth for all the reasons listed above, I decided that all 4 files should be 1.5 GB, so 6 GB total.  That means peak usage was about 75% full, leaving plenty of room for error, especially with my baseline only being captured once an hour.  Autogrowth is set to 256 MB, so it’d add 1 GB total each growth.  I can live with that.

I can’t say it eliminated the issue because I still have 2 GB of unallocated TempDB space in cache, but it’s better than 8 GB.  It can be considered more acceptable than other issues I need to tackle right now, but it still bugs me.

What’s the Best Practice?

It’s a best practice to have TempDB data files on their own LUN, drive, array, however you want to word it.  Then it just make sense to have the total size of your data files add up to 90% or more of the drive size.  I see this advice everywhere, with these two standing out:

  • Solar Winds – Configuration Best Practices for SQL Server Tempdb–Initial Sizing
    • “Next, if you can give tempdb its own disk, then configure it to almost fill the drive. If nothing else will ever be on the drive, then you’re better off setting it to be larger than you’ll ever need. There’s no performance penalty, and you’ll never have to worry about autogrow again.”
  • Brent Ozar – SQL Server 2005/2008/2012/2014 Setup Checklist
    • “Notice that I don’t have filegrowth enabled.  You want to proactively create the TempDB files at their full sizes to avoid drive fragmentation.”

Jonathan Kehayias does it a little bit differently in his post SQL Server Installation Checklist saying to add space to TempDB files in 4 GB increments.  Although he doesn’t fill the drive by default, this isn’t mentioned by him, either.

Now I need to be perfectly clear on this, I trust these three sources.  I trust Jonathan and Brent more than I trust myself with setting up SQL Server.  I also feel the same about the authors I know on the Solar Winds post.  This does not change that.

Sizing TempDB like that often means it’s much larger than you need.  The workaround I’m using is to right-size these files instead.  For me, for now, I’m going to stick with seeing how large TempDB gets and make it slightly larger than that until I have a solid answer to my problem.

What Was It?

I still don’t know.  The workaround managed to knock it off of my priority list enough where I’m not actively working on it.  However, my drive to understand SQL Server better won’t leave me alone.

This post is my solution.  I have some very intelligent people reading this who I hope will at least lead me further down the rabbit hole, even if they don’t have a conclusive answer.  There’s a good chance I’ll be asking for help on Twitter with #sqlhelp or opening a connect item on this, for which I have a very well documented description of the issue that I can link to.

Updates:

2016-01-06 – Caching of Temporary Objects

Due to a comment, I started looking into the caching of temporary objects to see if this was the root cause.  The comment specifically mentioned Paul White’s (b|t) post Temporary Object Caching Explained, and I also read over Itzik Ben-Gan’s (b|t) post Caching Temporary Objects.

Both of these left me with the impression that smaller amounts of data would be left in the cache linked to temporary objects linked to the proc cache.  What I’m seeing is large amounts of data in the buffer pool that did not drop when I ran DBCC FREEPROCCACHE (on a test server that wasn’t in active use) as I expected if this was the full explanation.

While it’s very likely this is related to the issue on hand, I’m not ready to accept it as a full explanation.  If the memory associated with TempDB dropped when clearing the proc cache (on a test server) then it would have been a great explanation with a poor side effect of going too far with the memory being used.

2016-01-07 – Opened a Connect Item

I mentioned this issue on the comments of Paul White’s blog post mentioned in the last update and comments below on this post.  His response concluded with this:

So, when memory pressure is detected, I would expect memory use like this to be freed up for reuse in pretty short order, by design. If it is not, and bad things happen because memory for unallocated tempdb is not released/reused, that would be a bug.

While I was already leaning that way, it pushed me over the edge to decided it was time to open up a connect item on this issue.  I feel it’s well worth the read going to Paul’s post and the connect item.  Also, if you’re seeing this as well, an upvote on connect is very appreciated.

https://connect.microsoft.com/SQLServer/feedback/details/2215297

Query the Buffer Pool

DBAs are known for asking for more memory, but often can’t say what’s in memory.  While I agree that many database servers can use more memory, I’m a firm believer in knowing how you’re using your resources before asking for more.  The script below allows me to do just that.

What It Returns

This will return every index that is using at least 1 MB of memory for every database on your server.  It also returns space in memory that is associated with unallocated space in the tables which shows up as NULL for everything except the size of the space and the table name.

I’ll warn you now that the unallocated space can be surprisingly high for TempDB, and I talk about that in TempDB Memory Leak?.  Hopefully we can get a good comment thread going on that post to talk through what we’re seeing and how common the issue really is.

The Script

IF OBJECT_ID('TempDB..#BufferSummary') IS NOT NULL BEGIN
	DROP TABLE #BufferSummary
END

IF OBJECT_ID('TempDB..#BufferPool') IS NOT NULL BEGIN
	DROP TABLE #BufferPool
END

CREATE TABLE #BufferPool
(
	Cached_MB Int
	, Database_Name SysName
	, Schema_Name SysName NULL
	, Object_Name SysName NULL
	, Index_ID Int NULL
	, Index_Name SysName NULL
	, Used_MB Int NULL
	, Used_InRow_MB Int NULL
	, Row_Count BigInt NULL
)

SELECT Pages = COUNT(1)
	, allocation_unit_id
	, database_id
INTO #BufferSummary
FROM sys.dm_os_buffer_descriptors 
GROUP BY allocation_unit_id, database_id 
	
DECLARE @DateAdded SmallDateTime  
SELECT @DateAdded = GETDATE()  
  
DECLARE @SQL NVarChar(4000)  
SELECT @SQL = ' USE [?]  
INSERT INTO #BufferPool (
	Cached_MB 
	, Database_Name 
	, Schema_Name 
	, Object_Name 
	, Index_ID 
	, Index_Name 
	, Used_MB 
	, Used_InRow_MB 
	, Row_Count 
	)  
SELECT sum(bd.Pages)/128 
	, DB_Name(bd.database_id)
	, Schema_Name(o.schema_id)
	, o.name
	, p.index_id 
	, ix.Name
	, i.Used_MB
	, i.Used_InRow_MB
	, i.Row_Count     
FROM #BufferSummary AS bd 
	LEFT JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
	LEFT JOIN sys.partitions p ON (au.container_id = p.hobt_id AND au.type in (1,3)) OR (au.container_id = p.partition_id and au.type = 2)
	LEFT JOIN (
		SELECT PS.object_id
			, PS.index_id 
			, Used_MB = SUM(PS.used_page_count) / 128 
			, Used_InRow_MB = SUM(PS.in_row_used_page_count) / 128
			, Used_LOB_MB = SUM(PS.lob_used_page_count) / 128
			, Reserved_MB = SUM(PS.reserved_page_count) / 128
			, Row_Count = SUM(row_count)
		FROM sys.dm_db_partition_stats PS
		GROUP BY PS.object_id
			, PS.index_id
	) i ON p.object_id = i.object_id AND p.index_id = i.index_id
	LEFT JOIN sys.indexes ix ON i.object_id = ix.object_id AND i.index_id = ix.index_id
	LEFT JOIN sys.objects o ON p.object_id = o.object_id
WHERE database_id = db_id()  
GROUP BY bd.database_id   
	, o.schema_id
	, o.name
	, p.index_id
	, ix.Name
	, i.Used_MB
	, i.Used_InRow_MB
	, i.Row_Count     
HAVING SUM(bd.pages) > 128  
ORDER BY 1 DESC;'  

EXEC sp_MSforeachdb @SQL

SELECT Cached_MB 
	, Pct_of_Cache = CAST(Cached_MB * 100.0 / SUM(Cached_MB) OVER () as Dec(20,3))
	, Pct_Index_in_Cache = CAST(Cached_MB * 100.0 / CASE Used_MB WHEN 0 THEN 0.001 ELSE Used_MB END as DEC(20,3))
	, Database_Name 
	, Schema_Name 
	, Object_Name 
	, Index_ID 
	, Index_Name 
	, Used_MB 
	, Used_InRow_MB 
	, Row_Count 
FROM #BufferPool 
ORDER BY Cached_MB DESC

Where’d the Script Come From

I’ve had a script similar to this one around for a while.  It’s originally based off of Jonathan Kehayias’s script on his post Finding What Queries in the Plan Cache Use a Specific Index, and I couldn’t have done this without having his script to start with.

Then I originally posted a version of this script on my post Cleaning Up the Buffer Pool to Increase PLE, which was great to see the index usage for a single database.  It runs slower than this, only returns a single database, and does not show unallocated space in memory.  Those changes warranted either an update to that post or a completely new post…I opted for the latter.

What It Means

Now you can see what’s in your memory. Hopefully you’ll see one or two things that stand out on here that don’t make sense; those are your easy tuning opportunities.

If an index is 100% in cache then you’re scanning on it, and that may be an issue.  Yes, you can find when you did scans on indexes using the scripts in my Indexes – Unused and Duplicates post, but it helps to have another view of what that means in your memory.

One thing the index monitoring scripts in the post I just mentioned can’t do is tell you when you’re doing large seeks as opposed to small seeks.  With the typical phone book example, you could ask for all the names in the phone book where the last names begins with anything from A to Y, giving you 98% of the phone book as results.  Index usage stats will show you did a seek, which sounds efficient.  The script on this post will show that you have 98% of your index in cache immediately after running the query, and that gives you the opportunity to find the issue.

When you see an index that looks out of place here, dive back into the scripts on Cleaning Up the Buffer Pool to Increase PLE to see what’s in cache using that index.  If the query isn’t in cache for any reason, you may be able to look at the last time the index had a scan or seek against it in sys.dm_db_index_usage_stats and compare that to results from an Extended Events session you had running to see what it could have been.

The main point is that you have something to get you started.  You have specific indexes that are in memory, and you can hunt down when and why those indexes are being used that way.  It’s not always going to be easy, but you have a start.

We’re All On a Budget

It’s not too uncommon for this process to end in asking for more memory, and I view memory like being on a budget.  The amount of memory you have right now is your current budget.  Asking for more memory should be viewed like asking for more money in a financial budget.  For a financial budget increase, here are the questions I’d be prepared to answer:

  1. What did you spend the money we already gave you on?
  2. Did you spend that money as efficiently as possible?
  3. What else do you want to spend money on?

Now you can answer these questions in database form:

  1. Here’s what I have in cache at multiple times, specifically right after PLE dropped.
  2. I went through the queries that pulled the data into cache and tuned what I could.
  3. When I checked what’s in cache multiple times, these indexes fluctuated a lot in how much was in there.  I believe adding more memory would allow them to stay in cache instead of kicking each other out to make room.

Be it Virtual or Physical environments, there’s only so much memory that can be made available to us.  We’re on budgets of how much memory the host has, how many memory slots a server has, and how large the memory chips are that those slots can handle.  Prove you’re doing it right and it’s a lot harder to say no to you.

I have an odd habit of getting the memory I ask for because I answer these questions up front in the initial request for memory.

Thank You for More Than You Realize

Blogging and presenting have changed me for the better, and in ways I didn’t see coming.  However, I know I wouldn’t have had them if it wasn’t for three things.  Getting it all started, having help getting established, and having an audience for my work were supported by several groups of people, with a couple people really standing out.

Getting it started

This all started because Tom Zimmerman noticed I had an answer for everything, even if I didn’t know it right away.  I learned which bloggers and forums to trust, and would look up any issue quickly and easily.  However, he pointed out that I was a “leech”, always taking and never contributing.

Tom was right, so I eventually started what I thought was an altruistic path to find ways to contribute.  I settled on blogging even though I thought it would be complex to set it up and maintain.  Although it turned out it was as easy as writing an email, I wouldn’t have gotten past the hurdle of starting if it wasn’t for Tom.

Thank you, Tom.

Establishing the blog

So I created my blog, decided that wording things towards more junior level DBAs was the best thing for me and an area that could use more content, and was on my way.  However, I didn’t know how to promote myself, get people to read what I wrote, or really much of anything.  If only a couple people were reading it then I would have given up, quickly.

To get established, as I blogged in the past, I was very fortunate to come across Brent Ozar’s blogs on bogging and presenting.  To me it’s even more significant than his work blogging SQL Server.  Some of the stuff he talked about:

  • Having your blog reach established audiences.
  • Improve search engine results with SEO.
  • Writing consistently to establish your own audience.
  • Picking an audience and talking to a person (2010 version of me).

People were reading my syndicated work from the start, and some of my posts were creeping their way up in the search engines to get readers there, too.  My words were being heard, and I needed that to keep going.

That part worked but he was also talking about presenting.  How to do it, how to get better at it, where to do it, and how it changes you.  Well, if one part worked for me, lets try the next part.  That, too, worked out rather well.

Thank you, Brent.

Having an audience

So I mentioned syndicating my blog and doing presentations, but those both required an established audience where unknown people were accepted and encouraged to participate.  For that I found a couple places.

SQL Server Central and Toad World (formerly SQLServerpedia.com) were the two places to syndicate that I found very early on, and to this day the only two I use.

SQL Server Central is the more important of the two for me for several reasons.  I used and abused this site in the past to the point I was called a leech, so it was an honor to be there.  Then they do daily and weekly emails highlighting blogs and articles with both of them selecting a couple blog posts syndicated to SQL Server Central.  To this day it makes my day seeing my work on there.  The vote of confidence and boost in readers is a real boost to my morale and dedication to write more, better content.

You can’t discount Toad World which is slow to get reads, but gets many of them over time.  It’s a lot like the SEO work for search engines where you need to do good work and be patient.  I didn’t start with patience (still not doing too well there, either), but I’m very glad I have this going as well.

Then there were presentations, which I jumped right into doing at SQL Saturdays without ever presenting to a user group first.  I didn’t do too well the first time or two, but I’m also an introvert to the point that my oldest brother told me he would have bet money against me publically speaking.  However, the audience was there, the organizers encouraged new people to participate, and I was able to refine myself to the point of standing toe-to-toe with names I’ve known for years.  Better yet, now they’re not names to me, they’re people I’ve met and continue to know better.

Yeah, you know that these things started somewhere by a couple people, but I never put much thought into that.  Not until I read a post thanking Steve Jones.  SQL Server Central and SQL Saturday, the two places I thank the most for hooking me up with an audience, were both started by the same person.  Sure, he had help, such as Andy Warren and Brian Knight for SQL Saturday, but he was key in starting two organizations that are still key to me.

Thank you, Steve.

Skipping people

Oh the people I would name if I thought I could name them all.  Coworkers, mentors, SQL Saturday volunteers, bloggers, presenters, readers, attendees, and more should all be mentioned.  Even if I was overconfident enough to think I could name them all, the readers would move along like people in a theater during the credits.  To keep it short and simple….

Thank you.

Where am I now?

I am SOOO glad that I had all of this established.  This past year has been the worst.  I wish that last sentence needed a filter or something saying it was the worst for certain areas, but it was pretty all-inclusive.  Work, home, life in general beat me down to nothing and I needed to be picked back up.  I didn’t blog for 17 months, skipped submitting to some SQL Saturdays I love, and in general completely fell apart.

When I started to get it back together, I had some things established that helped more professionally than I could have ever hoped for.  My blog was still pulling in views, especially with being in the top 5 search results for a common search term on Google.  I didn’t really have an established audience for my blog because I was inconsistent before I took 17 months off, but it was syndicated to an audience happy to see me return.  SQL Saturdays were coming up waiting for an abstract.

My confidence was just waiting for me to say that I’m back.  So, confidence, I’m back.

Now I’m ending the year with a blog that has 125,000 lifetime views not counting syndication that takes it closer to a total of 200,000.  I’m also pulling together a presentation for SQL Saturday Cleveland that I’ll blog about in January, and I’ll pin their speaker evaluation forms on my cube wall, too.

I have goals of blogging better to make a bigger difference for the readers, blogging more consistently to earn followers, and do more with presentations to reach more people with better content.  The details I have for these goals that I didn’t list here aren’t easy, but it’s not easy to describe what these accomplishments do to a person, either.

I understand what it really means to me, especially when I’m down.  Hopefully, I’ll never need it like that again, but I’m so glad all of this was there when I did.

Thank you.

Advent of Code – Day 4

Day 4 of the Advent of Code has us mining for data in MD5 hashes, which is out of the comfort zone for a SQL DBA to say the least.

Some would say it’s a poor use for SQL Server, and I’d have to agree.  However, sometimes things just need to get done, so I’m playing along with a single tool approach.

Now SQL Server isn’t supposed to do well with row-by-row processing, right?  Andy Warren was posing the same question in his post pondering possible solutions to this exact problem. Well…the problem is that row-by-row processing is typically referencing a table every row.  Earlier today I mentioned that I didn’t have a good example of set based being slower, but now I found a way.  Sure, some could say that it’s poor examples on poor ways to use SQL Server, but come on…we can all find production code that meets that criteria.

So, typical DBA fashion, I made it set-based evaluating values in batches of up to 50,000.  It worked.

Here’s my code for Part 2.  It’s so close to Part 1 that you’ll have trouble telling the difference.

IF OBJECT_ID('TempDB..#Numbers') IS NOT NULL BEGIN
	DROP TABLE #Numbers 
END

CREATE TABLE #Numbers
(
	ID INT IDENTITY(1,1) PRIMARY KEY
	, b BIT
)

DECLARE @StartString varchar(100) = 'iwrupvqb'
DECLARE @RowCount Int 

INSERT INTO #Numbers (b)
SELECT 0

SELECT @RowCount = @@RowCount 

WHILE 0 = (SELECT COUNT(1) 
			FROM (SELECT TOP (@RowCount) ID 
					FROM #Numbers 
					ORDER BY ID DESC
				) X 
			WHERE CONVERT(VarChar(10), HASHBYTES('MD5', @StartString + CAST(ID as VarChar(1000))), 2) LIKE '000000%') 
	BEGIN

	INSERT INTO #Numbers (b)  
	SELECT TOP (50000) b
	FROM #Numbers
	
	SELECT @RowCount = @@RowCount 
END

SELECT ID = MIN(ID)
	, MD5_Hash = CONVERT(VarChar(1000), HASHBYTES('MD5', @StartString + CAST(min(ID) as VarChar(1000))), 2) 
FROM (SELECT TOP (@RowCount) ID 
		FROM #Numbers 
		ORDER BY ID DESC
	) X 
WHERE CONVERT(VarChar(10), HASHBYTES('MD5', @StartString + CAST(ID as VarChar(1000))), 2) LIKE '000000%'

Then I realized I didn’t even need a table and just ran the code. As you can see, the code was much easier to write, and it also worked. More importantly, it worked in 36 seconds as opposed to 113 seconds in the set-based approach for part 2. Part 1, in case you’re wondering was 1 second compared to 3 seconds, so the non-set based was still about 3 times faster.

DECLARE @i Int = 0
	, @StartString varchar(100) = 'iwrupvqb'

WHILE CONVERT(VarChar(10), HASHBYTES('MD5', @StartString + CAST(@i as VarChar(1000))), 2) NOT LIKE '000000%' BEGIN
	SET @i = @i + 1
END

SELECT @i 

I’m saving the link to my own post to give to others as an example of when set-based is worse. Hint: it’s ONLY because I wasn’t referencing a table in the loop, and I could probably still find a way to make it more efficient in a set.  It’s still true that row-by-row would be slower, I’m cheating by not using rows.

Advent of Code – Day 3

Continuing with the language-independent code challenge on Advent of Code – Day 3, I’m tasked with figuring out how well a drunken elf can guide Santa to houses.  Since his instructions had Santa stumbling into the same house multiple times they then added in a robot to help Santa using the same instruction set, which went on to prove automation can be harmful if not done properly.

Wow, I bet you never thought you’d read that paragraph…

The elf gave Santa instructions on how to get around a grid of houses using ^, v, <, and > to point the way.  When that failed, they tried again with a robot assistant using the same instructions with Santa taking the odd numbered instructions and his automated assistant taking the even numbered instructions.

My code did not have to change much to partially automate Santa’s job, which is actually kinda disturbing.  However, since it didn’t work out so well I went back and added a parameter for them to easily roll back the change.  Now it’s easier to go back to happy and jolly instead of cold and…terminator.

SET NOCOUNT ON

IF OBJECT_ID('TempDB..#visited') IS NOT NULL BEGIN
	DROP TABLE #visited
END

DECLARE @input varchar(max) = '>^^v^<>v<<<v<v^>>v^^^<v<>^^><^<<^vv>>>^<<^>><vv<<v^<^^><>>><>v<><>^^<^^^<><>>vv>vv>v<<^>v<>^>v<v^<>v>><>^v<<<<v^vv^><v>v^>>>vv>v^^^<^^<>>v<^^v<>^<vv^^<^><<>^>><^<>>><><vv><>v<<<><><>v><<>^^^^v>>^>^<v<<vv^^<v<^<^>^^v^^^^^v<><^v><<><^v^>v<<>^<>^^v^<>v<v^>v>^^<vv^v><^<>^v<><^><v^><><><<<<>^vv^>^vvvvv><><^<vv^v^v>v<<^<^^v^<>^<vv><v<v^v<<v<<^^>>^^^v^>v<><^vv<<^<>v<v><><v^^><v<>^^>^^>v^>^<<<<v><v<<>v><^v>^>><v^^<^>v<vvvv<>>>>>^v^^>v<v<^<vv>^>^vv^>vv^^v<<^<^^<>v>vv^v>><>>>v^>^>^^v<>^<v<<>^vv>v^<<v>v<<><v>^vvv<v<vvv^v<vv<v^^^>v><<^<>><v^^>^v^>>^v<^<><v<>>v^<>>v<>>v^^^><^>>vvvv>^v<^><<>>^<>^>vv><v<<>>^^>v^^^><^<<^^v>v<^<<>v>^^vvv^v^>v^<>^^<>v^v>v>v<v^>vv>^^v<>v>>^<>><>v>v^<<vvvv<vvv><v^<^>^v<>>^><v>><>^<v>v<v>vv^>>vvv<>v>v<v^>>^>>v<<>^<>^<>>>^v<<<^<^v>vv^>><<><v^>^v^^^v<>^^vv><>><>>^>v^<v<>v<>>^<<^v>^^^<>^v^><>v<<v>vv^>vv<<>>><<^v^<>v<vv>>>^^<>^><<^>vv>>^<<v^^vv<>>><v>v><^<v<<>>>^^<>>^<^v><>vv^^^v>vvv>^><<>^^>^<<v^<v<^v<<>vvv<^<<>^>^v<vv<^>vvv>v>vv^<v^><>>^vv<^^^vv><^vv<v^<><v^vvv><<^>^^><v<<vv^>v<vv<v>^<>^v<<>v<v^v^>^>^>v<<^vvv<<<v>^^>^<<<<>vv>>^<>^>>>v<v>^^<v^<v<>>>vv>^^v<<>>>^^v><<<v<v<^v<>^^><v<^v<<v^><><^<><v<^^v>>><v^^v<<v^><^<><<v^>><^<>v>v^<><^<v>^v^>^>^vv^>^^<<vv^>vv<^vvv<>>^^<^>v^>^>^<v^><v<v>>>v<<<><^v<<><^<vv^v^^^>v<^^<v^vvv<v<><v<vv<^vv<>vv<v^<>>vvvvv<<>^v^v>vv>>>vvv^^<^<^<><>v<v>><^v><^<<<>><<<v>^>v<>^>^v>>^<>v^<^>><<>^<v>^>^^^>^^<v>>>><>^v^v><<<<vv^<vv<>vv>v<>v^<v^>v><>>>v^<><^vvv>vv^<^<<^<^^v>^>>>v<^<^v^^<^<^>>><v>vv>^<<><>^>>v>^<<>><^<>v<>vv^^>^>vvv^v<<^<^^<vv<>^vvv<^^v^vv^>>v<^>^^<v^<>v<^<^vv>v<<vv>vv>^>vvv>>>^^>v<>^v>v^<^>>v>^^v>>>>v^<v>v<^>v<v<<>>^v<^^<v><^<>>^<<vv^>>v<<v>^v<>><^>vv<v<^>>^^<vvvvvvvvv>>>v<v<>v^<>>^vv<v^^v<<^vvv^<<^><>vv<><<>>v>vv^><>>^^v^>>v^v^><<<>>^^<^v<<^<>>>>^<^>v^><<^>v<^v<^>>^^<<<<><^<^v^v<>>^v<^<<vv^<><^^vv><v^v^v>^>>^>^vv^>^v<v^v<<vvv^><>>^v^^><>v>vv><^>>vv<vvv<<<<^<>vvv^v<v>^<v<^>^<^<v<><>v^^^^<<vv<^^vv<v>><<v^><>>><v^>^v><^>^><vv^<><^<v>><<^vv<>>v^<<v<>v><v<><><vv>^>>v^<^<v>^><>>><^><v^v<>>>^^<^>v<v>vvv<>^<<><v^^>^>>v<^v>^>v>>>vv>v>>v^^^<^<vvv^<>^>^<v^<v^v>v>^>vv>vvv<>v<^>v>^^>>^<vv^^v>v^^^^^v^vv><^<><>^>vv<^>>^vvvv^^^>^<vv>^v<<^><^^>^<>^^>^<<v<^>>>^><<^^>v^v>>^>vvvv>^^v><v>>vv><<<vv<^>v>^^^<v>v^vvv<^><<^>^<>^><<<<<v^<<vv^v>^<>v<v>^>^>><>v^v<^vv^^>vv<<v^v>vv^vvv<<<<>^v<v^^v^v>v<<v>^^<>^vv^^>^>^v^vv^>>v^vv^^<vv><<v^v^^v><vv<^vvv<vv^^<<v>v^v^^^^v<^<^>v>^>v>^vv^v^^<v<^vvvv<<<>^<^^^<^^<>^<><vv<^^<<^>>><v^vvvv>^<>>^^>v^^v^<<v^^^<<<><^<v^v^^v<v^<>v><<v<>^v>v<^><^>vv^^<vvv<^v>>v>^<><v^><^^^<v^>>vv<<<<<^<>^v^v>^vv^<>v>v<^>vv<<^vv>vv<v<><>>v>><v<^<^^>><<v^v<<^><v<^<vv<v<<vv^>^<<><^^>^<^>>^<vv>><v<<vvv<^^v^>^^<^v>^v<v<>v><v^v^<<^<><<v<<^v>v<<>>^>v>>v>>v<^<<^<^>>>v>^^^v><^>^^>>v<<>^v><v>vvv^vv<<<>vvv<<>^>>>v<v<v^<^<^>^<^>v^^v<^^<v<>v<>>^^>^v^>v<<<<^<>v^><<<v>>>><<v^<^vv>v>><>>^<<<^<^^>v<>>v<>vv<<^<<><<^>v^^^vv^>vvvv>>v>v^><<v<>vv^<<><<vvv>^>>>^<<<^<^<<v>^>v<>>v>>vv^^><<<<^^^v>><<^><v><v^^><v<<v^^v^^v>>v<><><<>^><v><^<vv>><^v<>v<vvv<>^>><v>>v<^><<v>^<>^v><^><^^<v>^><^^v^<<><>>^>v^<^v^vv<><^>vv^>v^vvv^<>>^><^<^<>^<<v^v<^v><>^v<v>>^>>^v^vv>><vv><v^^<<^v^<>^v<<>^><^>><v>>v<<<v^^vv<>^^v>>><><><<v^<<<v^<^^><v^>v^^vv<v^<>>vv^<^v<>^v>>v^v>v<^^vv><>^v<<>v^<>v^>>v>vvv<^><><^^>^vv^>>v^>^<^^<><>><<>^^^><^v^v><<<><<^v^vv>v>><^>>><v^>v<v><><v^v<>v^^>>v<<>v>v<v<v<^^<><>v^^<>>v<^v<v>v<><v<v>^<<>v>vv^^<>>^^^<>^^>^v>v>>>^v^v><v^^<><v>^^v^v<^<^^><<v<^<^<>^<>><<>^>>^>^^><v><>v<><>><<<>>>>vv>>>^>>^v<^>v^^^v<<vv>><<<^<<<>>>>>^>vv<^v^<>^<v^>^v><v>vvv<>>>^v^^^v<<<<>>^^<vv<^<^^>^<>v<^<<<>><>>v<^<>^<vvv<^<>><><<v>^^^>^^<<v<v^>^^v^>><<^vv><v>^v>>^<v>v>^^>^v>^vvv<>v^v^^<><vv>vv^>>><>v<^><v<v^<><<<>^v>^v<<<^>^>^>v^v<<><vvv<<v^^<><v>^>>><vv>><v>>v^<vv>>vv<<^v^v<<><^v<vv>>>vv<>>>>^vv>v^<>vv>v^v<v^><v<^^^^^>vv<><<vvv^<v><^<vv><^^^vv^<>^^^^<^><^<>v^<v^v<<^v<<^^<>>^<v^^>>>vv<vvv<>v<<>><^vvv^<<^^<<>>>^<>>>v^^><>><<>><v^v>>>>>><>>><v^<<vvv^>v<>>v^<>vv<><^^^^v^<<^<v^vv><<^^>v<^vvv^v>>v>^>>v>^^><<v^<>v<>vv<^v^vv><v><<vv^v>>v^>>v<^^^>^><<v<>^><>v>>>vvv<v<vv<^>>^v<v>^<^^^^^v><>v><>v^v^v<v^vv^v>vvvv<>vv<<<vv<v<<>^<^>^^v^<<>^<v><^><v<v<><<>v^<<^<><vv>v<<^v>>^v<><v>^>>^^><>v^<^<vvv^>^>^<<<<>vv>^v^v<^^^<vv>><>^^<<v<^<^^>>>v^v<<^^^<v<v<^<>^v<v><v^vv^^v^^v^^<vv<>^<><vv^<^v^<<^><<vvv>^^<^^^<^v>^>^vv><<<^v<v>vv>v<>v^v<v^>v^>>>v^v<>^v<<>^vv>v>v>v^<^>v^^<^>^^^^vv>^^><^>vv^>>^^v>><<<<^><>v<>^<v<vv^>^^><<^><v>v^>^^<^>>><>><v^v<v^<v<vv^v^<<^<vvv>>><vv<^^>>^>^><<v^<>>v>v^v^^><<>vv^v>v^<v><^<>^^<^>v>^<><<<v>^<^<^>^>^>^^v^<<^^v^^<^<>><^>v>>^^<>^^^<<<<v^>^v<^vv>^<<<v<><<v<>vv>>>v><>>><>>v<<<vv><>^v>v<^>><^><><v<>^v^>^v>^v<<><<^<>>v>^><>^>><>><^<v^><v^^<><v><^^>^v^^<>v^<v^<^v<v^^^^^v^<<^>^^^<^v><>^^<<<><<<<<^^>v^vvvv>v<>>vv<^>^v^>v<^vv^v<<><<v>v^v>^^><><^<v^>v><vv><>>><<>^vv<>v>>v<^v>>>v<v>v>v>^vv<<>^^vv<v<^v^<v<v>vv<>^<^<vv<v^<^v^^><<>^>><^v>vv^^v<<^^><<>v^^<><><v^^<v^v>^>^>^>v<^<v>^v^^>v<>vvv<^v<v^v><<v^><<^^><^<<v^v^>v<>^>v><><v>^<v<v>^<^^^>^v<<><<><>vv>v^<>v^><v^v<v><><<v>v<vv><<v>>v>^<<<>vv>>vvv>^^vv^v^^<^^<>v^^<>v>>^^>^>^>v>><^>><>>^<<>><^>v<<<<<<<^v^v<v^<v^^>^<><<v<^>v^>v^vv<<^^vv^>>>>^<>v<^v<>v<vv<^>>v^vv>vv><vv<<^>v>><vv>>>vv^<<<<vv^>v<<<<^^>^^v^><<^<v^>v^>^^<v<>vvv^>^<>vvv<v<^^>v^<<v>><>v<v<>^^<vvv>^>vv><><<<^^vv<v^<v<>v<>><<v><^vv^>^<^>^^^<<<v>vv^<^<<>^>^<vv>v><v<<^><^>^^<vv^v^^>>>>vv^><^^vv><>^<v^v>v<vv>v><<<v>v<v>^><v^^><v>v<^v^>>^^<v^>^^>vv>>vv^><^vv^vv<<^>vv>^v<v><vv><v<vvvvv>^^v^v><v>>>^vv<>v>^^^^<^>><>^v^^^>v<^^<<^^v<vv<>vvv<^>><><^>>^><^<>v<v<<><<v><v^v<>><^>v><<v^<v>v<^<vv^v^v^>vvv^^>v>^<vv^>v^v^<>v>^>>vv>><^^<v<<>^vv<><><<^v<v>v<<vv><>><^v<v>>v^>vvv^v^<<^><v<>^vv^>v^<v<^>>v<v><v><v>>^<<<v^<><<>v>^>^^<v<>>^<>^>^><<<^<<^<<^>^v>>><vvv>><<<<v>>>>>>>^<^v<^>v<>vv<><>v>>^>>^>vv^^><<^<v<v>>^^<<^>v<^>>vv>^<>v><^>v<vv>>>>>>^v<^<<<v^><vv<<>>vv<<><v<><<<v<^<v<>>v<^^^^v^^<^^^<^<vv><<^>><>v<<>v<v<>>>><>v^vv>^>^>>vv^v<v<<><^v>vv^><v<<>v^v<^>vv<<^^v><^>>^^vv<^<>>v^^>><v>^v>>>^>>v>v<>v<^vv><>^<<^>vv>>><><>v^><>v^>v>v><^v<><v<v>^v<<^vv^><^^>><^^^<<<^>v>^v>>><^>><^>>>^^^<^>vv<><<<v^>^<^^>>^^^v^v^v>v<v>>>><^>>>v>^vv<<^^^<^^vv>v<<><v<<^^>v>><<v^^><^>^<^>^v^>v><^<^vv>v>><>^<<vv<<v>v<vv<v>^>^>><^^<v>^v^v<><<>vvv<^<v>^><>^>vvv>>>^><<>><v^^<^<<^v>>^v<v<vv>vv^v^>v<<vvv<^^v^v>^<^>>^>v<^>^v<<><<<^>^<^^^>vv<^^^^vv<v<^^v<<<<v<^v^<><v<<^><<>vv>>><^<^<>>>^>^>>^<<<<<^^v>^>^<>vvv^^<^><^>^^v>^vv^><v^<^<<v^<vvv<<^v<><^><^>>>v>^v>^>^v<vv^v>><v><^><v^^>v^>^<><<><>v<v^>vvv^>^>>v<>^><^>^><vvv>^^v^v>v<>^v^><^>>v>v^><<<^>>^<>^<>>v><>>v^>^>^^<>>v^>^<vvvv<^vvvv^>>vv^<v^v>^vv<>v<>^<v<v>v>^^><^>vv^<^v^<<^<^<><vv<^v<^v><>>>^v^<<^><^>vv<v>v<^>vv^>v<<<>^<><v<^^^>v><^^<>^<^<v^vv^<<^>><<v^v<^vvv<<<>>vvvv^v^^^>v<>>><<>vvv<<^^^>v>v>>v<<v<v^v^>^^v>^><^<><<v^<v<v^^^><>v^^^<v>vv<>^>^^vv>^<<^v<^v><v>>>^>>><^<<>^v>>^>vv<<<v<>^<v><v^<^<>v>v^^v^>><<^v<<<<>v>v>v^^<^><>^^<<<v>vv<>>>^>>v<><v^>^<><vv>v>v^v<v^<^>>^>><<^^<^^v<vv<>><<<v<^<<^^^>vvv^<vvv<^>vv><>><<<^<v^v^^<<^vvv^^<^<><<>^<^<>>vvv<>^<>v^v<><>>v^v><<>>>vvv>v<>^>>^><^>vv<<>>v<<^><>v>>^^<v>^>^<<>><^<<vv<^<vv^vv><>>>><^<v>^>vv<v><>^<>vvvvv^vv<<v<>>>^<<><>^^vvv>>>vv<<^^><^v^^v<>^^>^><^>v^^^^v<^<<vv<vv<>vv^^>v^vv>v><>>vv>^<^<v^v^>>v^v^^v>^>vv^>v<vvvv<^v<^v>^v>^^v<<^>^^<<>^><^v>>>vv^>^^>vvvv>>v<^<v>^>>>v^<><^<^^<v>vv^^><v>v^<>^^^>>><^^v>v>^<<>^<v^>vvv^>^^^><v<^>>v<v>>^v><<><<>v<^<<>^><>^>vv>^<v>^^v<<^v^vvv^^>^vv^<^>^>^^v>v^>^<<><<^>v>>vv^vv><v>>^<<^<v^^<^<v^^vv^><^^<^^><v^^>v^^^<^<>^<>>^v<^vvv^^v^<><^>>>>>v><><<<>vv<^v>><<>vvv<><<vv<<<^>v^^>>^>^v>><><^^v<>><>>v^>^<vv><<<>><><<v>^^<>>v<><^<vv>vv<^v>^<<<<v<^<<^^>>^<><^>><<>^>v>^^^v>>^<^^v><v^v>^><<><>>^>>^<<v<>^v<>^>^<v>>vv>^vvv<<v<<^>^>^<<^^<>^^^^vvv<>^vv<vvvvv^^>^^<^>>><>v^<><^<<^>v^^v<>>^vv<>v^^<>>v^vvvvv<<v^<v^^>>><vvvvv>><^>vv>v^v^<v<^>^^><^>^^^^v<><^v<<>v^>v>>vv<<>^<v^^>vvv>^^<v^<>vv^><>><v^^v<>^>>^>v><>>^^v>^>^>>>^>v<^v>v>^<^^^^^>>v<v<>>v<<^>^<v<<>^^>><<^><>v<>^^^vv<>^^>><<^^>v>vv>vv>v^>^v>v^^<>>><<v><v<<>>v><>vvv^^v>^^>^vvvv^>^<>^vvvv><v><v<>>><>^<^vv<>^v<^v<>^vvv<<>><vvv^>>^><<vv^<v^>^<v<<^^>^^<^^v^>v<>v^v><>><v^^>>^vvv><^vv>v^<^<^v>>v^^>^vvv^<v^^v^^>v<^<>>^<>>>^^<><^^vv<>^vv^<>>>>^^<<^^<>vv^^><>^^<v<<v>^<v^^>^v<><><>vvv>^v^>>vv<<^v<<>><v>^><^>>>^<^<^^>vv^<<^<>>^^><><<v>^^<v>>v<<vvvv>^v^vv>><^^<<^>>v>v<^^^<^><^^vv>^vv<^<vv<>v><^<><v><^^^>>^<><^<v>>>>v^<v>>>>>v<><^^>v<^<^>><v<>^>vv>^^v^v^<<v<><<<^v^><<^<><<<<v<^>><<<>v>>vv><vv<><<^<^<><vv>^^^^<>v<<<<v>vv<>vv^^^>><>vv^><>>^vv<<><^^vv<>v^>>^<<>^<v^<^>v<'

DECLARE @sx smallint = 0
	, @sy smallint = 0
	, @rx smallint = 0
	, @ry smallint = 0
	, @useRobot bit = 0
	, @x smallint = 0
	, @y smallint = 0
	, @i smallint = 0
	, @i_max smallint = 0
	, @char char(1)

SELECT @i_max = len(@input)

CREATE TABLE #visited
(
	x smallint
	, y smallint
	, PRIMARY KEY (x,y)
)

WHILE @i <= @i_max BEGIN
	SELECT @char = substring(@input, @i, 1)

	IF @i % 2 = 0 AND @useRobot = 1 BEGIN
		SELECT @x = @sx
			, @y = @sy
	END ELSE BEGIN
		SELECT @x = @rx 
			, @y = @ry
	END

	IF @char = '>' BEGIN
		SELECT @x = @x + 1
	END ELSE IF @char = '<' BEGIN
		SELECT @x = @x - 1
	END ELSE IF @char = '^' BEGIN
		SELECT @y = @y + 1
	END ELSE IF @char = 'V' BEGIN
		SELECT @y = @y - 1 
	END

	IF 0 = (SELECT COUNT(1) FROM #visited WHERE x = @x AND y = @y) BEGIN
		INSERT INTO #visited(x, y) SELECT @x, @y 
	END

	IF @i % 2 = 0 AND @useRobot = 1 BEGIN
		SELECT @sx = @x
			, @sy = @y
	END ELSE BEGIN
		SELECT @rx = @x 
			, @ry = @y
	END

	SELECT @i = @i + 1
END

SELECT COUNT(1)
FROM #visited 

I’m not the only one doing this with T-SQL.  Andy Warren also posted his very different, yet same results, version in separate posts for the two parts: Day 3 Part 1 and Day 3 Part 2.  His came out more complex, but more set-based.

My problem is that I develop T-SQL for the problem at hand with the expected size of the data.  Andy is looking ahead and developed a method that will probably scale much better than mine.  I’m hedging my bets and hoping the elf in charge of this either sobers up or gets fired.

Advent of Code

If you want your skills to be sharp, you practice.  If you want to get yourself to actually do practice, you call it a “challenge”.  This is what the Advent of Code is.

Now let me be perfectly clear about how I feel about practice.  I saw Jeremiah’s (b|tpost on this and ignored it.  Then I saw Steve Jones’s (b|tpost and ignored it.  Then I saw Andy Warren’s (b|tpost on it that also contained the answer, so I took a look.

It was intriguing and refreshing to have a question like that asked.  The refreshing part was all the requirements were all there in one place.

The intriguing part was me sitting there asking myself how I’d answer that question, realizing I would have done SOOO much more work than Andy.  However, I had to read Andy’s answer 3 times for me to realize I wasn’t able to read it, which increase my level of intrigue.  I rewrote it adding in some legibility for my sake and posted it on his comments.

Now there’s more to it.  I cheated.  I only rewrote someone else’s answer and didn’t really do the work.  Yeah, that’s what half of computer science is, copy/paste others work intelligently, add just enough to call it your own, then ask for a raise.  However, could I do one on my own?

Well, there’s 25 days on the Advent calendars I’m used to, and this one is no different.  So I tried it out on Day 2.

SPOILER ALERT: My answer to Day 2 is below.  Don’t cheat like I did.

The one odd thing you’ll see is that I copy/pasted the dimensions from the Advent of Code site, then I held in ALT and drug the mouse to write on more lines than one at a time.  (even I read that as I drugged the mouse…oh well)  However, that introduced an issue where there were trailing spaces and tabs I had to take care of as well, which I fixed in code as well.

IF OBJECT_ID('tempdb..#input') IS NOT NULL BEGIN
	DROP TABLE #input;
END

CREATE TABLE #input  
(
	dims varchar(100)
);

INSERT INTO #input 
SELECT '4x23x21'
UNION ALL SELECT '22x29x19 '
UNION ALL SELECT '11x4x11	 '
UNION ALL SELECT '8x10x5	 '
UNION ALL SELECT '24x18x16 '
UNION ALL SELECT '11x25x22 '
UNION ALL SELECT '2x13x20	 '
UNION ALL SELECT '24x15x14 '
UNION ALL SELECT '14x22x2	 '
UNION ALL SELECT '30x7x3	 '
UNION ALL SELECT '30x22x25 '
UNION ALL SELECT '29x9x9	 '
UNION ALL SELECT '29x29x26 '
UNION ALL SELECT '14x3x16	 '
UNION ALL SELECT '1x10x26	 '
UNION ALL SELECT '29x2x30	 '
UNION ALL SELECT '30x10x25 '
UNION ALL SELECT '10x26x20 '
UNION ALL SELECT '1x2x18	 '
UNION ALL SELECT '25x18x5	 '
UNION ALL SELECT '21x3x24	 '
UNION ALL SELECT '2x5x7	 '
UNION ALL SELECT '22x11x21 '
UNION ALL SELECT '11x8x8	 '
UNION ALL SELECT '16x18x2	 '
UNION ALL SELECT '13x3x8	 '
UNION ALL SELECT '1x16x19	 '
UNION ALL SELECT '19x16x12 '
UNION ALL SELECT '21x15x1	 '
UNION ALL SELECT '29x9x4	 '
UNION ALL SELECT '27x10x8	 '
UNION ALL SELECT '2x7x27	 '
UNION ALL SELECT '2x20x23	 '
UNION ALL SELECT '24x11x5	 '
UNION ALL SELECT '2x8x27	 '
UNION ALL SELECT '10x28x10 '
UNION ALL SELECT '24x11x10 '
UNION ALL SELECT '19x2x12	 '
UNION ALL SELECT '27x5x10	 '
UNION ALL SELECT '1x14x25	 '
UNION ALL SELECT '5x14x30	 '
UNION ALL SELECT '15x26x12 '
UNION ALL SELECT '23x20x22 '
UNION ALL SELECT '5x12x1	 '
UNION ALL SELECT '9x26x9	 '
UNION ALL SELECT '23x25x5	 '
UNION ALL SELECT '28x16x19 '
UNION ALL SELECT '17x23x17 '
UNION ALL SELECT '2x27x20	 '
UNION ALL SELECT '18x27x13 '
UNION ALL SELECT '16x7x18	 '
UNION ALL SELECT '22x7x29	 '
UNION ALL SELECT '17x28x6	 '
UNION ALL SELECT '9x22x17	 '
UNION ALL SELECT '10x5x6	 '
UNION ALL SELECT '14x2x12	 '
UNION ALL SELECT '25x5x6	 '
UNION ALL SELECT '26x9x10	 '
UNION ALL SELECT '19x21x6	 '
UNION ALL SELECT '19x4x27	 '
UNION ALL SELECT '23x16x14 '
UNION ALL SELECT '21x17x29 '
UNION ALL SELECT '24x18x10 '
UNION ALL SELECT '7x19x6	 '
UNION ALL SELECT '14x15x10 '
UNION ALL SELECT '9x10x19	 '
UNION ALL SELECT '20x18x4	 '
UNION ALL SELECT '11x14x8	 '
UNION ALL SELECT '30x15x9	 '
UNION ALL SELECT '25x12x24 '
UNION ALL SELECT '3x12x5	 '
UNION ALL SELECT '12x21x28 '
UNION ALL SELECT '8x23x10	 '
UNION ALL SELECT '18x26x8	 '
UNION ALL SELECT '17x1x8	 '
UNION ALL SELECT '2x29x15	 '
UNION ALL SELECT '3x13x28	 '
UNION ALL SELECT '23x20x11 '
UNION ALL SELECT '27x25x6	 '
UNION ALL SELECT '19x21x3	 '
UNION ALL SELECT '30x22x27 '
UNION ALL SELECT '28x24x4	 '
UNION ALL SELECT '26x18x21 '
UNION ALL SELECT '11x7x16	 '
UNION ALL SELECT '22x27x6	 '
UNION ALL SELECT '27x5x26	 '
UNION ALL SELECT '4x10x4	 '
UNION ALL SELECT '4x2x27	 '
UNION ALL SELECT '2x3x26	 '
UNION ALL SELECT '26x29x19 '
UNION ALL SELECT '30x26x24 '
UNION ALL SELECT '8x25x12	 '
UNION ALL SELECT '16x17x5	 '
UNION ALL SELECT '13x2x3	 '
UNION ALL SELECT '1x30x22	 '
UNION ALL SELECT '20x9x1	 '
UNION ALL SELECT '24x26x19 '
UNION ALL SELECT '26x18x1	 '
UNION ALL SELECT '18x29x24 '
UNION ALL SELECT '1x6x9	 '
UNION ALL SELECT '20x27x2	 '
UNION ALL SELECT '3x22x21	 '
UNION ALL SELECT '4x16x8	 '
UNION ALL SELECT '29x18x16 '
UNION ALL SELECT '7x16x23	 '
UNION ALL SELECT '13x8x14	 '
UNION ALL SELECT '19x25x10 '
UNION ALL SELECT '23x29x6	 '
UNION ALL SELECT '23x21x1	 '
UNION ALL SELECT '22x26x10 '
UNION ALL SELECT '14x4x2	 '
UNION ALL SELECT '18x29x17 '
UNION ALL SELECT '9x4x18	 '
UNION ALL SELECT '7x22x9	 '
UNION ALL SELECT '19x5x26	 '
UNION ALL SELECT '27x29x19 '
UNION ALL SELECT '7x13x14	 '
UNION ALL SELECT '19x10x1	 '
UNION ALL SELECT '6x22x3	 '
UNION ALL SELECT '12x21x5	 '
UNION ALL SELECT '24x20x12 '
UNION ALL SELECT '28x2x11	 '
UNION ALL SELECT '16x18x23 '
UNION ALL SELECT '2x13x25	 '
UNION ALL SELECT '11x7x17	 '
UNION ALL SELECT '27x21x4	 '
UNION ALL SELECT '2x10x25	 '
UNION ALL SELECT '22x16x17 '
UNION ALL SELECT '23x22x15 '
UNION ALL SELECT '17x13x13 '
UNION ALL SELECT '23x24x26 '
UNION ALL SELECT '27x18x24 '
UNION ALL SELECT '24x7x28	 '
UNION ALL SELECT '30x12x15 '
UNION ALL SELECT '14x28x19 '
UNION ALL SELECT '2x15x29	 '
UNION ALL SELECT '12x13x5	 '
UNION ALL SELECT '17x22x21 '
UNION ALL SELECT '27x10x27 '
UNION ALL SELECT '17x6x25	 '
UNION ALL SELECT '22x2x1	 '
UNION ALL SELECT '1x10x9	 '
UNION ALL SELECT '9x7x2	 '
UNION ALL SELECT '30x28x3	 '
UNION ALL SELECT '28x11x10 '
UNION ALL SELECT '8x23x15	 '
UNION ALL SELECT '23x4x20	 '
UNION ALL SELECT '12x5x4	 '
UNION ALL SELECT '13x17x14 '
UNION ALL SELECT '28x11x2	 '
UNION ALL SELECT '21x11x29 '
UNION ALL SELECT '10x23x22 '
UNION ALL SELECT '27x23x14 '
UNION ALL SELECT '7x15x23	 '
UNION ALL SELECT '20x2x13	 '
UNION ALL SELECT '8x21x4	 '
UNION ALL SELECT '10x20x11 '
UNION ALL SELECT '23x28x11 '
UNION ALL SELECT '21x22x25 '
UNION ALL SELECT '23x11x17 '
UNION ALL SELECT '2x29x10	 '
UNION ALL SELECT '28x16x5	 '
UNION ALL SELECT '30x26x10 '
UNION ALL SELECT '17x24x16 '
UNION ALL SELECT '26x27x25 '
UNION ALL SELECT '14x13x25 '
UNION ALL SELECT '22x27x5	 '
UNION ALL SELECT '24x15x12 '
UNION ALL SELECT '5x21x25	 '
UNION ALL SELECT '4x27x1	 '
UNION ALL SELECT '25x4x10	 '
UNION ALL SELECT '15x13x1	 '
UNION ALL SELECT '21x23x7	 '
UNION ALL SELECT '8x3x4	 '
UNION ALL SELECT '10x5x7	 '
UNION ALL SELECT '9x13x30	 '
UNION ALL SELECT '2x2x30	 '
UNION ALL SELECT '26x4x29	 '
UNION ALL SELECT '5x14x14	 '
UNION ALL SELECT '2x27x9	 '
UNION ALL SELECT '22x16x1	 '
UNION ALL SELECT '4x23x5	 '
UNION ALL SELECT '13x7x26	 '
UNION ALL SELECT '2x12x10	 '
UNION ALL SELECT '12x7x22	 '
UNION ALL SELECT '26x30x26 '
UNION ALL SELECT '28x16x28 '
UNION ALL SELECT '15x19x11 '
UNION ALL SELECT '4x18x1	 '
UNION ALL SELECT '20x14x24 '
UNION ALL SELECT '6x10x22	 '
UNION ALL SELECT '9x20x3	 '
UNION ALL SELECT '14x9x27	 '
UNION ALL SELECT '26x17x9	 '
UNION ALL SELECT '10x30x28 '
UNION ALL SELECT '6x3x29	 '
UNION ALL SELECT '4x16x28	 '
UNION ALL SELECT '8x24x11	 '
UNION ALL SELECT '23x10x1	 '
UNION ALL SELECT '11x7x7	 '
UNION ALL SELECT '29x6x15	 '
UNION ALL SELECT '13x25x12 '
UNION ALL SELECT '29x14x3	 '
UNION ALL SELECT '26x22x21 '
UNION ALL SELECT '8x3x11	 '
UNION ALL SELECT '27x13x25 '
UNION ALL SELECT '27x6x2	 '
UNION ALL SELECT '8x11x7	 '
UNION ALL SELECT '25x12x9	 '
UNION ALL SELECT '24x30x12 '
UNION ALL SELECT '13x1x30	 '
UNION ALL SELECT '25x23x16 '
UNION ALL SELECT '9x13x29	 '
UNION ALL SELECT '29x26x16 '
UNION ALL SELECT '11x15x9	 '
UNION ALL SELECT '11x23x6	 '
UNION ALL SELECT '15x27x28 '
UNION ALL SELECT '27x24x21 '
UNION ALL SELECT '6x24x1	 '
UNION ALL SELECT '25x25x5	 '
UNION ALL SELECT '11x1x26	 '
UNION ALL SELECT '21x4x24	 '
UNION ALL SELECT '10x5x12	 '
UNION ALL SELECT '4x30x13	 '
UNION ALL SELECT '24x22x5	 '
UNION ALL SELECT '26x7x21	 '
UNION ALL SELECT '23x3x17	 '
UNION ALL SELECT '22x18x2	 '
UNION ALL SELECT '25x1x14	 '
UNION ALL SELECT '23x25x30 '
UNION ALL SELECT '8x7x7	 '
UNION ALL SELECT '30x19x8	 '
UNION ALL SELECT '17x6x15	 '
UNION ALL SELECT '2x11x20	 '
UNION ALL SELECT '8x3x22	 '
UNION ALL SELECT '23x14x26 '
UNION ALL SELECT '8x22x25	 '
UNION ALL SELECT '27x1x2	 '
UNION ALL SELECT '10x26x2	 '
UNION ALL SELECT '28x30x7	 '
UNION ALL SELECT '5x30x7	 '
UNION ALL SELECT '27x16x30 '
UNION ALL SELECT '28x29x1	 '
UNION ALL SELECT '8x25x18	 '
UNION ALL SELECT '20x12x29 '
UNION ALL SELECT '9x19x9	 '
UNION ALL SELECT '7x25x15	 '
UNION ALL SELECT '25x18x18 '
UNION ALL SELECT '11x8x2	 '
UNION ALL SELECT '4x20x6	 '
UNION ALL SELECT '18x5x20	 '
UNION ALL SELECT '2x3x29	 '
UNION ALL SELECT '25x26x22 '
UNION ALL SELECT '18x25x26 '
UNION ALL SELECT '9x12x16	 '
UNION ALL SELECT '18x7x27	 '
UNION ALL SELECT '17x20x9	 '
UNION ALL SELECT '6x29x26	 '
UNION ALL SELECT '17x7x19	 '
UNION ALL SELECT '21x7x5	 '
UNION ALL SELECT '29x15x12 '
UNION ALL SELECT '22x4x1	 '
UNION ALL SELECT '11x12x11 '
UNION ALL SELECT '26x30x4	 '
UNION ALL SELECT '12x24x13 '
UNION ALL SELECT '13x8x3	 '
UNION ALL SELECT '26x25x3	 '
UNION ALL SELECT '21x26x10 '
UNION ALL SELECT '14x9x26	 '
UNION ALL SELECT '20x1x7	 '
UNION ALL SELECT '11x12x3	 '
UNION ALL SELECT '12x11x4	 '
UNION ALL SELECT '11x15x30 '
UNION ALL SELECT '17x6x25	 '
UNION ALL SELECT '20x22x3	 '
UNION ALL SELECT '1x16x17	 '
UNION ALL SELECT '11x5x20	 '
UNION ALL SELECT '12x12x7	 '
UNION ALL SELECT '2x14x10	 '
UNION ALL SELECT '14x27x3	 '
UNION ALL SELECT '14x16x18 '
UNION ALL SELECT '21x28x24 '
UNION ALL SELECT '14x20x1	 '
UNION ALL SELECT '29x14x1	 '
UNION ALL SELECT '10x10x9	 '
UNION ALL SELECT '25x23x4	 '
UNION ALL SELECT '17x15x14 '
UNION ALL SELECT '9x20x26	 '
UNION ALL SELECT '16x2x17	 '
UNION ALL SELECT '13x28x25 '
UNION ALL SELECT '16x1x11	 '
UNION ALL SELECT '19x16x8	 '
UNION ALL SELECT '20x21x2	 '
UNION ALL SELECT '27x9x22	 '
UNION ALL SELECT '24x18x3	 '
UNION ALL SELECT '23x30x6	 '
UNION ALL SELECT '4x18x3	 '
UNION ALL SELECT '30x15x8	 '
UNION ALL SELECT '27x20x19 '
UNION ALL SELECT '28x29x26 '
UNION ALL SELECT '2x21x18	 '
UNION ALL SELECT '1x23x30	 '
UNION ALL SELECT '1x9x12	 '
UNION ALL SELECT '4x11x30	 '
UNION ALL SELECT '1x28x4	 '
UNION ALL SELECT '17x10x10 '
UNION ALL SELECT '12x14x6	 '
UNION ALL SELECT '8x9x24	 '
UNION ALL SELECT '8x3x3	 '
UNION ALL SELECT '29x8x20	 '
UNION ALL SELECT '26x29x2	 '
UNION ALL SELECT '29x25x25 '
UNION ALL SELECT '11x17x23 '
UNION ALL SELECT '6x30x21	 '
UNION ALL SELECT '13x18x29 '
UNION ALL SELECT '2x10x8	 '
UNION ALL SELECT '29x29x27 '
UNION ALL SELECT '27x15x15 '
UNION ALL SELECT '16x17x30 '
UNION ALL SELECT '3x3x22	 '
UNION ALL SELECT '21x12x6	 '
UNION ALL SELECT '22x1x5	 '
UNION ALL SELECT '30x8x20	 '
UNION ALL SELECT '6x28x13	 '
UNION ALL SELECT '11x2x23	 '
UNION ALL SELECT '14x18x27 '
UNION ALL SELECT '6x26x13	 '
UNION ALL SELECT '10x24x24 '
UNION ALL SELECT '4x24x6	 '
UNION ALL SELECT '20x8x3	 '
UNION ALL SELECT '23x11x5	 '
UNION ALL SELECT '29x5x24	 '
UNION ALL SELECT '14x15x22 '
UNION ALL SELECT '21x17x13 '
UNION ALL SELECT '10x10x8	 '
UNION ALL SELECT '1x11x23	 '
UNION ALL SELECT '21x19x24 '
UNION ALL SELECT '19x9x13	 '
UNION ALL SELECT '21x26x28 '
UNION ALL SELECT '25x11x28 '
UNION ALL SELECT '2x17x1	 '
UNION ALL SELECT '18x9x8	 '
UNION ALL SELECT '5x21x6	 '
UNION ALL SELECT '12x5x2	 '
UNION ALL SELECT '23x8x15	 '
UNION ALL SELECT '30x16x24 '
UNION ALL SELECT '7x9x27	 '
UNION ALL SELECT '16x30x7	 '
UNION ALL SELECT '2x21x28	 '
UNION ALL SELECT '5x10x6	 '
UNION ALL SELECT '8x7x1	 '
UNION ALL SELECT '28x13x5	 '
UNION ALL SELECT '11x5x14	 '
UNION ALL SELECT '26x22x29 '
UNION ALL SELECT '23x15x13 '
UNION ALL SELECT '14x2x16	 '
UNION ALL SELECT '22x21x9	 '
UNION ALL SELECT '4x20x3	 '
UNION ALL SELECT '18x17x19 '
UNION ALL SELECT '12x7x9	 '
UNION ALL SELECT '6x12x25	 '
UNION ALL SELECT '3x30x27	 '
UNION ALL SELECT '8x19x22	 '
UNION ALL SELECT '1x9x27	 '
UNION ALL SELECT '23x20x12 '
UNION ALL SELECT '14x7x29	 '
UNION ALL SELECT '9x12x12	 '
UNION ALL SELECT '30x2x6	 '
UNION ALL SELECT '15x7x16	 '
UNION ALL SELECT '19x13x18 '
UNION ALL SELECT '11x8x13	 '
UNION ALL SELECT '16x5x3	 '
UNION ALL SELECT '19x26x24 '
UNION ALL SELECT '26x8x21	 '
UNION ALL SELECT '21x20x7	 '
UNION ALL SELECT '15x1x25	 '
UNION ALL SELECT '29x15x21 '
UNION ALL SELECT '22x17x7	 '
UNION ALL SELECT '16x17x10 '
UNION ALL SELECT '6x12x24	 '
UNION ALL SELECT '8x13x27	 '
UNION ALL SELECT '30x25x14 '
UNION ALL SELECT '25x7x10	 '
UNION ALL SELECT '15x2x2	 '
UNION ALL SELECT '18x15x19 '
UNION ALL SELECT '18x13x24 '
UNION ALL SELECT '19x30x1	 '
UNION ALL SELECT '17x1x3	 '
UNION ALL SELECT '26x21x15 '
UNION ALL SELECT '10x10x18 '
UNION ALL SELECT '9x16x6	 '
UNION ALL SELECT '29x7x30	 '
UNION ALL SELECT '11x10x30 '
UNION ALL SELECT '6x11x2	 '
UNION ALL SELECT '7x29x23	 '
UNION ALL SELECT '13x2x30	 '
UNION ALL SELECT '25x27x13 '
UNION ALL SELECT '5x15x21	 '
UNION ALL SELECT '4x8x30	 '
UNION ALL SELECT '15x27x11 '
UNION ALL SELECT '27x1x6	 '
UNION ALL SELECT '2x24x11	 '
UNION ALL SELECT '16x20x19 '
UNION ALL SELECT '25x28x20 '
UNION ALL SELECT '6x8x4	 '
UNION ALL SELECT '27x16x11 '
UNION ALL SELECT '1x5x27	 '
UNION ALL SELECT '12x19x26 '
UNION ALL SELECT '18x24x14 '
UNION ALL SELECT '4x25x17	 '
UNION ALL SELECT '24x24x26 '
UNION ALL SELECT '28x3x18	 '
UNION ALL SELECT '8x20x28	 '
UNION ALL SELECT '22x7x21	 '
UNION ALL SELECT '24x5x28	 '
UNION ALL SELECT '23x30x29 '
UNION ALL SELECT '25x16x27 '
UNION ALL SELECT '28x10x30 '
UNION ALL SELECT '9x2x4	 '
UNION ALL SELECT '30x2x23	 '
UNION ALL SELECT '21x9x23	 '
UNION ALL SELECT '27x4x26	 '
UNION ALL SELECT '2x23x16	 '
UNION ALL SELECT '24x26x30 '
UNION ALL SELECT '26x1x30	 '
UNION ALL SELECT '10x4x28	 '
UNION ALL SELECT '11x29x12 '
UNION ALL SELECT '28x13x30 '
UNION ALL SELECT '24x10x28 '
UNION ALL SELECT '8x12x12	 '
UNION ALL SELECT '19x27x11 '
UNION ALL SELECT '11x28x7	 '
UNION ALL SELECT '14x6x3	 '
UNION ALL SELECT '6x27x5	 '
UNION ALL SELECT '6x17x14	 '
UNION ALL SELECT '24x24x17 '
UNION ALL SELECT '18x23x14 '
UNION ALL SELECT '17x5x7	 '
UNION ALL SELECT '11x4x23	 '
UNION ALL SELECT '5x1x17	 '
UNION ALL SELECT '26x15x24 '
UNION ALL SELECT '3x9x24	 '
UNION ALL SELECT '5x3x15	 '
UNION ALL SELECT '5x20x19	 '
UNION ALL SELECT '5x21x2	 '
UNION ALL SELECT '13x5x30	 '
UNION ALL SELECT '19x6x24	 '
UNION ALL SELECT '19x17x6	 '
UNION ALL SELECT '23x7x13	 '
UNION ALL SELECT '28x23x13 '
UNION ALL SELECT '9x1x6	 '
UNION ALL SELECT '15x12x16 '
UNION ALL SELECT '21x19x9	 '
UNION ALL SELECT '25x5x5	 '
UNION ALL SELECT '9x7x9	 '
UNION ALL SELECT '6x5x8	 '
UNION ALL SELECT '3x11x18	 '
UNION ALL SELECT '23x25x11 '
UNION ALL SELECT '25x4x6	 '
UNION ALL SELECT '4x27x1	 '
UNION ALL SELECT '4x3x3	 '
UNION ALL SELECT '30x11x5	 '
UNION ALL SELECT '9x17x12	 '
UNION ALL SELECT '15x6x24	 '
UNION ALL SELECT '10x22x15 '
UNION ALL SELECT '29x27x9	 '
UNION ALL SELECT '20x21x11 '
UNION ALL SELECT '18x10x5	 '
UNION ALL SELECT '11x2x2	 '
UNION ALL SELECT '9x8x8	 '
UNION ALL SELECT '1x26x21	 '
UNION ALL SELECT '11x11x16 '
UNION ALL SELECT '2x18x30	 '
UNION ALL SELECT '29x27x24 '
UNION ALL SELECT '27x8x18	 '
UNION ALL SELECT '19x3x17	 '
UNION ALL SELECT '30x21x26 '
UNION ALL SELECT '25x13x25 '
UNION ALL SELECT '20x22x1	 '
UNION ALL SELECT '10x1x12	 '
UNION ALL SELECT '11x17x15 '
UNION ALL SELECT '29x11x30 '
UNION ALL SELECT '17x30x27 '
UNION ALL SELECT '21x22x17 '
UNION ALL SELECT '13x6x22	 '
UNION ALL SELECT '22x16x12 '
UNION ALL SELECT '27x18x19 '
UNION ALL SELECT '4x13x6	 '
UNION ALL SELECT '27x29x10 '
UNION ALL SELECT '3x23x10	 '
UNION ALL SELECT '26x16x24 '
UNION ALL SELECT '18x26x20 '
UNION ALL SELECT '11x28x16 '
UNION ALL SELECT '21x6x15	 '
UNION ALL SELECT '9x26x17	 '
UNION ALL SELECT '8x15x8	 '
UNION ALL SELECT '3x7x10	 '
UNION ALL SELECT '2x28x8	 '
UNION ALL SELECT '1x2x24	 '
UNION ALL SELECT '7x8x9	 '
UNION ALL SELECT '19x4x22	 '
UNION ALL SELECT '11x20x9	 '
UNION ALL SELECT '12x22x16 '
UNION ALL SELECT '26x8x19	 '
UNION ALL SELECT '13x28x24 '
UNION ALL SELECT '4x10x16	 '
UNION ALL SELECT '12x8x10	 '
UNION ALL SELECT '14x24x24 '
UNION ALL SELECT '19x19x28 '
UNION ALL SELECT '29x1x15	 '
UNION ALL SELECT '10x5x14	 '
UNION ALL SELECT '20x19x23 '
UNION ALL SELECT '10x7x12	 '
UNION ALL SELECT '1x7x13	 '
UNION ALL SELECT '5x12x13	 '
UNION ALL SELECT '25x21x8	 '
UNION ALL SELECT '22x28x8	 '
UNION ALL SELECT '7x9x4	 '
UNION ALL SELECT '3x20x15	 '
UNION ALL SELECT '15x27x19 '
UNION ALL SELECT '18x24x12 '
UNION ALL SELECT '16x10x16 '
UNION ALL SELECT '22x19x8	 '
UNION ALL SELECT '15x4x3	 '
UNION ALL SELECT '9x30x25	 '
UNION ALL SELECT '1x1x6	 '
UNION ALL SELECT '24x4x25	 '
UNION ALL SELECT '13x18x29 '
UNION ALL SELECT '10x2x8	 '
UNION ALL SELECT '21x1x17	 '
UNION ALL SELECT '29x14x22 '
UNION ALL SELECT '17x29x11 '
UNION ALL SELECT '10x27x16 '
UNION ALL SELECT '25x16x15 '
UNION ALL SELECT '14x2x17	 '
UNION ALL SELECT '12x27x3	 '
UNION ALL SELECT '14x17x25 '
UNION ALL SELECT '24x4x1	 '
UNION ALL SELECT '18x28x18 '
UNION ALL SELECT '9x14x26	 '
UNION ALL SELECT '28x24x17 '
UNION ALL SELECT '1x26x12	 '
UNION ALL SELECT '2x18x20	 '
UNION ALL SELECT '12x19x22 '
UNION ALL SELECT '19x25x20 '
UNION ALL SELECT '5x17x27	 '
UNION ALL SELECT '17x29x16 '
UNION ALL SELECT '29x19x11 '
UNION ALL SELECT '16x2x4	 '
UNION ALL SELECT '23x24x1	 '
UNION ALL SELECT '19x18x3	 '
UNION ALL SELECT '28x14x6	 '
UNION ALL SELECT '18x5x23	 '
UNION ALL SELECT '9x24x12	 '
UNION ALL SELECT '15x4x6	 '
UNION ALL SELECT '15x7x24	 '
UNION ALL SELECT '22x15x8	 '
UNION ALL SELECT '22x1x22	 '
UNION ALL SELECT '6x4x22	 '
UNION ALL SELECT '26x1x30	 '
UNION ALL SELECT '8x21x27	 '
UNION ALL SELECT '7x1x11	 '
UNION ALL SELECT '9x8x18	 '
UNION ALL SELECT '20x27x12 '
UNION ALL SELECT '26x23x20 '
UNION ALL SELECT '26x22x30 '
UNION ALL SELECT '24x3x16	 '
UNION ALL SELECT '8x24x28	 '
UNION ALL SELECT '13x28x5	 '
UNION ALL SELECT '4x29x23	 '
UNION ALL SELECT '22x5x8	 '
UNION ALL SELECT '20x22x3	 '
UNION ALL SELECT '9x9x17	 '
UNION ALL SELECT '28x3x30	 '
UNION ALL SELECT '10x13x10 '
UNION ALL SELECT '10x25x13 '
UNION ALL SELECT '9x20x3	 '
UNION ALL SELECT '1x21x25	 '
UNION ALL SELECT '24x21x15 '
UNION ALL SELECT '21x5x14	 '
UNION ALL SELECT '13x8x20	 '
UNION ALL SELECT '29x17x3	 '
UNION ALL SELECT '5x17x28	 '
UNION ALL SELECT '16x12x7	 '
UNION ALL SELECT '23x1x24	 '
UNION ALL SELECT '4x24x29	 '
UNION ALL SELECT '23x25x14 '
UNION ALL SELECT '8x27x2	 '
UNION ALL SELECT '23x11x13 '
UNION ALL SELECT '13x4x5	 '
UNION ALL SELECT '24x1x26	 '
UNION ALL SELECT '21x1x23	 '
UNION ALL SELECT '10x12x12 '
UNION ALL SELECT '21x29x25 '
UNION ALL SELECT '27x25x30 '
UNION ALL SELECT '24x23x4	 '
UNION ALL SELECT '1x30x23	 '
UNION ALL SELECT '29x28x14 '
UNION ALL SELECT '4x11x30	 '
UNION ALL SELECT '9x25x10	 '
UNION ALL SELECT '17x11x6	 '
UNION ALL SELECT '14x29x30 '
UNION ALL SELECT '23x5x5	 '
UNION ALL SELECT '25x18x21 '
UNION ALL SELECT '8x7x1	 '
UNION ALL SELECT '27x11x3	 '
UNION ALL SELECT '5x10x8	 '
UNION ALL SELECT '11x1x11	 '
UNION ALL SELECT '16x17x26 '
UNION ALL SELECT '15x22x19 '
UNION ALL SELECT '16x9x6	 '
UNION ALL SELECT '18x13x27 '
UNION ALL SELECT '26x4x22	 '
UNION ALL SELECT '1x20x21	 '
UNION ALL SELECT '6x14x29	 '
UNION ALL SELECT '11x7x6	 '
UNION ALL SELECT '1x23x7	 '
UNION ALL SELECT '12x19x13 '
UNION ALL SELECT '18x21x25 '
UNION ALL SELECT '15x17x20 '
UNION ALL SELECT '23x8x9	 '
UNION ALL SELECT '15x9x26	 '
UNION ALL SELECT '9x12x9	 '
UNION ALL SELECT '12x13x14 '
UNION ALL SELECT '27x26x7	 '
UNION ALL SELECT '11x19x22 '
UNION ALL SELECT '16x12x21 '
UNION ALL SELECT '10x30x28 '
UNION ALL SELECT '21x2x7	 '
UNION ALL SELECT '12x9x18	 '
UNION ALL SELECT '7x17x14	 '
UNION ALL SELECT '13x17x17 '
UNION ALL SELECT '3x21x10	 '
UNION ALL SELECT '30x9x15	 '
UNION ALL SELECT '2x8x15	 '
UNION ALL SELECT '15x12x10 '
UNION ALL SELECT '23x26x9	 '
UNION ALL SELECT '29x30x10 '
UNION ALL SELECT '30x22x17 '
UNION ALL SELECT '17x26x30 '
UNION ALL SELECT '27x26x20 '
UNION ALL SELECT '17x28x17 '
UNION ALL SELECT '30x12x16 '
UNION ALL SELECT '7x23x15	 '
UNION ALL SELECT '30x15x19 '
UNION ALL SELECT '13x19x10 '
UNION ALL SELECT '22x10x4	 '
UNION ALL SELECT '17x23x10 '
UNION ALL SELECT '2x28x18	 '
UNION ALL SELECT '27x21x28 '
UNION ALL SELECT '24x26x5	 '
UNION ALL SELECT '6x23x25	 '
UNION ALL SELECT '17x4x16	 '
UNION ALL SELECT '14x1x13	 '
UNION ALL SELECT '23x21x11 '
UNION ALL SELECT '14x15x30 '
UNION ALL SELECT '26x13x10 '
UNION ALL SELECT '30x19x25 '
UNION ALL SELECT '26x6x26	 '
UNION ALL SELECT '9x16x29	 '
UNION ALL SELECT '15x2x24	 '
UNION ALL SELECT '13x3x20	 '
UNION ALL SELECT '23x12x30 '
UNION ALL SELECT '22x23x23 '
UNION ALL SELECT '8x21x2	 '
UNION ALL SELECT '18x28x5	 '
UNION ALL SELECT '21x27x14 '
UNION ALL SELECT '29x28x23 '
UNION ALL SELECT '12x30x28 '
UNION ALL SELECT '17x16x3	 '
UNION ALL SELECT '5x19x11	 '
UNION ALL SELECT '28x22x22 '
UNION ALL SELECT '1x4x28	 '
UNION ALL SELECT '10x10x14 '
UNION ALL SELECT '18x15x7	 '
UNION ALL SELECT '18x11x1	 '
UNION ALL SELECT '12x7x16	 '
UNION ALL SELECT '10x22x24 '
UNION ALL SELECT '27x25x6	 '
UNION ALL SELECT '19x29x25 '
UNION ALL SELECT '10x1x26	 '
UNION ALL SELECT '26x27x30 '
UNION ALL SELECT '4x23x19	 '
UNION ALL SELECT '24x19x4	 '
UNION ALL SELECT '21x11x14 '
UNION ALL SELECT '4x13x27	 '
UNION ALL SELECT '9x1x11	 '
UNION ALL SELECT '16x20x8	 '
UNION ALL SELECT '4x3x11	 '
UNION ALL SELECT '1x16x12	 '
UNION ALL SELECT '14x6x30	 '
UNION ALL SELECT '8x1x10	 '
UNION ALL SELECT '11x18x7	 '
UNION ALL SELECT '29x28x30 '
UNION ALL SELECT '4x21x8	 '
UNION ALL SELECT '3x21x4	 '
UNION ALL SELECT '6x1x5	 '
UNION ALL SELECT '26x18x3	 '
UNION ALL SELECT '28x27x27 '
UNION ALL SELECT '17x3x12	 '
UNION ALL SELECT '6x1x22	 '
UNION ALL SELECT '23x12x28 '
UNION ALL SELECT '12x13x2	 '
UNION ALL SELECT '11x2x13	 '
UNION ALL SELECT '7x1x28	 '
UNION ALL SELECT '27x6x25	 '
UNION ALL SELECT '14x14x3	 '
UNION ALL SELECT '14x11x20 '
UNION ALL SELECT '2x27x7	 '
UNION ALL SELECT '22x24x23 '
UNION ALL SELECT '7x15x20	 '
UNION ALL SELECT '30x6x17	 '
UNION ALL SELECT '20x23x25 '
UNION ALL SELECT '18x16x27 '
UNION ALL SELECT '2x9x6	 '
UNION ALL SELECT '9x18x19	 '
UNION ALL SELECT '20x11x22 '
UNION ALL SELECT '11x16x19 '
UNION ALL SELECT '14x29x23 '
UNION ALL SELECT '14x9x20	 '
UNION ALL SELECT '8x10x12	 '
UNION ALL SELECT '18x17x6	 '
UNION ALL SELECT '28x7x16	 '
UNION ALL SELECT '12x19x28 '
UNION ALL SELECT '5x3x16	 '
UNION ALL SELECT '1x25x10	 '
UNION ALL SELECT '4x14x10	 '
UNION ALL SELECT '9x6x3	 '
UNION ALL SELECT '15x27x28 '
UNION ALL SELECT '13x26x14 '
UNION ALL SELECT '21x8x25	 '
UNION ALL SELECT '29x10x20 '
UNION ALL SELECT '14x26x30 '
UNION ALL SELECT '25x13x28 '
UNION ALL SELECT '1x15x23	 '
UNION ALL SELECT '6x20x21	 '
UNION ALL SELECT '18x2x1	 '
UNION ALL SELECT '22x25x16 '
UNION ALL SELECT '23x25x17 '
UNION ALL SELECT '2x14x21	 '
UNION ALL SELECT '14x25x16 '
UNION ALL SELECT '12x17x6	 '
UNION ALL SELECT '19x29x15 '
UNION ALL SELECT '25x9x6	 '
UNION ALL SELECT '19x17x13 '
UNION ALL SELECT '24x22x5	 '
UNION ALL SELECT '19x4x13	 '
UNION ALL SELECT '10x18x6	 '
UNION ALL SELECT '6x25x6	 '
UNION ALL SELECT '23x24x20 '
UNION ALL SELECT '8x22x13	 '
UNION ALL SELECT '25x10x29 '
UNION ALL SELECT '5x12x25	 '
UNION ALL SELECT '20x5x11	 '
UNION ALL SELECT '7x16x29	 '
UNION ALL SELECT '29x24x22 '
UNION ALL SELECT '28x20x1	 '
UNION ALL SELECT '10x27x10 '
UNION ALL SELECT '6x9x27	 '
UNION ALL SELECT '26x15x30 '
UNION ALL SELECT '26x3x19	 '
UNION ALL SELECT '20x11x3	 '
UNION ALL SELECT '26x1x29	 '
UNION ALL SELECT '6x23x4	 '
UNION ALL SELECT '6x13x21	 '
UNION ALL SELECT '9x23x25	 '
UNION ALL SELECT '15x1x10	 '
UNION ALL SELECT '29x12x13 '
UNION ALL SELECT '7x8x24	 '
UNION ALL SELECT '29x30x27 '
UNION ALL SELECT '3x29x19	 '
UNION ALL SELECT '14x16x17 '
UNION ALL SELECT '4x8x27	 '
UNION ALL SELECT '26x17x8	 '
UNION ALL SELECT '10x27x17 '
UNION ALL SELECT '11x28x17 '
UNION ALL SELECT '17x16x27 '
UNION ALL SELECT '1x8x22	 '
UNION ALL SELECT '6x30x16	 '
UNION ALL SELECT '7x30x22	 '
UNION ALL SELECT '20x12x3	 '
UNION ALL SELECT '18x10x2	 '
UNION ALL SELECT '20x21x26 '
UNION ALL SELECT '11x1x17	 '
UNION ALL SELECT '9x15x15	 '
UNION ALL SELECT '19x14x30 '
UNION ALL SELECT '24x22x20 '
UNION ALL SELECT '11x26x23 '
UNION ALL SELECT '14x3x23	 '
UNION ALL SELECT '1x28x29	 '
UNION ALL SELECT '29x20x4	 '
UNION ALL SELECT '1x4x20	 '
UNION ALL SELECT '12x26x8	 '
UNION ALL SELECT '14x11x14 '
UNION ALL SELECT '14x19x13 '
UNION ALL SELECT '15x13x24 '
UNION ALL SELECT '16x7x26	 '
UNION ALL SELECT '11x20x11 '
UNION ALL SELECT '5x24x26	 '
UNION ALL SELECT '24x25x7	 '
UNION ALL SELECT '21x3x14	 '
UNION ALL SELECT '24x29x20 '
UNION ALL SELECT '7x12x1	 '
UNION ALL SELECT '16x17x4	 '
UNION ALL SELECT '29x16x21 '
UNION ALL SELECT '28x8x17	 '
UNION ALL SELECT '11x30x25 '
UNION ALL SELECT '1x26x23	 '
UNION ALL SELECT '25x19x28 '
UNION ALL SELECT '30x24x5	 '
UNION ALL SELECT '26x29x15 '
UNION ALL SELECT '4x25x23	 '
UNION ALL SELECT '14x25x19 '
UNION ALL SELECT '29x10x7	 '
UNION ALL SELECT '29x29x28 '
UNION ALL SELECT '19x13x24 '
UNION ALL SELECT '21x28x5	 '
UNION ALL SELECT '8x15x24	 '
UNION ALL SELECT '1x10x12	 '
UNION ALL SELECT '2x26x6	 '
UNION ALL SELECT '14x14x4	 '
UNION ALL SELECT '10x16x27 '
UNION ALL SELECT '9x17x25	 '
UNION ALL SELECT '25x8x7	 '
UNION ALL SELECT '1x9x28	 '
UNION ALL SELECT '10x8x17	 '
UNION ALL SELECT '4x12x1	 '
UNION ALL SELECT '17x26x29 '
UNION ALL SELECT '23x12x26 '
UNION ALL SELECT '2x21x22	 '
UNION ALL SELECT '18x23x13 '
UNION ALL SELECT '1x14x5	 '
UNION ALL SELECT '25x27x26 '
UNION ALL SELECT '4x30x30	 '
UNION ALL SELECT '5x13x2	 '
UNION ALL SELECT '17x9x6	 '
UNION ALL SELECT '28x18x28 '
UNION ALL SELECT '7x30x2	 '
UNION ALL SELECT '28x22x17 '
UNION ALL SELECT '14x15x14 '
UNION ALL SELECT '10x14x19 '
UNION ALL SELECT '6x15x22	 '
UNION ALL SELECT '27x4x17	 '
UNION ALL SELECT '28x21x6	 '
UNION ALL SELECT '19x29x26 '
UNION ALL SELECT '6x17x17	 '
UNION ALL SELECT '20x13x16 '
UNION ALL SELECT '25x4x1	 '
UNION ALL SELECT '2x9x5	 '
UNION ALL SELECT '30x3x1	 '
UNION ALL SELECT '24x21x2	 '
UNION ALL SELECT '14x19x12 '
UNION ALL SELECT '22x5x23	 '
UNION ALL SELECT '14x4x21	 '
UNION ALL SELECT '10x2x17	 '
UNION ALL SELECT '3x14x10	 '
UNION ALL SELECT '17x5x3	 '
UNION ALL SELECT '22x17x13 '
UNION ALL SELECT '5x19x3	 '
UNION ALL SELECT '29x22x6	 '
UNION ALL SELECT '12x28x3	 '
UNION ALL SELECT '9x21x25	 '
UNION ALL SELECT '10x2x14	 '
UNION ALL SELECT '13x26x7	 '
UNION ALL SELECT '18x23x2	 '
UNION ALL SELECT '9x14x17	 '
UNION ALL SELECT '21x3x13	 '
UNION ALL SELECT '13x23x9	 '
UNION ALL SELECT '1x20x4	 '
UNION ALL SELECT '11x4x1	 '
UNION ALL SELECT '19x5x30	 '
UNION ALL SELECT '9x9x29	 '
UNION ALL SELECT '26x29x14 '
UNION ALL SELECT '1x4x10	 '
UNION ALL SELECT '7x27x30	 '
UNION ALL SELECT '8x3x23	 '
UNION ALL SELECT '1x27x27	 '
UNION ALL SELECT '7x27x27	 '
UNION ALL SELECT '1x26x16	 '
UNION ALL SELECT '29x16x14 '
UNION ALL SELECT '18x6x12	 '
UNION ALL SELECT '24x24x24 '
UNION ALL SELECT '26x2x19	 '
UNION ALL SELECT '15x17x4	 '
UNION ALL SELECT '11x7x14	 '
UNION ALL SELECT '14x19x10 '
UNION ALL SELECT '9x10x1	 '
UNION ALL SELECT '14x17x9	 '
UNION ALL SELECT '20x19x13 '
UNION ALL SELECT '25x20x8	 '
UNION ALL SELECT '24x20x21 '
UNION ALL SELECT '26x30x2	 '
UNION ALL SELECT '24x2x10	 '
UNION ALL SELECT '28x4x13	 '
UNION ALL SELECT '27x17x11 '
UNION ALL SELECT '15x3x8	 '
UNION ALL SELECT '11x29x10 '
UNION ALL SELECT '26x15x16 '
UNION ALL SELECT '4x28x22	 '
UNION ALL SELECT '7x5x22	 '
UNION ALL SELECT '10x28x9	 '
UNION ALL SELECT '6x28x13	 '
UNION ALL SELECT '10x5x6	 '
UNION ALL SELECT '20x12x6	 '
UNION ALL SELECT '25x30x30 '
UNION ALL SELECT '17x16x14 '
UNION ALL SELECT '14x20x3	 '
UNION ALL SELECT '16x10x8	 '
UNION ALL SELECT '9x28x14	 '
UNION ALL SELECT '16x12x12 '
UNION ALL SELECT '11x13x25 '
UNION ALL SELECT '21x16x28 '
UNION ALL SELECT '10x3x18	 '
UNION ALL SELECT '5x9x20	 '
UNION ALL SELECT '17x23x5	 '
UNION ALL SELECT '3x13x16	 '
UNION ALL SELECT '29x30x17 '
UNION ALL SELECT '2x2x8	 '
UNION ALL SELECT '15x8x30	 '
UNION ALL SELECT '20x1x16	 '
UNION ALL SELECT '23x10x29 '
UNION ALL SELECT '4x5x4	 '
UNION ALL SELECT '6x18x12	 '
UNION ALL SELECT '26x10x22 '
UNION ALL SELECT '21x10x17 '
UNION ALL SELECT '26x12x29 '
UNION ALL SELECT '7x20x21	 '
UNION ALL SELECT '18x9x15	 '
UNION ALL SELECT '10x23x20 '
UNION ALL SELECT '20x1x27	 '
UNION ALL SELECT '10x10x3	 '
UNION ALL SELECT '25x12x23 '
UNION ALL SELECT '30x11x15 '
UNION ALL SELECT '16x22x3	 '
UNION ALL SELECT '22x10x11 '
UNION ALL SELECT '15x10x20 '
UNION ALL SELECT '2x20x17	 '
UNION ALL SELECT '20x20x1	 '
UNION ALL SELECT '24x16x4	 '
UNION ALL SELECT '23x27x7	 '
UNION ALL SELECT '7x27x22	 '
UNION ALL SELECT '24x16x8	 '
UNION ALL SELECT '20x11x25 '
UNION ALL SELECT '30x28x11 '
UNION ALL SELECT '21x6x24	 '
UNION ALL SELECT '15x2x9	 '
UNION ALL SELECT '16x30x24 '
UNION ALL SELECT '21x27x9	 '
UNION ALL SELECT '7x19x8	 '
UNION ALL SELECT '24x13x28 '
UNION ALL SELECT '12x26x28 '
UNION ALL SELECT '16x21x11 '
UNION ALL SELECT '25x5x13	 '
UNION ALL SELECT '23x3x17	 '
UNION ALL SELECT '23x1x17	 '
UNION ALL SELECT '4x17x18	 '
UNION ALL SELECT '17x13x18 '
UNION ALL SELECT '25x12x19 '
UNION ALL SELECT '17x4x19	 '
UNION ALL SELECT '4x21x26	 '
UNION ALL SELECT '6x28x1	 '
UNION ALL SELECT '23x22x15 '
UNION ALL SELECT '6x23x12	 '
UNION ALL SELECT '21x17x9	 '
UNION ALL SELECT '30x4x23	 '
UNION ALL SELECT '2x19x21	 '
UNION ALL SELECT '28x24x7	 '
UNION ALL SELECT '19x24x14 '
UNION ALL SELECT '13x20x26 '
UNION ALL SELECT '19x24x29 '
UNION ALL SELECT '8x26x3	 '
UNION ALL SELECT '16x12x14 '
UNION ALL SELECT '17x4x21	 '
UNION ALL SELECT '8x4x20	 '
UNION ALL SELECT '13x27x17 '
UNION ALL SELECT '9x21x1	 '
UNION ALL SELECT '29x25x6	 '
UNION ALL SELECT '7x9x26	 '
UNION ALL SELECT '13x25x5	 '
UNION ALL SELECT '6x9x21	 '
UNION ALL SELECT '12x10x11 '
UNION ALL SELECT '30x28x21 '
UNION ALL SELECT '15x6x2	 '
UNION ALL SELECT '8x18x19	 '
UNION ALL SELECT '26x20x24 '
UNION ALL SELECT '26x17x14 '
UNION ALL SELECT '27x8x1	 '
UNION ALL SELECT '19x19x18 '
UNION ALL SELECT '25x24x27 '
UNION ALL SELECT '14x29x15 '
UNION ALL SELECT '22x26x1	 '
UNION ALL SELECT '14x17x9	 '
UNION ALL SELECT '2x6x23	 '
UNION ALL SELECT '29x7x5	 '
UNION ALL SELECT '14x16x19 '
UNION ALL SELECT '14x21x18 '
UNION ALL SELECT '10x15x23 '
UNION ALL SELECT '21x29x14 '
UNION ALL SELECT '20x29x30 '
UNION ALL SELECT '23x11x5	 ';

WITH 
TrimmedDims AS
(
	SELECT Dims = RTRIM(REPLACE(Dims, '	', ''))
	FROM #Input
)
, XLoc AS 
(
	SELECT Dims
		, FirstX = charindex('x', Dims)
		, SecondX = LEN(Dims) - charindex('x', REVERSE(Dims)) + 1
	FROM TrimmedDims
)
, Dims AS 
(
	SELECT BoxL = CAST(LEFT(Dims, FirstX - 1) as Int)
		, BoxW = CAST(SUBSTRING(Dims, FirstX + 1, SecondX - FirstX - 1) as Int)
		, BoxH = Right(Dims, LEN(Dims) - SecondX)
	FROM XLoc
)
, Measurements As 
(
	SELECT LW = BoxL * BoxW 
		, LH = BoxL * BoxH 
		, WH = BoxW * BoxH 
		, Slack = CASE WHEN BoxL >= BoxW AND BoxL >= BoxH THEN BoxH * BoxW 
				WHEN BoxW >= BoxH THEN BoxL * BoxH 
				ELSE BoxL * BoxW 
				END
		, RibbonWrap = CASE WHEN BoxL >= BoxW AND BoxL >= BoxH THEN BoxH * 2 + BoxW * 2
				WHEN BoxW >= BoxH THEN BoxL * 2 + BoxH * 2
				ELSE BoxL * 2 + BoxW * 2
				END
		, RibbonBow = BoxL * BoxW * BoxH 
	FROM Dims 
)
SELECT Wrapping =  SUM(LW * 2 
		+ LH * 2 
		+ WH * 2 
		+ Slack)
	, Ribbon = SUM(RibbonWrap + RibbonBow)
FROM Measurements;

*Update – I realized there was a part 2 to each day of code, so I updated the code above to figure out how much ribbon to order, too.  It was just adding the last two columns to the last CTE and the last column to the final SELECT statement.  BTW, I have never voluntarily written so many CTE’s in my life…this is about as normal as helping elves with math.

Let me know what you’d do differently. Also, if there’s interest, I also did the next couple days and could make posts for those as well.

Use Compression to Combine Data Quality and Performance

We’ve been using the wrong data types for all the wrong reasons.  DBAs, developers, data architects, etc. have all been told to keep data as narrow as possible, using the smallest data type to get the job done.  We sacrificed the integrity of our data for the performance of our data, and I’d like to thank Karen López (b|t) for pointing out that performance is NOT our #1 Job.

Disclaimer – Enterprise-ONLY

I’m talking about compression again, so I’m talking about my disclaimer again.  Compression is an Enterprise-ONLY feature that affects the structure of your data.  That means you can’t even restore a backup of your database to anything other than enterprise or developer editions unless you remove compression before the backup.

Also, this post is about maximizing the benefits of compression to reduce the costs of proper data architecture, not implementing compression blindly.  Test it on non-prod, test it again on non-prod, then only implement a documented change with a rollback plan.

Karen López Made Me Do It!

Wait a minute, it’s Lopez, not López, isn’t it?  Nope, she actually spells her name differently because there are so many programs out there that think we can get by with 26 letters in the alphabet.  To keep Karen happy (it’s in your best interest), we need to change our LastName column from VarChar(50) to NVarChar(50).  I know full well that I’m doubling the size of my data for every row, not just Karen’s, to make sure people don’t have to change their names to make me happy.

I’m wrong about half of that last sentence…. Yes, Unicode data is 2 bytes per character uncompressed while non-Unicode data is only 1 byte per character.  The key word being uncompressed, because that’s not your only option starting in SQL Server 2008 R2 (as opposed to 2008 when we got compression in general).  Look at what BOL has to say about Unicode Compression Implementation for more info on that, with the most important part being that you have to implement at least row-level compression to gain this advantage.

Now we can accommodate Karen without complaints.  After all, why should we complain?  She’s a great person to have around.

Running Out Of Numbers

Keeping up with my Karen López theme, she made a great presentation with Tom LaRock (b|t) called The Ticking Timebombs in Your Database that focused a lot on picking the right data type for identity columns.  Spoiler Alert…know your data well enough to choose either BigInt or Int for an identity column, then start with the maximum negative value for that type.

Look at your existing databases to see if you’re running out of numbers using Tom’s post SQL Server Identity Values Check.  You might find that you already have a case where there’s a 4-byte column about to hit the maximum number for an Int.  Maybe it wasn’t seeded with a big negative number, or maybe it’s because the database was just more active than it was designed for.  The important part is that this timebomb is quietly ticking down to the 3 AM phone call Tom just warned us about.

Now I’d like to stand in opposition to Karen and Tom a bit by suggesting starting a BigInt at the maximum negative value for Int.  Say you’re pretty sure an Int is big enough, but you’re not willing to bet your career on it.  However, you really like the idea of a 4-byte column as opposed to an 8-byte column.  Here’s what you do…  Make the column a BigInt (you’re not betting your career here), start it at -2,147,483,648 (maximum negative value for Int), increment by 1, and use compression.

Since the value is in the range of a 4-byte signed Int, the compressed stored value of this column will be 4 bytes.  If you happen to go over 2,147,483,647 then you keep your job (probably a good thing) and compression just loses a little of its magic when this column starts using up 8 bytes.

By the time you get to the point that the data can’t be compressed going forward, row-level compression for this single column only considering a single index only at the leaf level has saved you 16 GB.  Because it’s likely to be part of the clustered index key, it’s going to be part of all of your nonclustered indexes.  Say you have 4 nonclustered indexes on this table, each compressed giving you another 16 GB of savings.  So you saved 80 GB of disk space which is also less index space fighting for space in memory on your server.

You lose about half of your potential values when you use this method, but how much does that matter when I’d have to look up the name of my maximum value which is 9,223,372,036,854,775,807.  The important part is you avoided the extra cost of a BigInt while not failing miserably if you exceeded the maximum value of an Int.

I have kids and don’t get enough sleep as it is.  A compressed BigInt lets me sleep knowing I have both great performance and a large pool of numbers.

Why the Obsession with Karen?

Karen happens to be the perfect example.  She’s a great data architect who spells her name wrong to accommodate for not-so-great data architects.  Follow that up by her making presentations on using larger numeric data types to avoid running out of numbers.  Then I stumbled across her post about my #1 Job when I had compression on my mind, which cause me to make the correlation between compressing the data to get the performance along with the data quality.  Add to all this that she’s one of the most outspoken people (end of sentence).

It’s worth giving her (b|t) links again, with a little flare this time.  Her blog can be found at DataModel.com, and it’s a must-read for anyone creating or altering a column.  If you’re creating or altering lots of columns, then she’s a must-hire consultant.  She tweets like a bot with ADHD as @DataChick.  Unlike a bot, her tweets are pertinent and interesting…sometimes too interesting.

Just don’t ask her about her dad’s middle name, Canadian “zip codes”, the joys of air travel, or shoes.

Indexing Strategy

What do I care about when I’m playing with indexes? That’s easy. I want as few indexes as possible efficiently referenced by as many pertinent, well-tuned, consistently written queries as is reasonable. It’s explaining that last sentence that’s the hard part.

The thing that will jump out to most people is that my goal doesn’t mention a specific, single query that you want to run great.  Although that’s something I hope to achieve, it only becomes a priority as a last resort.  I’m more concerned with already having that data in memory because the index is being used by many queries, and also having fewer indexes to make data modifications more efficient. There’s more to it than that, but the detail belongs in the body of this post, not the intro.

If I was writing how to make a perfect index for a single reference to a table in a single query, this post could be done rather well in a couple paragraphs. Even though I’m focusing only on OLTP (ruling out columnstore indexes), in 99.999% of environments (ruling out in-memory hash indexes), and not getting into details of filtered indexes or indexed views, there’s still a lot to consider to the point that the first few paragraphs will just be what we’re going to keep in mind. I hope you didn’t have other plans today…

Does this advice apply to you?

It depends! Just kidding, I hate that (non) answer.

There are two targets audiences for this.  While it’s useful to everyone, you’d have to hit both of them for this to be perfect for you. First target is the person; this is written for someone who is comfortable working with indexes for single queries but wants a better view of the big picture. Second target is the database, which is a rather typical in-house OLTP database.

  • More data than you have memory
  • Writes throughout the day, especially in your larger tables
  • Read and write performance matter more than disk space
  • No extreme write loads, such as truncating and repopulating your largest table (easy fix, disable and rebuild your indexes around this action)
  • You have the ability to make indexing changes (this goes beyond what you can typically do with vendor databases)
  • Hopefully, you have the ability to make some code changes

If you or your database aren’t the perfect audience, don’t worry about it.  The most important things to know are what to keep in mind and how those things are interconnected.  The real goal is making more intelligent decisions for your databases, not fitting into a specific mold.

PreReqs!!!

Because this is an advanced look at the fundamentals of indexing strategy, you shouldn’t take offense if you have to do some prerequisite work for all of this to come together. If I give you a severe headache consider reading this stuff first, and the book on the list is well worth a second read cover-to-cover.

PreReqs:

Even with that I’ll probably still give you a headache (a common theme for me), and you’ll probably have questions. Keep in mind that some of the greatest compliments you can give someone are asking them a question and offering them large sums of cash. Although I don’t hand out my LLC’s address to send checks to unless I’ve actually looked over your indexes, I do have a free comments section below that I’d like to see used more often.

Something to consider

Here are all the things we’ll be considering. This is a great list, but nothing is ever going to be 100% all-inclusive or perfect.

Tune your queries: The ideal set of indexes for a poorly written query may be 100% different than the same query after it’s refactored.  We don’t want to constantly tinker with indexes, so this process is best if you tune your most expensive queries first.  An old, unpublished concept for this post had creating an index as a 13 step process with steps 1-11 avoiding indexes through tuning, step 12 making an index, and step 13 seeing if you could delete any other indexes.  That really wasn’t too different from Brent Ozar’s (b|t) Be Creepy method.  Indexing is not the only answer, and not the first answer either.

Query Importance: Some queries just need to complete, some need to run decently well, and some need to run as close to instantly as possible. Knowing how critical your query is will weigh in on how you index, but none of them, not even the most critical queries, will have their performance be the only deciding factor on what indexes you should have. Some outliers will require you to go back and create a specific index for them as a last resort, but there’s no reason to concern ourselves with last resorts before we get started.  If an index would work, even somewhat inefficiently, and it’ll already be in cache, do we want to create another index that will fight it for space in cache?  Don’t create another index to get your most critical query to 0.1 seconds when it’s running at 0.2 seconds and the business is happy with anything under 1.0 second.

Query Consistency: Are you querying the table the same way in all of your queries? If not, do you know you are stressing me out? Do you not care?  Using the same columns to join where possible matters, even if you could live without one of them in some cases because consistent queries mean index reusability.

Query Frequency: Some queries run five times a second, some run once a year. They aren’t even close to the same, and shouldn’t be treated the same.

Query Timing: It is different from frequency. Once-a-day is not just once-a-day. An “8-5 is critical” shop where all your users are sleeping at 3 AM means that we care much less about collateral damage from a 3 AM query. If we do a table scan on a huge clustered index at 3 AM that kicks everything out of cache it might not matter much, but do that same thing at 3 PM and we may want to consider an index even if it’s only used for a single query.

Query Justification: That query timing example threw up a red flag to me. Do we need to run that query at all? Does it need to run in prod, or is there a reporting database I can run it against? Should I consider making a reporting database? Does it need to run at 3 PM? Question the outliers that would change your indexing strategy for both if they need to run and if they could use a little T-SQL help.

Insert / Update / Delete performance: The more indexes you have, the slower your data modifications will be because they have to be written more than once. Wider indexes will be more overhead due to updates hitting it more often, larger index to maintain, and fewer rows per page of data.

Reusability: How many queries can use this index, and how will each of them use it? Is a query using it differently because it’s not referencing the table consistently or because it’s legitimately doing something different? This part is HUGE, and is really going to be a major focus. I didn’t give you a hard time on that query consistency point asking if you cared for no reason!

Memory usage: How much memory is being used, where is it being used, and why? Is that memory being used to fulfill multiple queries (see Reusability, which references Query Consistency, which goes back to Query Tuning)? Could we use less memory if we had a descent index? Is the query that requires all that memory justified and timed properly? These points are starting to mix together and reference themselves, aren’t they? Is indexing strategy an infinite loop?!?!? Yes, yes it is.

Key Lookups: For the queries that use this index, is this a covering index? If not, what would it need to be a covering index? We’ll look at these things: how critical is each query, how often is that query run, how many key lookups does it do, how wide are the total columns we would need to add to be covering, how often are each of those columns updated, what other queries would take advantage of having those columns in there, and is there any filtering being done on those columns?

Maintenance: It’s easy to see having fewer, more narrow indexes would make index rebuilds, index reorgs, and database backups quicker and easier. How about key column order and compression?

TDE: What’s this feature doing in an indexing article?

SQL Edition: Index compression is going to be the biggest one for us today. Online rebuilds can make a big difference, too, but it rarely makes a big difference in what indexes you want.

Pick a table, any table

We’re not going to change the entire database at once.  Partially because it’s overwhelming to you, but mostly because it’s lower risk that’s easier to troubleshoot and roll back if needed. So we’ll pick a single table that we want to have run more efficiently, make a change or two to it, then do it again with either the same table or a different one.

I’m not consistent on how I pick tables.  Although I usually pick one of the ones that’s the largest in the buffer pool that I haven’t made as efficient as I could already, which you can find using the query in  my Cleaning Up the Buffer Pool post.  However, that’s not always how I pick a table.  I’ll also start this off with a query that I wish was running faster, run it with SET STATISTICS IO, TIME ON to see what tables are getting hit in the slowest part, and work on a table that’s getting hit inefficiently here.  There’s no magic to it, just pick a table and reassure the other tables that they’ll get their turn later.

It looks like an infinite loop at first glance because I’ll keep picking tables and loop back to revisit table later, but it’s better to think of it as an upward spiral. That first trip around will give you all kinds of awesome, the second trip will add on to that, and each trip around yields less of an improvement. You could stop wherever you wanted if it wasn’t addictive.

Also as the size of your data changes, the queries hitting your database change, and more, it’s unreasonable to expect your indexing strategy to stay 100% the same.  This isn’t a job where you can ever say you’re really done, just in a better place than you were last week.

I have a table, now what?

At the times of day you want your database to perform great, what’s happening with your table? This may be anything that ever runs on the database for some places, and it may be anything that runs between 8 AM and 5 PM for others.

I’m being as all inclusive as possible by looking at everything that touches the table, so this won’t be as quick and easy as you’d think. Yes, my target audience for this post can create an index for a single query in minutes while I typically spend well over an hour on a single table; how fast you make it through this project isn’t my primary concern.

Once you picked a table to work on look in the proc cache to see what references the indexes on the table.  My query to do that in the same Cleaning Up the Buffer Pool post is good for this, but the one in Querying the Plan Cache is better for viewing an entire table at once. This has every cacheable plan that ran since the last restart of services and hasn’t been forced out of memory. Remember this is a really CPU intensive query that will take several minutes to run and needs to run against prod at a time of day you care about to provide what you need. If you have an extremely busy hour or two, run this as soon as things start to calm down.

Note, there were three different ways something could have avoided you seeing it in the proc cache, and that wasn’t counting if you turned on the typically recommended setting “Optimize for ad-hoc workload” that Kimberly Tripp (b|t) wrote about where you can miss the first run of ad-hoc queries in favor of keeping your memory cleaner. It’s also possible that a query is getting a different plan each time it gets compiled due to stats changing or parameter sniffing, but that affects us a little less since we’re going to make index changes that will change those plans anyways.

The proc cache query is also only capturing index usage. It does capture RID lookups, but not table scans against a heap. You’ll have to look at the modified scripts I put in Querying the Plan Cache to see the table scans because they’re stored differently in the XML.

For now, let’s focus on how things can sneak past our cache and how we can find them.

1 & 2: Was not run since the last restart of services or was forced out of memory. It can be in cache, it’s just not there right now. For that, we’re going to check back on the cache multiple times, and we’re also going to make one or two index changes at a time which will again have us checking back multiple times.

3: Uncacheable plans can happen for several reasons, with the most popular in my experience being temp tables where data was loaded into it then an index was created on the temp table. Whatever the reason, start up extended events or a trace and watch for sql_statement_recompile to help hunt them down. Take this list and search for references of your table to know which ones are relevant. To get bonus points (I’m not keeping score), find out why it’s not getting along with your cache and see if it’s something that should be fixed.

To make things a little more difficult in that step, you’ll also have to look for references to views and functions that reference the table. The views and functions will show up in my Proc Contains Text query, and you’ll have to iterate through that process again.

Keep in mind, this will never be perfect and 100% all-inclusive. I know I said that before, but I need some of the important details repeated to me before they sink in and I have to assume there are others like me. It will be very rare for this to pick up an ad-hoc query that runs for year-end processing. You can use your imagination to find 10 other ways you’ll miss something and still be shocked when a new way pops up.

However, we have enough to move forward, and we’re going to accept the rest as acceptable risk. If you don’t accept some risk you’ll never accept any rewards, it’s just a matter of reducing the risk and knowing enough to make an intelligent decision.

Now that you know what’s running, how is each one referencing the table? Looking at the proc cache, you’ll see predicates and seek predicates, which you’ll combine on a list. You’re going to have to run the stuff that didn’t make it into the proc cache manually on a test server and add them to the list as well.

This is completely overwhelming to do all of it.  The more you do, the more accurate your results will be, but it’s not actually reasonable.  Do what’s reasonable, understand that you’re trading off some level of accuracy for time, and also understand that if you don’t make that tradeoff then you’ll never have time for anything else…not even going home at night.

Here’s what the list could use:

  • Proc or name of SQL Batch
  • How important is it
  • How often does it run
  • When does it run
  • Predicates and Seek Predicates (let’s just call them predicates)
  • Equality columns
  • Range columns
  • Inequality columns
  • Column’s returned
  • Rows returned

If there was a RID or Key Lookup on a reference to a nonclustered index, add the output columns and predicate (not the seek predicate for this case only) from the lookup on here as well.  The seek predicate on a lookup is just the clustered index key or RID bookmark added as hidden key columns on your nonclustered index; they will not help you here.

Now look at this list and look for consistencies. What equality predicates are used a lot? You should be able to find different groups of equality predicates that can accommodate most of your queries, and those are going to be the key columns you’ll consider for your indexes. The first key column is going to be the column all of the queries you want to use this index have in common as an equality column, then iterate through them as the columns are used less and less.

This is not the traditional “order of cardinality” advice I’m sure you’ve heard when creating an index for a specific query, but we’re designing an index for your database, not your query. I’ll go one step further and say if it’s a toss-up between what’s the first key column, consider making it one that’s added sequentially such as DateAdded or ID on tables that see more updates because that will reduce page splits and fragmentation.

A query can take advantage of the chain of key columns starting with the first one. The chain can continue after each equality use. An inequality or range can take advantage of a key column as well, but the first one of these is the end of your chain. Once the chain is broken, everything else can be useful, but only as unordered values that don’t matter if they’re key columns or included columns.

You can stop putting in key columns when either queries stop being able to take advantage of them being ordered or the values you’re getting are either unique or close enough. These key columns aren’t free as Paul Randal (b|t) points out in his post On index key size, index depth, and performance.  If a key column is not very useful, then it’s very useful not to have it there.

I should note that if you’re using an index to enforce uniqueness then it will use all the key columns and none of the included columns to do so.  Based on the last paragraph you don’t want any key columns after it’s unique anyways, so don’t even consider that.  However, included columns aren’t used to calculate uniqueness, so you can make this a covering index if it helps you without hurting the unique constraint.

This process, like any other indexing process, isn’t going to be perfect. You’ll have to weigh your decisions with queries that are more critical or are called more often carry more weight in your decision.

Now that you have your key columns figured out, look at the queries that use more than just those columns. When they reference this index how many rows are they going to return where they have to get more information from the table itself through a lookup? How wide are those columns, and how many other queries are going to do the same? This is the balancing act between adding included columns and accepting key lookups. A key lookup is going to be a nested loop operation making separate calls to get the missing columns from the clustered index (or heap, for those who wish to anger me), so 10,000 key lookups is 10,000 separate calls in a loop. If you have to add a large number of columns to eliminate 10 key lookups then it’s almost never worth it. If you have to add one small column to eliminate 1,000,000 key lookups then it’s almost always worth it. Notice I didn’t use determinate language on those…you’ll have to find your balance, test it, and balance it again.

Some things like adding a column to avoid key lookups may make more of a difference to you than the user, but that doesn’t make it less important. For instance, I just said adding a small column to eliminate 1,000,000 key lookups is almost always worth it. If those 1,000,000 key lookups were from a single execution of a query then the user would probably notice, they might even buy you lunch if you’re lucky. If it was a single key lookup on a query run 1,000,000 times that day it’s still a drop in CPU utilization and a potential 1,000,000 pages from the clustered index that didn’t have to be read into cache. You’re doing this because it adds up to a better running server, not always because it’s noticed.

Your goal here is making an index as reusable as is reasonable. Here are the reasons you’re doing that:

  • Every index will fight to be in cache (assuming you don’t have vastly more memory than databases to fill it), an index that is reusable by many queries will be more likely to already be in cache and that space in cache will be more versatile.
  • Every index is another write process in an Insert, Update, and Delete, you’re trying to cut down on those.
  • Every index needs to be maintained, you’re cutting down on that, too.
  • Every index adds disk space, backup size, backup duration, restore durations, etc..
  • If you use TDE, every time a page is read from disk into memory it is decrypted. A reusable index tends to stay in memory more, reducing the number of times the CPU has to decrypt it. See, I TOLD you this belonged in an indexing strategy post!

Nothing’s free, so here’s what you’re giving up:

  • The index isn’t going to be the prefect index for most queries. Test the performance of your critical queries, but also keep in mind that these indexes are more likely to be in cache which could eliminate physical reads from the execution of those queries.
  • These indexes will tend to be wider than the query needs, which is basically restating that this isn’t going to be the perfect, most efficient index for a query. It also means that physical reads on these indexes will tend to be more expensive as there are fewer rows per page.  Again, keep in mind they’re more likely to be in memory because you’re going with fewer indexes that are shared by more queries.

Once you decide on an index or two to add, there are a couple things to consider.

  • What indexes don’t you want anymore? If a query could use another index slightly more efficiently, it will.  However, if it’s close enough then you want to get rid of that other index because of all those benefits of reusability I just mentioned (weren’t you paying attention?). It’s not a question of if a query would use the other index, it’s a question of if you want it to use it.
  • Some queries “should” use this index based on the key columns, but instead of it showing up as a seek predicate it shows up as a predicate. In these cases either your chain of key columns was broken (if column 2 wasn’t an equality column, column 3 will not be a seek predicate) or this column is not being referenced in a SARGable way.
  • Test in non-prod, not prod. Then test it again.
  • Know you’re accepting risk and understand everything involved the best you can. Have a healthy fear of that risk while also understanding that you took a risk just driving to work this morning.

Once these changes go through keep an eye on how they’re being used over the next couple weeks. If you’re in a rush to make a big impact, start a second table as the first change or two are in progress on the first table. Just don’t get too many changes in motion for a single table at once as that’s typically adding more risk and hiding which changes had positive and negative impacts. This is a process, and the longer it takes you do go through it the better the chance is that you’re doing it right.

If I’m doing this process for someone else who wants consistent improvement without taking on too much time or risk at once, then I like to get on their servers once or twice a month, find one or two changes to suggest, then have those go through testing and implementation.  Afterwards review the results and come up with the next suggestion.  It’s hard to be that patient as a full-time employee, but try.

Cluster It

All of that was talking about nonclustered indexes, but you get to pick a clustered index for your table as well.  Keep in mind this is a bigger change and involves more risk, but it’s also a bigger reward.

So, what do I care about that’s special for a clustered index?

  • Uniqueness
  • Key width
  • Width of columns being queried
  • Column types being returned (some can’t be in nonclustered indexes)
  • Number of rows being returned

The size of your key columns on your clustered index is the MINIMUM size of the key columns on a nonunique nonclustered index, and it’s also the MINIMUM width of the page level of any nonclustered index.  You need to keep that in mind.

However, just because your table has an identity column in it doesn’t mean that’s the best clustered index.  What is the best clustered index is going to vary wildly from table to table; there’s not always going to be a clear answer, and the answer will partially depend on how the table is queried.  I get into that a lot more in my last post, Picking a Clustered Index.  Yes, I wrote that post specifically to keep this one shorter…with mixed results.

If a table is often queried by a relatively small column that’s not unique, but the queries tend to pull back most of the columns in the table and a large number of rows then it’s worth considering using this as part of the clustered index key.

If you don’t then you’re faced with two solutions; you can make a really wide nonclustered index to cover these queries, or you can let the queries decide if they want to do a ton of key lookups or just scan the clustered index.  These don’t sound like fun to me.

You still have to worry about the integrity of your data, so if you’re dropping the unique clustered index with a single column to do this then you almost definitely want to add a unique nonclustered index with that single key column to maintain your data integrity.

Compress It

Index compression is an Enterprise-ONLY feature.

Compression is a very big point to hit on here, even if I’m only giving you the compressed version.  It makes your data smaller on disk (less I/O), smaller in memory (less need for I/O), and actually tends to lower your CPU usage instead of raising it.  I get into a lot more detail in my Data Compression post because I didn’t want to have too much space dedicated to a feature not everyone can use here.

Don’t Forget the Outliers

Go back to that list you made of all the queries hitting a specific table. Were some of the queries different than the rest? There are usually a couple, and that doesn’t necessarily mean there’s an issue. However, I look at these to determine if they are using the table properly.

Are they are joining on all the fields they should be. Sometimes you can get the correct results by joining on 3 of the 4 fields you technically should, so why join on the 4th? Well, index reusability is one of those reasons, because it may not be able to use the proper index because someone skipped a column that happens to be the first key field of the perfect index for this query.

Is the query SARGable? Sometimes you’re joining or filtering on the right fields, but something is written in a way that SQL couldn’t do a direct comparison.

Are you returning too many columns? I’ve seen queries returning 20 columns (or using *, which is a move obvious version of the same thing) to populate a screen that uses 3 of them, and on the SQL side you have a DBA trying to figure out if they should add included columns to an index to make that run more efficiently. The most efficient for this and many other examples is refactoring, not reindexing.

Remember, your goal is to make your server run more efficiently, and tweaking indexes is simply one of your tools. While you’re going through this process keep your eyes open towards how other tools can be used.  SSMS is never going to come up with a warning telling you that you should read a book or two by Itzik Ben-Gan (b|t) or Kalen Delaney (b|t), but I would welcome that change.

Does this negate my previous advice?

If you follow my blog at all, which is suggested in my very biased opinion, you may have seen me talk about Unused and Duplicate Indexes, but I make no mention of them here. Did I forget about them?

No, I did not. This is designing every index you want to have on your table in a reusable way. If that index was not on the list then you’ll want to consider getting rid of it. It’s two ways of looking at the same thing. A complete understanding of both of these methods will help you make intelligent indexing decisions and go as far as you need to for the situation you’re in.

Talk to me

This isn’t a short or easy process, and perhaps I could have worded some of it better.  I enjoy what I do, both writing this post and playing with indexes, and having someone think enough of me to ask me questions on this makes it all the more enjoyable.

I may be rewriting parts of this post as I find ways to reword concepts better, especially as I finalize and tweak my presentation with the same name for which this post is my guide. That presentation will make its debut at SQL Saturday Cleveland on February 6, 2016.

If you feel you can help me improve, please don’t hold back.  I’d rather feel that I’m improving than falsely believe I’m infallible.

Picking a Clustered Index

A Clustered Index is not another term for Primary Key, and more thought should be put into the key columns of the index than always allowing them to default to the PK.

First of all, the primary key is the main way you uniquely identify a row in a table enforcing data integrity, while the clustered index is the order in which a table is stored and retrieved. Although they are typically the same column(s), they are completely separate concepts. The are the same so often that this should be your default, but they are different often enough that you should remember that’s only a default and not a rule.

Example

Let’s have an exception in mind as we go through the details.

  • We have an Orders table which has OrderID as the PK, and 11 other columns including OrderDate, CustomerID, and Status.
  • 90% of our querys are filtered by OrderID; these queries are typically already narrowed down to a single record, but sometimes we have a small handful of OrderID’s in a table where we’re getting more than one a time.
  • When we query by a CustomerID we could be getting up to 5% of the table, and only getting four columns (CustomerID, OrderID, OrderDate, Status)
  • Throughout the day, several people run queries to see what orders were placed for a day, week, or month. These queries vary in other columns they are filtered by and how many columns they return, and several return pratically every column in the table.It seems very obvious to pick OrderID as the clustered index. It’s an identity column, it’s added sequentially, it’s the Primary Key, and 90% of our queries are filtered by it. What more could we hope for, right?

It’s not what I’d pick though

OrderID is the Primary Key, and we’re not going to even consider changing that. However, did you notice that our queries that can filter by OrderID are only pulling back a couple records each, typically only one record? That means the performance of those queries wouldn’t be noticeably hurt if they had to do some key lookups to get the rest of the information. While it’s true that a key lookup will add about 4 reads per row (assuming index depth of 4), a handful of rows means that will add up to 20 or 40 reads total. That’s not an issue.

CustomerID may seem like a logical choice. We could be pulling back 5% of the table for a single customer, and that’s a lot. However, the screen in our app and standard reports only require 4 columns, so we’ll just make a covering nonclustered index on this. Since it’s consistently only using 1/3 of your columns, it’ll be quicker as a nonclustered index anyways.

OrderDate is a little different. It’s rather common to pull back a lot of records, and it’s not too rare to ask for a lot of columns when we do that. This means we have four choices.

      1. Narrow Nonclustered Index – Do key lookups to get other columns. They’re only 4 reads each, but that’s 200,000 reads if they query returns 50,000 rows.
      2. Covering Nonclustered Index – It would have to include pratically every column to avoid key lookups.
      3. Clustered Index (or Table) Scan – Just let it read the entire table. It may be cheaper to read it this way even if you have a nonclustered index because those key lookups add up when you get too many rows.
      4. Make this the first key field of the clustered index – Quick seek with all your columns. There are other things to keep in mind, but this is looking good.

Cluster Date

Ok, so we want to look into making OrderDate our clustered index. It’s not a unique column, so if this was our only key field it would not be a unique index. That’s something you want to avoid because all indexes are unique, it’s just a matter of if you make it that way or if SQL Server has to do it behind your back.

Here’s a rabbit hole that Lisa Jackson-Neblett (b) started me on a couple years ago while I was attending one of David Pless’s (b|t) classes.

A nonunique clustered index will get a new, hidden 4-byte column called the uniquifier that you can’t do anything useful with.  Its value is a 0 when there are no duplicates, then 1 for the first duplicate, and so on.  Ken Simmons (b|t) gets into the details with his post Understanding and Examining the Uniquifier in SQL Server.  There, with that column added on, now it’s unique!

A nonunique nonclustered index will use the clustered index’s key fields to make its key unique.  See this in action in Kendra Little’s (b|t) blog post How to Find ‘Secret Columns’ in Nonclustered Indexes.  That means if you made OrderDate the only key field on your nonunique clustered index, then making a nonunique nonclustered index with the only explicit key field CustomerID would have three key fields, in this order: CustomerID, OrderDate, Uniquifier.

A unique nonclustered index like we’re talking about making on OrderID to enforce the primary key would still need the clustered index’s key, just not as key fields.  Instead it would add OrderDate and Uniquifier as included columns.

Make it Unique

While there are times having a nonunique clustered index is a good idea, this isn’t one of them.  In fact, almost any time you think it’s a good idea it’s because you’re missing something.

In this case it’d be easy to add a second key column of OrderID and call it unique, eliminating the 4-byte uniquifier by adding in what’s probably a 4-byte integer field making the clustered index key the same size.  Even if it was a bigger company that needed a big int for the column, at least you have a useful column.

The nonclustered index on CustomerID will now have the key columns CustomerID, OrderDate, and OrderID.  That’s not that big of a deal because you were planning on including those two columns anyways.  You’ll still declare the index with OrderID, OrderDate, and Status as included columns.  SQL Server will promote OrderID and OrderDate to key columns, but you want them on your definition so they’re still in the index if someone changes the clustered index down the road.  The net effect is that this nonclustered index just got 4-bytes smaller for included columns because you made the clustered index unique.

The nonclustered index on OrderID will still have just the one key field because it’s unique, and it will include OrderDate automatically.  Just like the index on CustomerID, the included columns are 4-bytes smaller now.

Query It

Now that you have these indexes, when you query by OrderID you have to do key lookups on the few rows you return.  It’s more expensive than it was before, but the time this adds is usually not an issue because you’re talking about so few extra reads.

Since you’re sorted by OrderDate and it’s typical to want to know about your recent orders more than historical orders, you’re also being very efficient with your memory.  It’s an advanced piece to look at, but worth mentioning.  The first 10,000,000 leaf-level pages of the clustered index are historical data that is rarely read (probably not in cache), but the last 100,000 pages are more current and read a lot (in cache).  Not only that, but each page has many orders that we need to have in cache, and that difference adds up.  Yes, it’s true we would have had the same effect if we left OrderID as the clustered index, but it’s good to know we didn’t hurt anything.

Fragmented Cluster

With a clustered index as an incremental identity column it’s always adding data to the end of the table, so page splits are rare.  Yes, it’s true that adding another page at the end of the table counts as a page split, but it’s not the kind that slows us down.  Changing the clustered index like this adds this as a concern, making it more likely that this index will need to be cleaned up by our index maintenance task.

We made the first key field OrderDate.  While it’s not guaranteed to be the last row in the table, I’d expect that to be somewhat normal which would avoid fragmentation.  If, however, you have a single order in there with an OrderDate of 2099-01-01, you’re doomed.  Well, not doomed, but every new page added will split the last page 50/50 and write to the second page.  Not only did it have to do more work to split the pages instead of just creating a new one, but it also left a 50% full page while having you start out on a page half way to requiring another page split.

This isn’t a deal-breaker, but it’s another cost we have to keep in mind.

What was that again?

So the primary key wasn’t the best clustered index in this case because of how the table is queried.  By the numbers it’s queried more by the primary key, but we had to look at what the queries were doing as well.  Nothing beats knowing your data.

Although I just talked you through this without running the scripts and testing every piece, it does not mean you can make a change like this without testing (I’m not allowed to, I wrote it into my mental contract with myself).  Get a typical load and run it in non-prod, make this change in non-prod, then run it in non-prod again.  Feel free to measure twice and cut once, I heard that’s a good idea.

They Can’t All Be Clustered

You only have one clustered index on your table because, well, it IS your table.  However, you can have lots of nonclustered indexes.  I will say that if you can tell me how many nonclustered indexes you can have then you’re doing it wrong, but you’re probably doing it wrong if you don’t have any, too.  Look into my next post, Indexing Strategy, to start to get an idea of what you want to do with your nonclustered indexes.

Digging Deeper

The more you learn about indexing the more intelligent your decisions are going to be, so keep learning.  Know the details of what’s in an index, why it’s there, and how it affects you.  Practically everything you do with SQL Server is either modifying or retrieving data from an index.

Here are a couple things to think about.

The clustered index is the table itself ordered by the key columns, and you can pretty much think of it like a nonclustered index that automatically includes every other column in the table.  Although you can have a table without a clustered index, it’s typically a heap of trouble you don’t want.

Because all of the columns in the clustered index are at least in the leaf-level pages of all nonclustered indexes, any update to any of the key columns of the clustered index will be an update to all of the nonclustered indexes.  I’ve seen times where this was an issue, I’ve seen times where it wasn’t an issue, and I’ve seen many more times where the clustered index key just isn’t on columns that get updated.  In this case there may be rare times when an order date is changed, but nothing to worry about.  Besides, most indexes on this table would want the OrderDate column in there anyways.

I may have left you with the impression that data is physically stored in the order of the key field, but that’s not how SQL Server does it.  Read Gail Shaw’s (b|t) post Of clustered indexes and ordering to see SQL Server is really doing.  Basically, SQL Server knows which order to retrieve pages, although they’re not stored in that order because fragmentation happens.  Also, the rows on each page aren’t actually stored in order, either.

A couple days after I originally released this post Matan Yungman (b|t) released When Should You Use Identity as a Clustered Index Key?, which is viewing this from the inserts side.  It’s different from my discussion, but still the Clustered Index.  View things from as many sides as possible if you want the best answer possible…read his post!

I’m sure there’s more to say about the key columns of a clustered index I’m not thinking about right now.  Let me know in the comments below and I’ll add to this list as needed.

Data Compression

Data compression is often misunderstood to cost CPU in exchange for smaller size on disk.  Somewhat true, but that simple explanation ignores other savings that often result in net drop in CPU utilization.

Full disclosure: This is an Enterprise-ONLY feature introduced in SQL 2008.  It is engrained in the structure of your data, so it also means you can’t take a backup of a database that has a compressed index and restore it to anything other than Enterprise or Developer Editions.

Here are the simple facts we’ll play with:

  • Two levels – row-level and page-level
  • Page-level is row-level plus extra compression
  • Compression ratios vary by column types, index width, and data
  • Data is compressed both on disk and in memory
  • CPU goes both ways, and it needs to be tested
    • Uses CPU to compress on writes and index maintenance
    • Uses CPU to decompress when used by a query
    • Saves CPU with fewer physical reads
    • Saves CPU with fewer logical reads
    • And more…

Abstract Thought

This post is at a level of abstraction that doesn’t get into what happens in the background.  My goal is to encourage you to test it out, understand why it helps, and be able to explain that in your change control process.

For those of you who aren’t satisfied with my “Gas pedal make car go fast” explanation, Jes Borland (b|bob|t) wrote A Look Inside SQL Server Row and Page Compression, and Brad McGehee (b|t) wrote A Quick Introduction to Data Compression in SQL Server 2008.

You can even dive into more details such as using different levels of compression on each partition of an index, or even talking to Joey D’Antoni (b|t) about the archival levels of compression on columnstore indexes.

There’s a lot of detail on how compression can cost CPU, but the details that save CPU are typically only mentioned in passing without doing a deep dive into the topic.  Data Compression: Strategy, Capacity Planning and Best Practices mentions that less Logical I/O is less to consume CPU.  SQL Server Database Compression is indirectly mentioning having a smaller B+Tree structure.

The purpose of this post isn’t to earn anyone a doctorate (or claim that I’m at that level), it’s more of a practitioner level.

What’s it do?

Each page is the same 8kb size but contains more data per page, as per Captain Obvious.  This means less space on disk and backups.  Those are nice, but I don’t care too much about that.

Then you read the data into memory so queries can use it.  This is a physical I/O to disk that goes through the CPU (using extra CPU to decrypt it if you use TDE) to make it into memory.  It stays compressed when in memory, so all of your indexes (not just this one) have more room to hang around and avoid more physical I/Os and the costs I just mentioned.

Finally, a query needs to use the data, and that has positives (+) and negatives (-).  The data is more likely to be in cache (+) because it’s smaller and a page with more data is more likely to be referenced. It’s easier to get into cache if it wasn’t there already (+). Then it’s easier to get to the data because the smaller data may have fewer levels in the B+Tree (+). Along the way it has to decompress the root and intermediate level pages (-) which are always row-level compressed when you use any level of compression then decompress the leaf-level pages (-) which are compressed at the level you picked.  However, there are fewer pages, which results in less Logical I/O (+).

You’re not going to accurately figure out the positives and negatives of that last paragraph.  The important part is that you know there are positives AND negatives, which means you put away the calculus and just run some tests.

My experience is that if the data is compressed by 25% or more than it helps more than it hurts.  Find how much you’ll save by running sp_estimate_data_compression_savings for both row-level and page-level compression.  If you don’t get much extra compression with page-level then don’t even test it, it’s an added expense that needs to be justified.

What Compresses Well?

The hard way is to analyze each column, its data type, the data in that column, the width of the index, etc..  You can read the links in the Abstract Thought section to see what Brad and Jes have to say about it if you want.  This will be very important if you’re designing tables and keeping how compressible the data is in mind, but less so if you’re compressing already existing indexes.

The easy way (my personal favorite) is to just run sp_estimate_data_compression_savings I just mentioned and actually compress the indexes on a non-prod server.  Computers are good at math, let them do it.

How to Test

I’m not diving deep into testing here, but there are three things that stand out.

  • How much memory are you saving?
  • How do your queries perform?
  • How much faster is the data pulled from disk?

For how much memory you would save, look at my Cleaning Up the Buffer Pool post to see how much memory that index is using.  Since you’re only changing how much space the data takes and not the columns of the indexes here, you can just multiply that by the new compression ratio.  Use the actual ratio comparing the index size in prod to where you’re testing in non-prod to make sure it’s accurate.  Yes, if you have a 10 GB index which tends to be 100% in cache that you just compressed 80%, it will be like you added 8 GB of memory in many ways.

I do query performance and how much faster the data is pulled from disk together, because that’s how it’s done in the real world.  Pick your queries that hit that index, typically by looking in the plan cache or an XEvent session.  Then, on a non-prod server, run the queries both with and without DBCC DROPCLEANBUFFERS, again, on a non-prod server.

You can remove compression on any index, down to the partition level, by doing ALTER INDEX … REBUILD WITH (DATA_COMPRESSION = NONE).  Adding compression is the same statement with ROW or PAGE instead of NONE.

Sum It Up

Do this all in non-prod.

  1. See what compresses well
  2. Test it
  3. Test it again

The End

Let’s hear from you.  If you needed more data to make an informed choice, throw it in the comments where others can benefit from your experience, and I may even edit the post to add it in.  Also, if this is making a big difference for a lot of people, I’ll do what I can to tweak the SEO and help more people find this through search engines.

The best compliment is a question.  It means you value my expertise enough to want my thoughts and opinions.

Querying the Plan Cache

I love working with indexes, and I need to know what’s using them to work on them intelligently. Most of that information is already there waiting for you to query it. Luckily, Jonathan Kehayias (b|t) did the hard work for us in his post Finding what queries in the plan cache use a specific index, and I could modify his code to get entire tables.

Remember that you’re querying XML and that’s a CPU intensive process. However, you’re also looking for what’s in cache which is most relevant during or just after your busiest time of day on prod. The longer you wait, the more chance a query will be flushed from cache for one reason or another, although waiting a couple hours typically isn’t a problem on a server that’s not under extreme stress.

This first script will find all references to the indexes of a table in the plan cache, including key and RID lookups. However, table scans against heaps are in the XML a little different, and that’s what my second query is for. Hopefully you have a clustered index on every table you want to run this against, but you’ll probably need both of these.

If you’re only looking for a single index in the cache, I already have that query on my Cleaning up the buffer pool post. However, I’m working on some new stuff that’s way too long to put scripts in-line, so I had to create this post to add as a reference. I’ll try to remember to update this post as I have everything done, but keep an eye out for my “Indexing Strategy” post and my presentation by the same name I hope to debut at the Cleveland SQL Saturday in February 2016.

Table Index Usage

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @TableName SYSNAME = '[ShiverMeTuples]'; 
DECLARE @DatabaseName SYSNAME;
 
SELECT @DatabaseName = '[' + DB_NAME() + ']';
 
WITH XMLNAMESPACES
   (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    --n.query('.'),
    cp.plan_handle,
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(@EstimateRows)[1]', 'VARCHAR(128)') AS EstimateRows,
    i.value('(@EstimateIO)[1]', 'VARCHAR(128)') AS EstimateIO,
    i.value('(@EstimateCPU)[1]', 'VARCHAR(128)') AS EstimateCPU,
	cp.usecounts,
    i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(./IndexScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./IndexScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./IndexScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    i.value('(./IndexScan/Object/@Index)[1]', 'VARCHAR(128)') as IndexName,
    --i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./IndexScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS seek_columns,
    RIGHT(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'), len(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)')) - charindex('.', i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'))) as Predicate,
    query_plan
FROM (  SELECT plan_handle, query_plan
        FROM (  SELECT DISTINCT plan_handle
                FROM sys.dm_exec_query_stats WITH(NOLOCK)) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
      ) as tab (plan_handle, query_plan)
INNER JOIN sys.dm_exec_cached_plans AS cp 
    ON tab.plan_handle = cp.plan_handle
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
CROSS APPLY n.nodes('.//RelOp[IndexScan/Object[@Table=sql:variable("@TableName") and @Database=sql:variable("@DatabaseName")]]' ) as s(i)
--WHERE i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') = 1
ORDER BY 3 
OPTION(RECOMPILE, MAXDOP 4);

Table (Heap) Scans

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @TableName SYSNAME = '[ItsAHeapOfSomething]'; 
DECLARE @DatabaseName SYSNAME;
 
SELECT @DatabaseName = '[' + DB_NAME() + ']';
 
WITH XMLNAMESPACES
   (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    n.query('.'),
    cp.plan_handle,
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(./TableScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(./TableScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./TableScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./TableScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    --i.value('(./TableScan/Object/@Table)[1]', 'VARCHAR(128)') as TableName,
    i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./TableScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS seek_columns,
    RIGHT(i.value('(./TableScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'), len(i.value('(./TableScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)')) - charIndex('.', i.value('(./TableScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'))) as Predicate,
	cp.usecounts,
    query_plan
FROM (  SELECT plan_handle, query_plan
        FROM (  SELECT DISTINCT plan_handle
                FROM sys.dm_exec_query_stats WITH(NOLOCK)) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
      ) as tab (plan_handle, query_plan)
INNER JOIN sys.dm_exec_cached_plans AS cp 
    ON tab.plan_handle = cp.plan_handle
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
CROSS APPLY n.nodes('.//RelOp[TableScan/Object[@Table=sql:variable("@TableName") and @Database=sql:variable("@DatabaseName")]]' ) as s(i)
--WHERE i.value('(./TableScan/@Lookup)[1]', 'VARCHAR(128)') = 1
OPTION(RECOMPILE, MAXDOP 4);

SQL Server Error Log Reader

Reading the SQL Server Error Log is miserable.  It contains very useful information you should address as soon as possible, or at least know that it’s happening.  However, it’s hidden between so many informational messages that it’s hard to find, then it’s spread out between multiple files for every server reboot or automated file rollover event you may have set up.

Many DBAs skim these files, but when there’s a single login failure mixed into log backups running every 5 minutes for 100 databases then they’re just happy to have found something.  That login failure tells you nothing, just that someone should have been more careful typing in their password, right?  When you’re just happy you were even able to find something then you’re almost certainly not going to see it clearly enough to notice a trend, such as that login failure happens every Sunday between 10:00 PM and 10:15 PM.  However, if you knew that then you could tell someone that there’s an automated job that’s failing, it’s obviously part of a bigger process because the time varies a little, but it’s consistent enough to say it’s definitely a process.

So, the trick is to get past the junk and to the useful information.  You can listen to Warner Chaves (b|t) in his Most Important Trace Flags post and turn on trace flag 3226 to stop backup information from going to the logs, but I’m always paranoid (it’s part of the job) that it just may come in useful some day.  I know it never has, but I leave it in there anyways.

Even if you do take out information from the logs like that, it’s still a flat file that’s difficult to analyze for any number of reasons.  We’re just a T-SQL kind of group, and flat files just fall flat.

As with everything in SQL Server, I cheat my way through it.  This time I pull it into a temp table, delete the stuff I’m ignoring (please be very, very careful about what you ignore because you’ll never know it happened), then look at the results.  If there’s a login failure then I’ll uncomment the section that deletes everything except a single error and trends will pop out at me.  If I wanted to do more advanced analysis I would run queries doing aggregates of any kind against the temp table that was created.  Everything’s in the format you’re used to analyzing, so you can do crazy things without going crazy.

DECLARE @dStart DateTime 
	, @dEnd DateTime
	, @MaxLogFiles Int 

SELECT @dStart = GetDate()-30
	, @dEnd = GetDate()-0
	, @MaxLogFiles = 5

--Pulls into #TempLog because an empty log file causes errors in the temp table
--If there are no records, don't pass the issues onto your #Log table and return the results

IF OBJECT_ID('tempdb..#Log') IS NOT NULL BEGIN
	DROP TABLE #Log
END

IF OBJECT_ID('tempdb..#TempLog') IS NOT NULL BEGIN
	DROP TABLE #TempLog
END

CREATE TABLE #Log
(
	LogDate DateTime
	, ProcessInfo NVarChar(50)
	, LogText NVarChar(1000)
)

CREATE TABLE #TempLog
(
	LogDate DateTime
	, ProcessInfo NVarChar(50)
	, LogText NVarChar(1000)
)

DECLARE @Num int
SELECT @Num = 0

WHILE @Num < @MaxLogFiles BEGIN
	TRUNCATE TABLE #TempLog 

	INSERT INTO #TempLog 
	exec xp_readerrorlog @Num, 1, null, null, @dStart, @dEnd

	IF @@ROWCOUNT > 0 BEGIN
		INSERT INTO #Log 
		SELECT *
		FROM #TempLog
	END ELSE BEGIN
		SELECT @Num = @MaxLogFiles
	END
		
	SELECT @Num = @Num + 1
END 

/*
--Uncomment to trend out a specific message and ignore the rest
DELETE #Log
WHERE LogText NOT LIKE 'Login failed for user ''WhatAreYouDoingToMe?!?!?''%'
--*/

--Ignore most of the log file rollover process
--Keep "Attempting to cycle" and "The error log has been reinitialized" if you want to confirm it happened and succeeded
DELETE #Log 
WHERE LogText LIKE '%(c) Microsoft Corporation%'
	OR LogText LIKE 'Logging SQL Server messages in file %'
	OR LogText LIKE 'Authentication mode is MIXED.'
	OR LogText LIKE 'System Manufacturer: %'
	OR LogText LIKE 'Server process ID %'
	OR LogText LIKE 'All rights reserved.'
	OR LogText LIKE 'Default collation: %'
	OR LogText LIKE 'The service account is %'
	OR LogText LIKE 'UTC adjustment: %'
	OR LogText LIKE '(c) 2005 Microsoft Corporation.'--Should I be ignoring this or fixing it?
	OR LogText LIKE 'Microsoft SQL Server % on Windows NT %'
	OR LogText LIKE 'The error log has been reinitialized. See the previous log for older entries.'
	OR LogText LIKE 'Attempting to cycle error log.%'

--Ignore databases being backed up and integrity checks running, assuming you verify this some other way.
--I don't want to complain to try to have these removed because I may need that info someday; today isn't that day.
DELETE #Log
WHERE LogText LIKE 'Log was backed up%'
	OR LogText LIKE 'Database differential changes were backed up%'
	OR LogText LIKE 'Backup database with differential successfully %'
	OR LogText LIKE 'Backup database successfully %'
	OR LogText LIKE 'Database backed up%'
	OR LogText LIKE 'DBCC CHECK% found 0 errors %'
	OR LogText LIKE 'CHECKDB for database % finished without errors %'

--We all have vendor databases...
--Ignore the stuff where it keeps making sure the setting is where the setting was.
DELETE #Log
WHERE LogText LIKE 'Configuration option % changed from 30 to 30.%'
	OR LogText LIKE 'Configuration option % changed from 5 to 5.%'
	OR LogText LIKE 'Setting database option COMPATIBILITY_LEVEL to 100 for database ReportServer%'
	OR LogText LIKE 'Configuration option ''user options'' changed from 0 to 0. Run the RECONFIGURE statement to install.'

--Now your own custom ones
--Just be careful.  You'll rarely read logs without this script once you see how easy it is.
--If you put it on the ignore list, you won't see it again.
--I have starting and stopping traces on mine, because my monitoring software likes to start and stop them a lot
----I'm accepting the risk that I won't see other people starting and stopping traces.
DELETE #Log
WHERE LogText LIKE 'Know what risk you''re taking on by putting stuff in here'
	OR LogText LIKE 'You will rarely read logs without this, so you won''t see these ever again'
	OR LogText LIKE 'DBCC TRACEON 3004,%'
	OR LogText LIKE 'DBCC TRACEON 3014,%'
	OR LogText LIKE 'DBCC TRACEON 3604,%'
	OR LogText LIKE 'DBCC TRACEOFF 3604,%'
	OR LogText LIKE 'DBCC TRACEON 3605,%'
	OR LogText LIKE 'Error: %, Severity:%'--They give the english version next
	OR LogText LIKE 'SQL Trace ID % was started by %'
	OR LogText LIKE 'SQL Trace stopped.%'
	OR LogText LIKE 'Changing the status to % for full-text catalog %'
	OR LogText LIKE 'I/O was resumed on database %'
	OR LogText LIKE 'I/O is frozen on database %' 

/*
--When mirroring gives me trouble it lets me know by flooding the logs
--I uncomment this to see if there were other issues in the middle of all that.
DELETE #Log 
WHERE LogText LIKE 'Database mirroring is inactive for database%'
	OR LogText LIKE 'The mirroring connection to%has timed out%'
	OR LogText LIKE 'Database mirroring is active with database%'
--*/

/*
--This is only useful if you're using the trace flag 1222
--Only show the line that says 'deadlock-list'.  Remove this if you need to see the deadlock details.
--Note, only use this when needed.  It will give you a 1 second blind spot for every deadlock found.
--Why aren't you using extended events anyways?
DELETE L
FROM #Log L
	INNER JOIN #Log L2 ON L.LogDate BETWEEN L2.LogDate AND DateAdd(second, 1, L2.LogDate) AND L.ProcessInfo = L2.ProcessInfo 
WHERE L2.LogText = 'deadlock-list'
	AND L.LogText <> 'deadlock-list'
--*/

SELECT * 
FROM #Log
ORDER BY LogDate DESC

Don’t Ignore Me

Anything you ignore you won’t see here again. It’s still in the logs, but not in what you’re reading on your screen when you mentally check the logs off as being read through.  If you’re ignoring anything make sure it either doesn’t matter or you’re watching for it another way.

Backups are the first thing to be ignored.  Yes, yes, they ran successfully, they do that a lot, don’t tell me about them.  That can be good advice gone horribly wrong.  Do you have another way of saying I absolutely know I have backups taken of everything?

DBCC CheckDB ran successfully is next on the list.  Same thing goes for it, except more DBAs miss verifying that this is running and also miss running it.  If you ignore it, how are you verifying that it ran?

I don’t care how you do it.  Do what works best for you, just do something.

Be Careful

I’ll just end by saying be careful again. This code is a life saver when it’s not shooting you in the foot.

Give More Feedback

I write on my blog and get a couple comments at best. I talk at conferences and a large part of the audience fills out evals at the time.  Then I often wonder if I’m making a difference while rarely, if ever, knowing if anyone actually used what they learned at work.  There are some confidence issues at hand here, which I was convinced were just limited to me.  However, talking to a couple others in the SQL community, it’s not just me.

In fact, I told one of the leading respondents to SQL Server Central forums questions that he really made a difference in my career with the quantity and quality of the knowledge he shares.  His response was a lot like I’d imagine mine to be.  A humbleness and sincere thank you with a follow up that it’s very difficult to know if the work you put into the community really makes a difference.  That’s saying something considering that he’s in a more personal position than me with his online presence because he’s often answering user questions while I offer unfocused unsolicited advice.

The blog posts, articles, and forum help you see online is all done on a volunteer basis.  Sure, some people get paid $25 for an article, which is a lower hourly rate than an entry-level DBA if you think about how much work they put into them.  Some people write blog posts to promote their business or help their careers while knowing without a doubt that well over 99% of people who read what they have to say will never hire them.  Yes, there are ways you can say it’s not on a volunteer basis, but if any of us were really in it for the money then almost all of us would opt for something more lucrative such as setting up a lemonade stand on the corner with our kids.

That wasn’t even getting into in-person events like SQL Saturdays where volunteers put everything together then ask for volunteers to come speak, many of whom pay their own travel expenses.  Full disclosure, it’s not completely unpaid, we get invited to a free dinner the night before and typically get a free shirt.  Both of these are amazing benefits because I’m eating dinner with people I look up to then I have a shirt that I wear to work every time I need a good confidence boost, so I can’t say it’s really free.  By-the-way, every SQL Saturday is numbered, and I they all have an email address of SQLSaturday###@SQLSaturday.com.  Help keep the motivation going and force me to update that masked email address to have four digits!

So, you may notice that I write at length about not getting much feedback on my only post that does not allow comments.  I even went out of my way and deleted the name of the guy whose answers I liked so much on SQL Server Central.  Why?!?!  It’s not as counter-productive as it seems.  I know this post didn’t make your career better, I know it’s not likely to change your life, and this isn’t a plea for you to comment on THIS post or for people who inspired ME.

This post is a challenge to you.  Think back to blog posts that helped make you awesome at work.  Think about conferences that were put together so well that all you noticed were the great learning opportunities.  Then go out and comment as publically as possible how it helped, adding as many details as you can.  After that, keep this in mind when you’re learning in the future, be it free or paid events, reading blog posts, or using #sqlhelp on twitter.  Say it all, constructive criticism, compliments, offers to buy someone a beer, don’t hold back!  If you want more of something, speak up.  We have an amazing community, and I want more of it!

By the way, the guy from SQL Server Central likes plain old Budweiser.

Extended Events Intro

Extended Events is supposed to be taking over for most of Profiler and server-side tracing functionality, but there were people like me out there that took their time making the switch. For me the excuse was SQL Batch Completed isn’t in 2008 / 2008 R2, most databases are slow to move off of 2008 because it’s hard to say 2008 can’t do everything you want, and, well, I really like SQL Batch Completed!!!

Now I’m losing my excuses. It’s been three years since 2012 came out and all the new servers I’m installing are 2012 or 2014, so I lost that excuse. Then I came up with “I don’t know how to use them, but I know how to use something that does almost the same thing”; that’s great logic during an outage, but terrible logic for the long term. So, I’m learning it, still learning it, and I have something to help out those who made it to 2012 but are still stuck on the “I don’t know how to use them” phase of denial.

In several of my blog posts I referenced my favorite server-side trace. Capture everything on the server that takes over X seconds, with 5 being my favorite starting place. Yes, this adds about 1ms to anything that takes over 5 seconds, which adds up to…insignificant. Seriously, if this kills your server then it was dead already, but you will see arguments against creating this for long-term runs.  I don’t agree with those arguments, but they’re your servers and I want you to make an informed decision on what you run.

Anyways, here’s how I set it up with Extended Events, for which I used Jonathan Kehayias’s (b|t) Trace to XE Converter to get started.

IF EXISTS (SELECT 1 FROM sys.server_event_sessions WHERE name = 'DurationOver5Sec')
	DROP EVENT SESSION [DurationOver5Sec] ON SERVER;
GO
CREATE EVENT SESSION [DurationOver5Sec]
ON SERVER
ADD EVENT sqlserver.rpc_completed(
	ACTION 
	(
		  sqlserver.client_app_name	-- ApplicationName from SQLTrace
		, sqlserver.client_hostname	-- HostName from SQLTrace
		, sqlserver.client_pid	-- ClientProcessID from SQLTrace
		, sqlserver.database_id	-- DatabaseID from SQLTrace
		, sqlserver.request_id	-- RequestID from SQLTrace
		, sqlserver.server_principal_name	-- LoginName from SQLTrace
		, sqlserver.session_id	-- SPID from SQLTrace
	)
	WHERE 
	(
			duration >= 5000000
	)
),
ADD EVENT sqlserver.sql_batch_completed(
	ACTION 
	(
		  sqlserver.client_app_name	-- ApplicationName from SQLTrace
		, sqlserver.client_hostname	-- HostName from SQLTrace
		, sqlserver.client_pid	-- ClientProcessID from SQLTrace
		, sqlserver.database_id	-- DatabaseID from SQLTrace
		, sqlserver.request_id	-- RequestID from SQLTrace
		, sqlserver.server_principal_name	-- LoginName from SQLTrace
		, sqlserver.session_id	-- SPID from SQLTrace
	)
	WHERE 
	(
		duration >= 5000000
	)
),
ADD EVENT sqlos.wait_info(
	ACTION 
	(
		  sqlserver.client_app_name	-- ApplicationName from SQLTrace
		, sqlserver.client_hostname	-- HostName from SQLTrace
		, sqlserver.client_pid	-- ClientProcessID from SQLTrace
		, sqlserver.database_id	-- DatabaseID from SQLTrace
		, sqlserver.request_id	-- RequestID from SQLTrace
		, sqlserver.server_principal_name	-- LoginName from SQLTrace
		, sqlserver.session_id	-- SPID from SQLTrace
	)
	WHERE
	(
		duration > 5000 --This one is in milliseconds, and I'm not happy about that
            AND ((wait_type > 0 AND wait_type < 22) -- LCK_ waits
                    OR (wait_type > 31 AND wait_type < 38) -- LATCH_ waits
                    OR (wait_type > 47 AND wait_type < 54) -- PAGELATCH_ waits
                    OR (wait_type > 63 AND wait_type < 70) -- PAGEIOLATCH_ waits
                    OR (wait_type > 96 AND wait_type < 100) -- IO (Disk/Network) waits
                    OR (wait_type = 107) -- RESOURCE_SEMAPHORE waits
                    OR (wait_type = 113) -- SOS_WORKER waits
                    OR (wait_type = 120) -- SOS_SCHEDULER_YIELD waits
                    OR (wait_type = 178) -- WRITELOG waits
                    OR (wait_type > 174 AND wait_type < 177) -- FCB_REPLICA_ waits
                    OR (wait_type = 186) -- CMEMTHREAD waits
                    OR (wait_type = 187) -- CXPACKET waits
                    OR (wait_type = 207) -- TRACEWRITE waits
                    OR (wait_type = 269) -- RESOURCE_SEMAPHORE_MUTEX waits
                    OR (wait_type = 283) -- RESOURCE_SEMAPHORE_QUERY_COMPILE waits
                    OR (wait_type = 284) -- RESOURCE_SEMAPHORE_SMALL_QUERY waits
	--OR (wait_type = 195) -- WAITFOR
                )
	)
)
ADD TARGET package0.event_file
(
	SET filename = 'DurationOver5Sec.xel',
		max_file_size = 10,
		max_rollover_files = 5
)
WITH 
(
	MAX_MEMORY = 10MB
	, MAX_EVENT_SIZE = 10MB
	, STARTUP_STATE = ON
	, MAX_DISPATCH_LATENCY = 5 SECONDS
	, EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS
);

ALTER EVENT SESSION DurationOver5Sec
ON SERVER
STATE = START;
	

What’s it all mean?

This captures all SQL Batch Completed and RPC Completed events that took over 5 seconds along with any waits that took over 5 seconds. Seems kind simple and much easier to read than the script to create a trace, but there are some things to point out here.

  • Duration – Milliseconds or Microseconds
  • File Location
  • Restart on server reboot

First, duration, is it milliseconds or microseconds? Trick question, it’s both!!! Ola Hallengren (b|t) opened This Connect Item which resulted in Microsoft apologizing and saying they’ll fix it in SQL 2016. They can “fix” it one of two ways. Either they’ll document them as being different scales and your script will still work without any modifications, or they’ll make them all be the same scale and the script I have here will behave differently on different versions of SQL Server. Anyways, the script I have here is 5 seconds all around, which is 5,000,000 microseconds for the SQL and 5,000 milliseconds for the waits.

Continuing on the duration confusion, the wait info’s duration still isn’t what you might think it is.  This is the duration for a single instance of a wait, not cumulative waits for the duration of a specific query or batch.  If you made a loop that executed 10 times calling a wait for delay of 1 second each time then those are 10 separate 1 second waits.  More important, if your query is doing physical reads from disk then it’s a ton of small PageIOLatch waits, not one large one this session will see.  I understand why it has to be this way, but understanding and enjoying are two different things.

The rest of this isn’t nearly as confusing, I promise.

The file location I’m using is just a filename without a path, which will default to where your error log files are. It’s going to be a much more complex name once SQL Server is done with it, and not just a underscore with a rollover count like server-side traces.  However, it will start how it’s written here and you can use wildcards to say which files you want to read.

Now when you restarted the server, or just SQL services, with server-side traces you’d have to manually start that trace again if you wanted it to run all the time. This is exactly how the script I have here works, too. However, Extended Events also added “Startup State” which means it will start when the server starts. The only issue I have with this is that it will ONLY automatically start when the services start. That’s all fine and good if you’re a patient person….it’s not all fine and good for me, so I manually start it when it’s created.

The other minor details I can see having questions on are on the Max Dispatch Latency and Event Retention Mode. These are limits on how it writes to the file location I’m using. Max Dispatch means that SQL Server will write it to the output file within that many seconds after the event actually happens, so it’s obviously an asynchronous action. Event Retention Mode can give SQL Server permission to say it’s too overworked to write the entries and skip them if needed, and I chose to tell SQL Server to drop as many as it wants to if I’d be slowing it down.  It’s all documented in BOL, and it’d be good for you to read through that instead of just listening to me picking out select details.

So, how do you read this? It’s going to put it all in XML, isn’t it!?!?

If you can read this, Kendra Little (b|t) may suggest you’re a demon robot! (While her comment makes me laugh, that post is a great XE reference or it wouldn’t be on here.) Well, she’s probably right in calling this next script names because friendly robots don’t treat your CPU like this. In fact, if you’re on a stressed server or want to do a lot of analysis on this, it’s going to be best to copy the .xel files to a server you care a little less about and shred the XML there.

Unfortunately, I wrote my script to automatically find the files and add a wildcard character, so you’d have to modify my script to run it on another server. The point is getting you started with as little frustration as possible, and this is perfect for that reason. The biggest catch is that the session has to be running for this to work without changes because it’s grabbing the full file name with path from the DMVs for running sessions.

Anyways, here goes….

DECLARE 
	@SessionName SysName 
	, @TopCount Int = 1000

--SELECT @SessionName = 'UserErrors'
SELECT @SessionName = 'DurationOver5Sec'
--SELECT @SessionName = 'system_health'
/* 
SELECT * FROM sys.traces

SELECT  Session_Name = s.name, s.blocked_event_fire_time, s.dropped_buffer_count, s.dropped_event_count, s.pending_buffers
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE target_name = 'event_file'
--*/

SET STATISTICS IO, TIME ON

IF OBJECT_ID('tempdb..#Events') IS NOT NULL BEGIN
	DROP TABLE #Events
END

IF OBJECT_ID('tempdb..#Queries') IS NOT NULL BEGIN
	DROP TABLE #Queries 
END

DECLARE @Target_File NVarChar(1000)
	, @Target_Dir NVarChar(1000)
	, @Target_File_WildCard NVarChar(1000)

SELECT @Target_File = CAST(t.target_data as XML).value('EventFileTarget[1]/File[1]/@name', 'NVARCHAR(256)')
FROM sys.dm_xe_session_targets t
	INNER JOIN sys.dm_xe_sessions s ON s.address = t.event_session_address
WHERE s.name = @SessionName
	AND t.target_name = 'event_file'

SELECT @Target_Dir = LEFT(@Target_File, Len(@Target_File) - CHARINDEX('\', REVERSE(@Target_File))) 

SELECT @Target_File_WildCard = @Target_Dir + '\'  + @SessionName + '_*.xel'

--SELECT @Target_File_WildCard

SELECT TOP (@TopCount) CAST(event_data AS XML) AS event_data_XML
INTO #Events
FROM sys.fn_xe_file_target_read_file(@Target_File_WildCard, null, null, null) AS F
ORDER BY File_name DESC
	, file_offset DESC 

SELECT  EventType = event_data_XML.value('(event/@name)[1]', 'varchar(50)')
	, Duration_sec = CAST(event_data_XML.value ('(/event/data[@name=''duration'']/value)[1]', 'BIGINT')/CASE WHEN event_data_XML.value('(event/@name)[1]', 'varchar(50)') LIKE 'wait%' THEN 1000.0 ELSE 1000000.0 END as DEC(20,3)) 
	, CPU_sec = CAST(event_data_XML.value ('(/event/data[@name=''cpu_time'']/value)[1]', 'BIGINT')/1000000.0 as DEC(20,3))
	, physical_reads_k = CAST(event_data_XML.value ('(/event/data  [@name=''physical_reads'']/value)[1]', 'BIGINT')/1000.0 as DEC(20,3))
	, logical_reads_k = CAST(event_data_XML.value ('(/event/data  [@name=''logical_reads'']/value)[1]', 'BIGINT') /1000.0 as DEC(20,3))
	, writes_k = CAST(event_data_XML.value ('(/event/data  [@name=''writes'']/value)[1]', 'BIGINT')/1000.0 as DEC(20,3))
	, row_count = event_data_XML.value ('(/event/data  [@name=''row_count'']/value)[1]', 'BIGINT')
	, Statement_Text = ISNULL(event_data_XML.value ('(/event/data  [@name=''statement'']/value)[1]', 'NVARCHAR(4000)'), event_data_XML.value ('(/event/data  [@name=''batch_text''     ]/value)[1]', 'NVARCHAR(4000)')) 
	, TimeStamp = DateAdd(Hour, DateDiff(Hour, GetUTCDate(), GetDate()) , CAST(event_data_XML.value('(event/@timestamp)[1]', 'varchar(50)') as DateTime2))
	, SPID = event_data_XML.value ('(/event/action  [@name=''session_id'']/value)[1]', 'BIGINT')
	, Username = event_data_XML.value ('(/event/action  [@name=''server_principal_name'']/value)[1]', 'NVARCHAR(256)')
	, Database_Name = DB_Name(event_data_XML.value ('(/event/action  [@name=''database_id'']/value)[1]', 'BIGINT'))
	, client_app_name = event_data_XML.value ('(/event/action  [@name=''client_app_name'']/value)[1]', 'NVARCHAR(256)')
	, client_hostname = event_data_XML.value ('(/event/action  [@name=''client_hostname'']/value)[1]', 'NVARCHAR(256)')
	, result = ISNULL(event_data_XML.value('(/event/data  [@name=''result'']/text)[1]', 'NVARCHAR(256)'),event_data_XML.value('(/event/data  [@name=''message'']/value)[1]', 'NVARCHAR(256)'))
	, Error = event_data_XML.value ('(/event/data  [@name=''error_number'']/value)[1]', 'BIGINT')
	, Severity = event_data_XML.value ('(/event/data  [@name=''severity'']/value)[1]', 'BIGINT')
	, EventDetails = event_data_XML 
INTO #Queries
FROM #Events

SELECT q.EventType
	, q.Duration_sec
	, q.CPU_sec
	, q.physical_reads_k
	, q.logical_reads_k
	, q.writes_k
	, q.row_count
	, q.Statement_Text
	, q.TimeStamp
	, q.SPID
	, q.Username
	, q.Database_Name
	, client_app_name = CASE LEFT(q.client_app_name, 29)
					WHEN 'SQLAgent - TSQL JobStep (Job '
						THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(q.client_app_name,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(q.client_app_name, 67, len(q.client_app_name)-67)
					ELSE q.client_app_name
					END  
	, q.client_hostname
	, q.result
	, q.Error
	, q.Severity
	, q.EventDetails
FROM #Queries q
--WHERE eventtype NOT IN /*rather typical filtering*/ ('security_error_ring_buffer_recorded', 'sp_server_diagnostics_component_result', 'scheduler_monitor_system_health_ring_buffer_record')
	--AND eventtype NOT IN /*specific troubleshooting filtering*/ ('connectivity_ring_buffer_recorded', 'wait_info')
ORDER BY TimeStamp DESC 

The top is all I typically edit. The Session Name is ‘DurationOver5Sec’ for this purpose, but as you can see it can also read some of the information from the system_health session. The TopCount is just saying I’m only going to look at the most recent 1,000 records BEFORE I shred and filter it, mostly because I love my CPU and this changes how much XML I’m throwing at it.

No matter how I write this to filter before playing with XML or try to be 100% safe, my scripts will not be perfect for you.  In fact they may crash your server. Unfortunately, I’m only half kidding, because nothing is 100%. The space is limited to 50 MB, the overhead of writing to this file should be fine since we’re only working with events that took over 5 seconds asynchronously with data loss allowed, so I don’t foresee any issues in most cases unless you change these numbers. However, reading them gets expensive and you have to use the most caution here.  Keep in mind that you can put the .xel files on any server with SQL installed, even developer edition, and query it.

Why’d I write this and what else is going through my head?

The inspiration for this post actually came from a question a former coworker asked me quite a while ago, just seeing if I knew a lot about Extended Events.  At the time the technically correct answer was “I don’t know”, but instead I told him that although I didn’t have experience with Extended Events that he could look at Jonathan Kehayias’s (b|t) blog, specifically his XEvent A Day Series.  However, I also warned him that Jonathan’s work tends to go deeper than a beginner is prepared for and that he should start with his coworker, Erin Stellato’s (b|t), version that is easier to start with, specifically her Making the Leap From Profilers to Extended Events session she made available on YouTube. This actually inspired two blog posts, I just haven’t written the How to Say “I don’t know” Better post yet.

More Links

It is so wrong that I originally posted this without linking to Erin Stellato’s Extended Events Stairway Series on SQL Server Central.  There, problem is fixed.

Have fun, experiment (with queries), and ask questions.

Technical Interviews – How to Thrive

There’s a trick to technical interviews. Every question is looking for integrity first, and intelligence and energy second. This is paraphrasing Warren Buffet, who became the most successful investor out there by becoming the best at interviewing the management of the companies he was investing in.

The Warren Buffet Way

“In evaluating people you look for three qualities: integrity, intelligence, and energy. And if you don’t have the first, the other two will kill you.” -Warren Buffet pp. 172-173

To understand what answers are right you need to understand the motives of the interviewer. Why are they asking these questions, what are they looking for? Yes, you can cram for an interview and hope you get more questions right, but that isn’t what’s going to make or break you. It may help you represent yourself as more of a subject-matter expert than you are, which can be both good and very, very bad. What they’re looking for is someone who can be a good employee a lot more than someone who can currently pass a written exam.

In the end, if you don’t know how to answer a question I know you can go look it up. However, what integrity do you have when you don’t know the answer? This is where all those years you spent in school are going to fail you, miserably. You’ve been taught from a young age that you need to have an answer for every question on the test, even if it’s because you’ll get 25% of them right on multiple choice or 10% partial credit on something else. Interviewers, like your teachers, know the answers they’re looking for before the question was asked. However, interviewers are different in that they give negative credit instead of partial credit.

So, what is this negative credit and why are they giving it to you? Lets look at it from the context of being on the job. I’m a Database Administrator, and I have developers ask me questions from time to time. Many times I know the answers and can help them out off the top of my head, we’re not going to worry about those times. The problems come from when I don’t know or I’m unsure of the answer. Do I have the integrity to tell that developer I don’t know? If I don’t, will they build the world upon inaccurate information and end up with a world that needs to be torn down and rebuilt? One of these answers cost a little bit of time to look it up. I’m not going to try to put an estimate on the cost of the other answer.

Interviewers don’t like it when you know all the answers because of this, and a good interviewer is going to have a question or two they can bring up that no one would know the answer to off the top of their head for this very reason. Yes, they do like it when you know a lot of answers, but more importantly they want to know what type of integrity and mindset you have when you don’t know. The right answer if you’re uncertain is to say you would need to verify it, but here’s my answer and here’s how I’d check it. The right answer if you have no clue is “I don’t know”, hopefully followed up by where you would start to look for answers.

In both cases, you write down the question, look it up after the interview, and be ready to answer it next time. If you want bonus points, and who can’t use bonus points in this situation, you’ll email your interviewer the next day and say that you looked into the question and here’s the right answer. Now you’re acting like the ideal employee. You maintained integrity by saying you either didn’t know or weren’t certain of the answer, which couldn’t have done a better job showing you had the integrity Warren Buffet was referring to. Then you followed up with the energy to gain the intelligence to answer it better.

On the answers you know, be it ones you knew all along or ones you crammed for, you fulfilled one of the lessor criteria that Warren Buffet would look for. On the answers you don’t know you instantly fulfilled the most important criteria if you admitted you didn’t know, then you had the opportunity to throw in the other two criteria on a follow up.

Here’s the problem. Every good employee is a teacher, and you’re better off without a teacher if their lessons are wrong. A good interviewer is looking for a good teacher more than a know-it-all.

Speaking of being a teacher, how do you answer the questions you do know very well? Chances are you know how to answer them in a way that someone else who knows the answer would understand. However, they want to know if you could teach someone else the answer, even if that person doesn’t have your specialty. Make sure you can teach them the right answer in an interview when you understand it well enough, especially if it’s a common question they’ll ask at every interview.

Back to my Database Administrator background, every interview I’m in someone will ask what the primary types of indexes are in SQL. The “right” answer is that a clustered index is the table itself in a sorted order, and a nonclustered index is a subset of that data which is also sorted. This is good, they understood my answer and know that indexes aren’t something new to me. However, if you’re talking to someone on the job who doesn’t specialize in SQL Server, how well are they going to understand and remember this answer?

The answer that will blow them away is that a telephone book is a clustered index which is sorted by the key fields “Last Name” then “First Name”, and has all the relevant information is clustered together in that single index. A nonclustered index is the index at the back of a book where it’s sorted by the key field “key word”, and it includes the page number, which is the clustered index key, so you can look up the rest of the information if the index didn’t tell you everything you needed to know. It’s not uncommon to get the response “Do you teach?” from your interviewer after an answer like that, to which you respond “every chance I get.”

There are other questions you can expect in interviews, too. “Tell me about yourself”, “Where do you expect to be in 5 years”, and so on. I’m not going to pretend to be a good list of all the common questions, but I can tell you how to prep for them. Again, think about it from the interviewer’s perspective.

“Tell me about yourself” is obviously a very vague question. I hate it this question, but it still gets asked. I view it from the standpoint that they want to see how you handle vague questions. Do you just ramble on about your kids and dog or do you clarify what they are asking? Do you have the ability to break this down into something specific and get to the right answer? On my job people will often come to me and say “the server is slow”, which I have to break down into pieces until I can start asking the right questions and answers. Now I’m taking a “why would you ask this” question and demonstrating what they want to see in an employee.

“Where do you expect to be in 5 years” is also every common, and they aren’t expecting you to be a fortune teller here. Also, I wouldn’t expect someone to try to commit to being with my company for that length of time before they’re even offered the job. What else could an interviewer be hoping for? They just want a direction. Are you excited about what you’re doing and have the drive to keep learning? Do you want to make the jump from technical employee to manager, which would start another conversation with the stated goal still being to look for the energy to improve yourself and be a continually better asset to the company. If you don’t know what career you want to have in 5 years then you can’t be expected to have the energy to improve your abilities at your current career.

Here’s some more you should know before an interview, but I’ll be brief since it goes beyond the scope of this post.  When people ask you questions in an interview, do you exude the following?

  • Humble – If someone asks you to rate yourself on anything from 1 to 10 then 10 is not an option.  Even 9 is arrogant unless you’re known around the world for your skill.  Too high and you’re too confident, won’t take criticism, and, my personal favorite, you’re done learning because you think you know it all.
  • Hunger or Drive – You want to keep bettering yourself, both in work and out.  You want something a little above your skill level because you’ll be there soon.
  • Energy – If the interviewer is good, they’ll ask you questions you don’t know no matter how good you were.  Did you come back with a solution after the interview?
  • Communication – How were you able to answer the questions you got right?  Were they just technically right, or did you put off the impression that you could help others get to your level?
  • Leadership – Chances are you aren’t interviewing for a management position, but management isn’t leadership.  Are you showing that you can take an idea and run with it, asking others for help as you go, and making sure it’s right without someone looking over your shoulder the whole time?  Managers want to do more than stand around holding a bullwhip, they want to watch people fly on their own with ground control giving a little direction.

These questions are being asked now in the context of an interview, but you should be asking yourself these things in your continual career planning phase as well. “Tell me about yourself”, can you break down large issues and find the root of the problem, and do you care enough about what you’re doing to get down into the details and root causes? “Where do you expect to be in 5 years”, do you care enough about this career path to make yourself a better you? If you don’t have good answers to these questions for yourself then where do you want to be? Figure that part out and you’ll have a better chance at finding the career you want to do. And don’t forget to consider the integrity of your answers when coworkers look towards you for guidance.  Do you give each answer your best shot off the top of your head, or do you tell them you would have to look it up and follow up with the right answer after you do?  Following this advice you’ll be a lot closer to the ideal employee the interviewers are looking for, then all you have to do is walk into the interview and be yourself.

TSQL2SDAYThis post was already in the works before I found out it was the topic for this month’s T-SQL Tuesday led up by Boris Hristov (b | t) on Interviews and Hiring. This is an amazing coincidence for both of us. For me, it gives me more visibility on this post, which I think a lot of people could benefit from. For you, it gives you a couple links in this paragraph to exercise your hunger to better yourself. Just remember that a lot of what you see is going to be how to get a job, but you can skew it to how to be better at your job and evaluate where you want to be.

Optional Parameters Causing Index Scans

Optional parameters in a stored procedure often lead to scans in the execution plan, reading through the entire table, even if it’s obvious that the right index is in place. This is extremely common in procs that are used behind search screens where, for example, you could search for a person according to their first name, last name, city, state, zip code, gender, etc.

To take a simplified look at this, make a quick and easy table consisting of an indexed identity column for us to search by and fluff to make the row size bigger.

CREATE TABLE Test 
(
    ID Int NOT NULL Identity(1,1) PRIMARY KEY
    , Value CHAR(500) NOT NULL
)
GO

INSERT INTO Test (Value)
SELECT '';
GO 250

INSERT INTO Test (Value)
SELECT TOP (1000) Value 
FROM Test 
GO 100

Now we’ll create a proc to get data. In this proc we’ll get the data several different ways, showing the advantages and disadvantages of each method.  You’ll notice that I put my comments as PRINT statements, which will make it easier for you to read through the messages tab and see what’s going on.

CREATE PROC proc_Testing 
    @ID int = 125
    , @MinID int = 100
    , @MaxID int = 150
AS 

SET STATISTICS IO ON
SET NOCOUNT ON

PRINT 'Clustered index scan, because the query optimizer has to be prepared for both null and non-null values'

SELECT *
FROM Test 
WHERE @ID = Test.ID OR @ID IS NULL

PRINT ''
PRINT 'It''s a small mess, but the query optimizer knows exactly what to expect for each query.'

IF @ID IS NOT NULL BEGIN
    SELECT *
    FROM Test 
    WHERE @ID = Test.ID 
END ELSE BEGIN
    SELECT * 
    FROM Test
END 

PRINT ''
PRINT 'Expand that to two possible parameters and it still looks simple right now, when it''s doing a scan'
SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
    
PRINT ''
PRINT 'Make it optimized with IF statements and it quickly becomes a big mess.'  
PRINT 'Many "search screen" queries will have 20 or more optional parameters, making this unreasonable'

IF @MinID IS NOT NULL BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID BETWEEN @MinID and @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
        WHERE ID >= @MinID 
    END
END ELSE BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID <= @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
    END
END

PRINT ''
PRINT 'However, the query optimizer can get around that if it''s making a one-time use plan'

SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
OPTION (RECOMPILE)

PRINT ''
PRINT 'And again with the single parameter'

SELECT * 
FROM Test
WHERE ID = @ID OR @ID IS NULL 
OPTION (RECOMPILE)

PRINT ''
PRINT 'However, this leaves two nasty side effects.'
PRINT 'First, you create a new plan each time, using CPU to do so.  Practically unnoticed in this example, but it could be in the real world.'
PRINT 'Second, you don''t have stats in your cache saying how indexes are used or what your most expensive queries are.'
PRINT ''
PRINT 'Another option is dynamic SQL, which still has most of the first flaw, and can also bloat the cache'

DECLARE @Cmd NVarChar(4000)

SELECT @Cmd = 'SELECT * FROM Test WHERE 1=1'

IF @ID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID = ' + CAST(@ID AS VarChar(100))
END

IF @MinID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID >= ' + CAST(@MinID AS VarChar(100))
END

IF @MaxID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID <= ' + CAST(@MaxID AS VarChar(100))
END

EXEC (@Cmd) 

PRINT ''
PRINT 'Yes, you can do sp_executesql which can be parameterized, but it gets more difficult.'

Now that we have the data and proc, just execute it. It will be best if you add the actual execution plans to this, which you can get to on your menu going to Query / Include Actual Execution Plan. If you don’t know how to read execution plans then this is even better! No experience necessary with plans this simple, and you’ll start to feel more comfortable around them.

EXEC Proc_Testing

Even with the default values for the parameters the query optimizer knows that you could have easily passed in NULL for the values, and it has to plan for everything. Lets step through the results to see what this all means.  Part of that planning for everything is saying that @ID IS NULL will be used instead of ID = @ID, so the query optimizer can’t say for sure that it can just perform a seek on that column.

In the messages tab we have the number of reads done by each statement courtesy of me adding SET STATISTICS IO ON in the proc itself (bad idea in prod, by the way). For the first one we have 7272 logical reads, saying we read 7272 pages of 8kb of data, so we had ask the CPU to play around with over 56 MB of data to find records that matched.

Looking at the execution plan we can see why that happened. The Predicate you can think about as the “Scan Predicate”, since it’s saying what you couldn’t seek for and had to scan instead. In this case it’s showing the entire “@ID = Test.ID OR @ID IS NULL” in there, because it compiled a reusable plan that might be used with a NULL.

Execution Plan of Query 1

In the next statement we were able to do a seek, doing only 3 reads. This is because there are two levels of the B+tree on this index, so it read the root level, then the next level of the B+tree, and finally the single leaf level of the index that contained the data we wanted.

Take a look at the execution plan here to see the difference with not only the Seek itself, but the Seek Predicate in the details. You’ll see that it uses @ID in there still showing that it’s a reusable plan ready for any ID to be passed in, but the WHERE clause is telling the query optimizer that it can always use a seek to find this data.

Execution Plan of Query2

Now that I know you can read through a couple yourself, I’ll skip ahead to the 6th query where we used the same “@ID = Test.ID OR @ID IS NULL” that caused is trouble before, but did OPTION (RECOMPILE) at the end of the statement. Again it did the same 3 reads I just mentioned still using a seek, but the details of the execution plan looks different. The seek predicate used to say @ID so it could be reused, but the query optimizer knows nothing with OPTION (RECOMPILE) will get reused. Knowing that, it simply converted @ID into a constant value of 125 and was able to eliminate the possibility of 125 being NULL.

Execution Plan of Query 6

If you want to get more into the details, you can look at the XML version of the execution plan by right-clicking on the plan and selecting “Show Execution Plan XML…”. Here’s what you’ll find for the parameter’s value in the first two queries. It shows that the value that was sniffed when it was compiled was 125, and also that it was run this time with a value of 125. If you ran this again passing 130 in then you’d see the sniffed value staying at 125 because it’s not going to use CPU resources to recompile it, but the runtime value will be 130 for that run.

<ParameterList>
  <ColumnReference Column="@ID" ParameterCompiledValue="(125)" ParameterRuntimeValue="(125)" />
</ParameterList>

The OPTION(RECOMPILE) versions don’t look like this though, they’re changed to a constant value even at this level.

<RangeColumns>
  <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[Test]" Column="ID" />
</RangeColumns>
<RangeExpressions>
  <ScalarOperator ScalarString="(125)">
    <Const ConstValue="(125)" />
  </ScalarOperator>
</RangeExpressions>

I hope I left you with a better understanding of what happens when you write code for parameters that could be null. We’ve all done this, and many of us (me especially) have been frustrated with the performance. This is a change I put into production not too long ago that ended scans on a rather large table, and the results were noticed in query execution time, the page life expectancy counter since the entire table was no longer being read into memory, and I/O activity levels since that index and the indexes it kicked out of memory didn’t have to be read off of disk nearly as often.

Let me know what you think, and especially if you have a better way of handling this issue.

Why worry about CXPACKET

TSQL2SDAYMany people see CXPACKET at the top of their waits and start trying to fix it. I know this topic’s been done before, but people ask me about it enough that it should be done again, and again. In fact, Brent Ozar wrote an excellent article about it, and Jonathan Kehayias wrote an article about tweaking it using the cost threshold of parallelism. Even with those, it took Grant Fritchey warning us to be careful about giving guidance to lead me to Bob Ward’s take on it.  After all of these people and more saying CXPACKET isn’t what you think it is, enough people have it all wrong to the point that I felt this was the biggest piece of public opinion to speak out against for Michael Swart’s T-SQL Tuesday #052.

BOL still doesn’t state it completely clear when it defines it as “Occurs with parallel query plans when trying to synchronize the query processor exchange iterator. If waiting is excessive and cannot be reduced by tuning the query (such as adding indexes), consider adjusting the cost threshold for parallelism or lowering the degree of parallelism.” The true meaning of this wait is much easier, with Bob Ward’s explanation of “This wait type only means a parallel query is executing.”

Yes, that’s it, it’s not really all that complex. If you have a MAXDOP of 4 then you could end up with 5 threads. Think of it as having one manager who can delegate work out to 4 subordinates. The subordinates do their pieces of the job and the manager marks down CXPACKET on his time sheet while he waits for them. When they finish, the manager compiles their work (repartitions the streams if you want). There was a little extra work done here because the manager wouldn’t have to combine all the work if he did it all on his own (MAXDOP = 1), and it’s possible the subordinates did more work than they had to. However, in most cases, it does get done faster, and duration is what all your users talk about.

The advice in BOL above saying “consider adjusting the cost threshold for parallelism or lowering the degree of parallelism” is incomplete at best. Upping the cost threshold is exactly what it sounds like, making it so only more expensive tasks can be assigned to those subordinates. Then lowering the degree of parallelism is just dropping how many subordinates you can use.

Cost Threshold for Parallelism is something Brent Ozar makes fun of a bit, because it’s a default of 5 units that are undefined anymore. Jeremiah Peschka recommends setting this to 50 and tweaking it from there. I’d recommend 25 for most OLTP because I like parallelism and my procs aren’t stressed. However, the point is that you should only tweak what you understand, and now I hope you have a descent understanding of this. If you have a lot moderately large queries you wish could break out the work to get it done faster, drop it. If all of your CPUs are running hot and you wish less work was being done repartitioning streams or compiling more complex plans, raise it.

Max Degree of Parallelism is what people really like to tweak, and it’s usually a bad idea. My personal recommendation is to set it to 4 in OLTP (higher in OLAP, very dependent on your load) and forget it unless you have a really, really good reason to change it. The default of 0 is absolutely horrible for OLTP on larger servers, where this could mean that you’re breaking out a task and assigning it to 80 or more subordinates, then trying to put all of their work back together, and killing yourself doing it. This shouldn’t be above 8 unless you really like to be overworked putting your subordinates’ work back together and dealing with those performance problems. However, it’s great to have subordinates who can help you out when needed, so please don’t drop this below 4 unless you have an amazing reason to do so, and only on the query level if you only have one amazing reason.

So, what do you do when CXPACKET is taking over? The trick is to think of it in terms of how a business would work. You have a ton of work to do, enough where your managers are going crazy trying to find subordinates to do it for them. Do you tell your managers to do it themselves OPTION(MAXDOP=1), do you tell them to go out and find as many subordinates as they can OPTION(MAXDOP=0), or do you tell them you’ll see if you can help them find a more efficient way to do their jobs?

If you have the Cost Threshold for Parallelism set to 25 or 50 you have a descent limit before you consider spreading the load around. So, what takes that long for your server to handle? A good portion of the time you’ll find this when you’re scanning a large index or table, or even doing a very expensive seek. If you look at your execution plan is it doing a scan? If it’s a seek that’s starting your parallelism then does it have a small “Seek Predicate” and a lot to do in the “Predicate” in the properties for that operator? It could be that you could use better indexes, it could be that you could find a more SARGable approach, or a variety of other performance tuning opportunities.

The thing is that parallelism isn’t bad and CXPACKET is saying you’re using a good thing. A high CXPACKET could mean you’re trying to use too much of a good thing because you’re “overworking your managers”, but the workload is typically more of an issue than the fact that you let your manager delegate their work when there’s enough to go around. A CXPACKET a lot higher than your baseline could mean that you have poor statistics or parameter sniffing that’s causing a bad execution plan, or it could mean you have a new query that could use some attention.  When you’re talking about parallelism it’s also possible that the statistics were off and one thread (subordinate) ends up doing 99% of the work, making CXPACKET jump up, and that can be found in the actual execution plans; read more about that in Grant Fritchey’s book.

If you have a MAXDOP around 4 (between 2 and 8, reading these recommendations) and Cost Threshold of 25 or higher then I’d be looking into the workload more than the fact that you’re using parallelism. Sure, some tasks just don’t take well to parallelism and there’s reasons to use a MAXDOP=1 hint, but it’s rare to need to take it down to 2 on the server level and even more rare to turn if off by taking it down to 1.

What ever you do, when you have “a big query using parallelism”, try to focus on “a big query” first, it’s usually the problematic part of that statement.

Related articles

Free SQL Server Training (not a sales pitch)

My company cut the training budget this year, and it’s not that uncommon for that to happen anymore. However, I’m still getting my week of training in, and more, without paying much.

Blogs of people who talk too much – FREE

There’s so much for free online, but you’re reading my blog so I don’t have to dive too deep here. Follow a couple blogs in an RSS reader, and make sure you sign up for freebies they may offer like Paul Randal giving away training!

Personally, I try to read a majority of everything written by:

You can include me on this list if you’re feeling generous. However, they’re all smarter than me, and I’ll understand if you don’t.

Local User Groups – FREE

I know most DBAs don’t go to the local user groups. Some never go, some go when the topic is right, but you can benefit from these every time.

Even the presentations that are outside of your normal job duties are valuable. If someone’s teaching SSIS and you don’t use it, knowing what you can learn from an hour-long presentation will put you in a situation to have more intelligent conversations when these topics come up. Read over job descriptions, they expect DBAs to know everything, and here’s your chance to know a little about it.

Also, you can learn from every presentation. I don’t care if you could give that presentation, people in a field known for introverts could use practice and examples of how to present ideas to others. You’ll see good, you’ll see bad, and you’ll be better because of it. Also, one or two small details you might pick up on a topic you know extremely well will make a big difference.

Speaking of our field being known as introverts, this is your opportunity to step out of your comfort zone and present. That’s what I did. Some present because they were born to do it, and I started presenting because I was born to blend into a cube farm! I’ll be presenting at my local user group in April, have already done a SQL Saturday this year, and would like to do at least two more SQL Saturdays, which drags me (kicking and screaming) into improving my soft skills where a typical DBA, like me, is behind the curve.

Our last presentation by Bill Wolf was an excellent presentation on avoiding common coding mistakes. It wouldn’t have been too much of a stretch for me to make a similar presentation, but I still learned a couple things and met some good people. The next meeting on March 25th will be about the interdependency between SQL Server and .NET, which would be amazing for any DBA or .NET developer to go to, even if Dell wasn’t giving away free laptops as door prizes!

Finally, the presentations aren’t the real reason you’re going there, are they? You’re getting to know the other DBAs in the area, and you’re sitting there talking person-to-person to the ones that want to share everything they know with you. You get to know them, they get to know you, and when one of you wants a new job or new coworker then you have a better chance at finding a good fit that you can be comfortable with from the start.

If you’re not sure where your local user group is or if you even have one, start by looking at this link.  Most groups are affiliated with PASS, and that means they’ll be listed there.

SQL Saturday – $10 Lunch

SQL Saturday is a cross between the local user group and the PASS Summit. In fact, the last one I was at has been referred to as a mini PASS Summit because of the number of high-quality speakers there, including multiple names from the list of bloggers I mentioned above.

The cities that host these events typically do so as an annual event, and there’s probably one not too far from you each year. I live in Pittsburgh, PA, which hosts one just a couple minutes from my house. I’m also trying to make it to ones reasonably close to me such as Cleveland, OH a couple weeks ago that’s about 2 hours away, Philadelphia, PA coming up in June that’s a 5 hour drive where I have family I can stay with. I think I’ll skip Columbus, OH because it’s the week after Philly and I have three small kids. Although I won’t make it this year, I may go to Richmond, VA, which is a 6 hour drive where I also have family to stay with.

It’s been a cold winter here in Pittsburgh, and now I’m starting to think about how much I miss my relatives in Texas, California, Florida and South Carolina. It knocks this idea out of the free or low cost training blog post, but that doesn’t mean it’s a bad idea.

While my company did cut the budget, they still paid mileage and gave me Friday off as a travel day. Even without a budget, still ask to see what you can get.

Virtual PASS Chapters – FREE

Pass also does virtual online events. I know I’m risking making this sound like a PASS commercial doing the local PASS chapter, the PASS SQL Saturdays, now the Virtual PASS chapters, but it’s hard to talk about free or low cost SQL Server training without talking about these people. So, why would I avoid this sounding like a commercial for them? They deserve to be known, and with all this free junk they can’t have much of an advertising budget!

Anyways, the virtual chapters each have their own focus, and I’m a member of the PowerShell, Performance, Database Administration, and I think a couple others. They do presentations about once a month, and I try to watch them when I have time. Worst case, they usually make the recordings available later.

There are also a couple “24 hours of PASS” events throughout the year where there are 24 back-to-back sessions. The next one on the schedule is the Russian version, which I’ll pass on.

BrentOzar.com – FREE

In addition to their blogs, the Brent Ozar Unlimited team (Brent’s just a member, not their leader!!!!) does weekly webinars that are usually very useful to me. I’d say I watch 3 of them a month, missing one either due to my schedule or because it’s a rare one that I don’t feel bad about missing. They put up the recordings later, although I have to admit I usually don’t watch them if they aren’t live. I know I don’t make any sense, but you’ll get used to it.

Twitter – FREE

Seriously, if you can fit your question and #sqlhelp in 160 characters or less, you’ll have someone pointing you in the right direction within minutes. They’ll point you to the posts that you didn’t know what key words to Google to find, but keep in mind that they’re doing this for free. Always be grateful, and understand when they say that digging into the issue would mean a trip to the consultant.

The End – Sad, but still free

So, yeah, my company cut my training budget. That means that so far this year (March 3rd here) I had to stick to learning from Steve Jones, Grant Fritchey, Erin Stellato, Stacia Misner, Tom LaRock, and others in-person. Then I watched free videos by Jes Borland, Jeremiah Peschka, Brent Ozar, Kendra Little, and Doug Lane. I have a free month of the SQL Skills Pluralsight classes with a code that Paul Randal personally emailed me, with which I’m in the middle of a class by Jonathan Kehayias right now. I’m planning on taking full advantage of this trial with classes by Joe Sack, Erin Stellato, Paul Randal, Bob Beauchemin, and Kimberly Tripp, although I’ll end up running out of time and buying a subscription for $30 / month.

So, is a lack of a training budget stopping you? You can say it is, but it’s not.

I gave you a lot and I missed a lot. This is not all-inclusive, and a quick look around the internet will prove that point. There are still Free e-Books and many other resources out there. I have to start skipping stuff or this post will never make it out to you, and what good would that be?

Hey, look!!! There’s a section right below here marked “Comments” that’s perfect for your addition to the list. Maybe it can be all-inclusive!!!

The Approachable DBA

DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate. It’s common to think of DBAs and developers being polar opposites that don’t speak to each other. This is wrong! It’s not one of the database-themed answers of “it depends”, this is just flat wrong.

DBA with a bat

I may look calm, but in my head I’ve killed you 3 times.

The issue with this is that when you take a polarizing position you get a polarized result. Take Grant Fritchey’s use of his Hickory Stick of Education for example. He goes on a rant about databases needing to be backed up, which, as protector of the data, is the proper view to have. However, the harsh approach many DBAs take to this role, with this being an exaggerated extreme, alienate their coworkers.

His coworkers on the other end of this conversation, we’ll call them developers to go with the theme here, will take one of two initial views. “I better do what he says, or he’ll beat me”, or “I don’t need this, I don’t need him, I have my reasons, and I’ll do it my way without discussing the details with him.” Unfortunately, both of these answers are wrong, and the ensuing conversation with HR may also be unpleasant.

Although they’re opposite responses, they’re both wrong for the same reason. Neither developer is going to discuss the details. The first developer is going to start backing up the database at any cost, but is a full backup and restore the best recovery option? The second developer isn’t going to back anything up, but are pieces of this database unique and need to be recovered in a disaster? After all, you’re responsibility as a DBA isn’t making sure you’re backing everything up, it’s making sure you can put it back to where it was in accordance to the written agreement you have with the business owners. In this case it could be data that is pulled from 30 different locations and the best option for recovery is to pull that data again, or you could have 90% static and the best option is to have a static backup of one filegroup and nightly backups of another, or…talk to the developer.

The number of details they tell you is directly proportional to how approachable you are. If you use a polarizing approach that sets your approachability to 0 then you’ll get 0 details, even to the extreme that people will try to spin up production servers without your knowledge. I’ve seen it happen. I’ve also seen those same developers come back later and ask us to restore something that was never backed up because we never knew it was production. You can’t allow the recovery conversation to be the first time the developers talk to you because that will end as well as trying to do a restore without a backup. It’s your responsibility to make sure there is communication going on.

So, would I hire a DBA with a reputation for not working well with developers? Maybe, after a long conversation about how to work well with others, not as my only DBA, and it would take a lot more in other areas for to convince me I was making the right choice. I need to have a DBA that will talk to the developers, go out to lunch with them, and have the developers accept someone protecting our data as one of their own. When they get to that point the developers start asking questions that help them use the database as smoothly as possible and the DBAs start asking questions to find out why the bad practices are so tempting in the first place.

Stop and think about this. Would I hire you? I would hire Grant because I’ve met him outside of an interview and feel his is much more approachable than his nicknames and this blog post. Just look at his posts directly before this post where he’s acknowledges the issue saying “I’m pretty sure most people don’t talk to their Dev teams if they can help it, and it does seem like most Dev teams seem to be on less than a perfectly chatty basis with their DBAs”, and after this post where he mentions the struggle to resolve this issue stating “I spent years at my previous job trying to do a better job interacting with the development teams.” The thing is, Grant’s an odd one; he’s a SQL Server MVP which means I can find more than one piece of information about him. Does your future manager know other details about you or is it possible that one rant is the only piece of your reputation they’ve heard? Chances are, if I ever heard pieces of this about you then it’d be an uphill struggle to get to an interview, let alone a job offer. Even worse, IT is a small world, and if you live in the same city as me then I can find someone who has worked with you before that’s not on your list of references.

Now think about it from the side of your current job. Yeah, you just kinda won the fight with making sure this database is backed up, but the developers are starting another project as well. This time they’re making a new database from scratch, didn’t involve you, and it’s an ORM reading and writing directly to tables. Everything’s a clustered index scan or table scan, but it works great with 1,000 rows of sample data. There are no cases where they’re reading from a view, no writes are being done through a proc, and they’ve built a lot of stuff on this foundation before you knew any details of the project. You know if you were involved from the start that it would have been done better from the start, but, still licking their wounds from their last conversation with you, the developers avoided you as long as possible. There’s a big mess, it’s your own fault, and a lot of people are going to be paying for it for a long time.

I’m not asking for you to be a pushover. Instead the point is to discuss the situations, find all of the details, give them proof of your logic when you differ in opinion, educate them as you accept their invitations to educate you, and go out of your way to be viewed as approachable. The result you’re going for is the same, which is recoverability of data, and I’m not advocating letting off until you achieve that goal. However, that’s not the only interaction with the developers you’ll ever have, and this approach will help you “win” this interaction quicker at the expense of not being able to work together through every other interaction. The only thing I’m going for is that DBAs as a whole need to develop the proper soft-skills to work with developers. This is going to help a lot of people in the long run, including you, your family, the developers, and me.

You: Your job will be easier. Things that would be issues will be handled before they’re big issues, and a lot more things will be set up right from the first place. Your reputation will be where you want it, allowing you to enjoy the job you have and get the job you want.

Your Family: You’ll be coming home from work less stressed, actually enjoying your interactions with developers. Beyond that, you’ll be more and more desirable, getting better and better jobs, bringing home a better income to help provide for your family.

The Developers: Problems are avoided, they have a mentor on the database side, and going to work is more enjoyable.

Me: I get really tired of this whole “Us vs. Them” mentality, hearing DBAs talk and only half jokingly say “and maybe even, ‘gasp’, talk to your developers”, and developers, rightfully so, assuming DBAs their enemy from the start. If for no other reason, all of the DBAs in the world should be more approachable to make my life easier!!! The only thing you bullheaded DBAs do is give developers ways to slow down! Why do I always have be here to fix your problems? The whole profession is nothing more than a bunch of socially awkward nerds on a power trip going out and telling developers “NO” with a smile on their face every chance they get. Who needs you?!?!

You know what? The developers that don’t talk to DBAs are usually right in their logic. Why ask someone a question if you know the answer is no, you just don’t know how you’ll get belittled on your way to that answer? In many cases developers aren’t talking to their current DBAs because of experiences at past jobs, but it’s still your problem to fix.

I’ll end this post the way I started it. DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate.

ApproachableDBA.com

Related articles

Cleaning up the Buffer Pool to Increase PLE

Chances are you have extra information in the buffer pool for a bad query and it’s dragging down your PLE, causing SQL Server to run slower because it’s reading more from disk. Although this approach is taking a 180 from my post Fixing Page Life Expectancy it has the same effect, with that post focusing on fixing your worst queries and this one focused on fixing your most misused indexes.  One approach doesn’t replace the other, it’s more like burning the candle at both ends, except that you end up with a better running database instead of getting burnt out.

With this approach we start with what’s in the cache.  You’ll see the same types of issues in almost any database, and this just happens to be a production database I’m going through today.

ScreenHunter_01 2014-01-06 14.52.28

The results were found with this query:

SELECT count(1)/128 AS cached_MB 
    , name 
    , index_id 
FROM sys.dm_os_buffer_descriptors AS bd with (NOLOCK) 
    INNER JOIN 
    (
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id)
            --name = 'dbo.' + cast(object_id as varchar(100))
            , index_id 
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK) 
                ON au.container_id = p.hobt_id 
                    AND (au.type = 1 OR au.type = 3)
        UNION ALL
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id) 
            --name = 'dbo.' + cast(object_id as varchar(100))   
            , index_id
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK)
                ON au.container_id = p.partition_id 
                    AND au.type = 2
    ) AS obj 
        ON bd.allocation_unit_id = obj.allocation_unit_id
WHERE database_id = db_id()
GROUP BY name, index_id 
HAVING Count(*) > 128
ORDER BY 1 DESC;

Service Broker Errors Table

First, there is an errors table here with Service_Broker_Errors, and that should never be in the top 10. What are we doing wrong? The index ID of 1 tells me right away that this is a clustered index, the fact that the table is at least 1.5 GB tells me that we’re probably not purging old data, and 1.5 GB in memory on this table is hinting that we’re probably doing a clustered index scan.

I’ll start by looking at the proc cache to see what’s going on. There’s only one execution plan that used that index and it is, in fact, doing a clustered index scan as I expected.

ScreenHunter_01 2014-01-06 15.24.05

Predicates or Seek Predicates

In an execution plan you’ll have a seek predicate and just a plain old predicate. The seek predicate is what you were able to do taking advantage of the index being in order, and the predicate is what you had to scan for.

ScreenHunter_01 2014-01-06 15.26.02

This case is easy because we’re only searching by a single column, but others could have both a seek predicate and a predicate. For instance, if I had an index on my customers table with the key columns Active, First_Name, Last_Name then searched where Active = 1 and Last_Name = ‘Hood’ then it will show up as an index seek with a seek predicate of Active = 1 and a predicate of Last_Name = ‘Hood’. Anyways, lets get back on topic with the issue of my Service_Broker_Errors table…

Now this sounds like a reasonable query looking for errors. I’m looking for the errors that occurred in the last so many days. The CONVERT_IMPLICIT(datetime,[@1],0) shows me that someone typed this the lazy way of GetDate()-1, which isn’t as efficient as DateAdd(Day, -1, GetDate()), but you’re getting me off topic again.

Fixing a useless index

Looking at the indexes on this table I realize there is only one, and it has the single key column of ID. For uniqueness you can’t do much better than an ID column, but you have to ask yourself if you’ll ever use it to query by.  In this case the index has never had a seek against it, only scans.  Although there table is rarely queried with only 4 uses in the last 2 months (I limited my historical data for this query), it’s still pulling 1.5 GB into cache for every use.  After a couple seconds of shaking my head I start to create a change request to make add TimeStamp in as the first key column in the clustered index.

ScreenHunter_01 2014-01-06 15.35.27

However, I then remembered that this table is 1.5 GB. Is that right? It’s an error table, so if it’s really 1.5 GB then I should either be cleaning up old data or there are so many problems that there is no reason I should be spending my time tuning. Seeing that it has 0 updates in the last 2 months, I already know it’s old data.  To double-check on this I run a simple query, keeping in mind the ID is still the clustered index, to find the oldest record, and discover that we haven’t removed anything from this table in years.

SELECT timestamp
FROM Service_Broker_Errors
WHERE id = (SELECT Min(ID) FROM Service_Broker_Errors)

So I have to talk to people about how old an error can be before we say we just don’t care. It was determined that we probably don’t care about anything more than a month old. I’m paranoid, it comes with the job title, so I made it three months with my change being this:

DECLARE @ID Int

SELECT @ID = MAX(ID) FROM Service_Broker_Errors WHERE TimeStamp < GETDATE()-90 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (10000)
    FROM Service_Broker_Errors
    WHERE ID <= @ID 
END 

IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Service_Broker_Errors]') AND name = N'PK_Service_Broker_Errors') 
    ALTER TABLE [dbo].[Service_Broker_Errors] 
    DROP CONSTRAINT [PK_Service_Broker_Errors] 
GO 

ALTER TABLE [dbo].[Service_Broker_Errors] 
ADD CONSTRAINT [PK_Service_Broker_Errors] PRIMARY KEY CLUSTERED 
( 
    [TimeStamp] ASC
    , [id] ASC 
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] 
GO 

The reason I did it like this is because I don’t want to fool around with the junk values as I drop the clustered index (rebuilding the table) then recreate the clustered index (rebuilding the table) with all of that information in there, then delete it all and wreak havoc on the new index. Sure, the fragmentation at the page level would be fine since I’m deleting in order, but that’s a lot of changes to the b-tree.

Finally, I add these lines to my nightly cleanup job to keep things trimmed up from now on:

DECLARE @BatchSize Int 

SELECT @BatchSize = 10000 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (@BatchSize)
    FROM Service_Broker_Errors
    WHERE TimeStamp < GETDATE()-90
END

I know, I just got done saying that GetDate()-90 was less efficient than DateAdd(Day, -90, GetDate()), but it’s an implicit conversion that’s done once per call. I admit this is a bad habit of mine, and any time there’s even a chance of this being called once per row I have to write it differently than I normally do. Nobody’s perfect, and the most important thing is that I’m getting rid of all those records using batches, right?

In Conclusion with Service Broker Errors

Now that all of this is done I took a 1.5 GB table with all of it in cache to a 20 MB table with only 1 MB in cache. Cleaning up the data had more of an impact that my index change, but it’s usually not that easy to clean stuff up. Even if I wasn’t able to clean up the data, the index change alone would have allowed for the cache usage to be decreased by at least 1.4 GB.

On to the next one

That’s not the only issue I’m seeing here, in fact I bet I could take half of these off the top 10 list. However, today I’m picking up low-hanging fruit and moving on to the next task on my list. In this case, External_Messages is the next one that doesn’t look right to me. Once again, it’s a clustered index (index_id = 1) of a table that isn’t part of the primary focus of the database, which is handling orders.

Starting with the same steps I look in the proc cache to see what put this here. Once again I see a single query in cache referencing the table, but this one is different. It’s the typical IF X = @X or @X IS NULL that we’ve all written, and it’s being used as the predicate.

ScreenHunter_01 2014-01-06 16.34.32

I check the indexes on the table and it has the ID as the only key column of the clustered index, so that’s not an issue. Why isn’t it able to use the clustered index with a seek predicate? After all, I’m rather certain that they almost always, if not always, call this with a parameter where it should be filtered down.

Lets look at the execution plan to make sure. The sniffed parameter value from the first time it was called, as found at the bottom of that statement in the XML version of the execution plan, is, in fact, a non-null value.

ScreenHunter_01 2014-01-06 16.39.02

However, SQL Server can’t guarantee that you’re going to pass it a non-null value, and it has to make an execution plan that can account for either possibility. I’ve seen this before, so I know the basic options for a single optional parameter (there are more options, with increasing complexity). I can either add OPTION (RECOMPILE) to the query or I can rewrite it to be two separate queries.

OPTION (RECOMPILE)

Here’s the option I didn’t choose. This will recompile the statement every run, which isn’t too much of a problem because it’s a quick compile on something that only runs a couple times a day. However, it will make it so this doesn’t show up in the proc cache anymore, which I take advantage of quite a bit (for instance, look at the blog post you’re currently reading). Also, it goes against my rule of using hints as a last resort.
That’s not saying you can’t do it this way, just that I didn’t. The reason it works is because SQL Server knows when it makes this execution plan that it will only be used once, so it doesn’t have to account for the possibility of it being a NULL value next run. In fact, it just makes the execution plan with your parameter as a constant value.

ScreenHunter_01 2014-01-06 16.51.09

Two queries

Here’s the method I decided to go with. Assuming third-normal form doesn’t apply to query text, you should be good to go here. If it’s simple enough then it will be very obvious to anyone editing this at a later point that they need to make any changes in two places.

CREATE PROC NotTheRealProcName
    @id INT = NULL
AS

IF @id IS NULL BEGIN 
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
END ELSE BEGIN
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
    WHERE ID = @ID
END

This solution isn’t too complex with just a single parameter as it only creates two queries, but if you get just three parameters and try to do this then you’re up to 8 queries. The OPTION (RECOMPILE) method starts to look pretty good as a quick and easy fix before too long. I would still prefer one of the long and drawn out methods, such as getting interim results to a temp table, making it look a little more complex, but getting it to run efficiently with a cached plan.

It worked as I expected, with no one actually getting all of the IDs. I left the code in place to make it possible to get all of them to avoid digging through all of the application code to find where it could return everything then make sure it gets changed there. The result I was looking for was cleaning up the cache, which I got when this dropped from 1.4 GB down to 25 MB in cache.

The End

Overall I was able to drop about 3 GB out of cache, making room for other data while raising the PLE, in two simple changes that were rather easy to track down just by saying “that doesn’t look like it belongs here”. Even the ones that do look like they belong there probably don’t need to have that large of a presence in the cache.

Although I didn’t actually measure the impact that it would have on the end users in these cases because I didn’t start with a performance complaint, I would imagine that SQL Server tends to run faster returning processing 1 row it found quickly instead of reading through a couple million. So, measured or not, I’ll count that as a success as well.

PS. Jonathan Kehayias helped me

My queries looking into the proc cache are slightly modified versions of Jonathan’s work on his blog. Honestly, it didn’t need tweaked or touched for any reason other than the fact that I learn by tinkering. I did help myself by adding in filtering by database because I have a couple servers that have multiple copies of the same database, but, again, it worked great before I touched it, which is something you should expect from Jonathan’s work.

Here’s my altered version:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @IndexName SYSNAME = '[PK_ExternalMessage]'; 
DECLARE @DatabaseName SYSNAME;

SELECT @DatabaseName = '[' + DB_NAME() + ']';

WITH XMLNAMESPACES
   (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    n.query('.'),
    cp.plan_handle,
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(./IndexScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./IndexScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./IndexScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    i.value('(./IndexScan/Object/@Index)[1]', 'VARCHAR(128)') as IndexName,
    i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./IndexScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS seek_columns,
    RIGHT(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'), len(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)')) - charindex('.', i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'))) as Predicate,
    cp.usecounts,
    query_plan
FROM (  SELECT plan_handle, query_plan
        FROM (  SELECT DISTINCT plan_handle
                FROM sys.dm_exec_query_stats WITH(NOLOCK)) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
      ) as tab (plan_handle, query_plan)
INNER JOIN sys.dm_exec_cached_plans AS cp 
    ON tab.plan_handle = cp.plan_handle
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
CROSS APPLY n.nodes('.//RelOp[IndexScan/Object[@Index=sql:variable("@IndexName") and @Database=sql:variable("@DatabaseName")]]' ) as s(i)
--WHERE i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') = 1
OPTION(RECOMPILE, MAXDOP 1);

Indexing Fundamentals

First, lets understand what the types of indexes are. If you take a reference book, you have the two types of indexes. The book itself is a clustered index, it contains all the data and has the key field of page number. The index in the back is your nonclustered index, it’s in order by the key field of subject, then has the implied second key field in there for page number, which nonclustered indexes in SQL Server have as well. On top of that, the table of contents at the beginning of the book is just like a b-tree built on every index that helps you get to the right page faster.

The key fields I mentioned are the fields everything is ordered by, just as you would expect a book to be ordered by page number. Nonclustered indexes also have implied key fields, which are all of the clustered index’s key fields that weren’t explicitly made key fields already, which is a great reason to keep your clustered indexes as narrow as possible in most cases, which means they have fewer columns.

If you have a phone book then you have a clustered index with the key fields last_name, first_name, address, phone_number. That means it’s ordered by each one, so any duplicate last names are still in order by first name, and so on. It’s wide for a clustered index, but doesn’t have any nonclustered indexes to worry about, so wide is fine here.

However, the important thing to notice is that if I asked you for people with the last name “Hood”, you’d find them rather quickly. If I asked you whose phone number is 412-555-6789, you’d cringe. The same thing happens in SQL Server where it will do the work you ask of it, but it won’t be cheap, quick and easy. This is why the internet has reverse phone number lookups where they have a nonclustered index on phone number which lets you find the name quickly by just knowing the number.

Indexes can also have included columns which are stored in the index, but no sorting is done on them. These fields don’t even exist in the b-tree for finding the information, only the page level where everything is stored. To simplify the process, this is where you put fields that you have in your SELECT clause but not your WHERE clause since don’t need it sorted to search quicker, you just need it to fulfill your query’s needs.

How are they used, and what do they cost?

A nonclustered index is used exactly the same way you read a reference book. You look up the key field in the back of the book and see what information is there. If you covered all the information that you needed just by looking at that index then you’re done. However, if you want to know more that wasn’t listed in the index then you take the key field of the clustered index (the page number) and do a key lookup by flipping to those pages and reading more about the subject at hand.

If you noticed, reading all that information in the back of the book seems pretty simple and efficient, especially when all of the information is there so you don’t have to touch the book itself. If you don’t have to touch the book then that’s called a covering index, which can be accomplished by adding the fields you want to retrieve to an index as either included or key columns. Depending on how many of these key lookups you have to do and how much extra information it has to get from the clustered index, it may make sense to make your index a covering index.

However, none of this is free. Every index you make has to be maintained. If you ignore filtered indexes, which I’ll get to soon, every time you insert or delete rows on a table then you have to write to each index. So if you have 6 nonclustered indexes, it doesn’t matter if you have a clustered index or a heap (merely a lack of a clustered index), you have to do one write to the table itself and one write to each nonclustered indexes, so 7 writes for that one row.

Add on top of that maintaining those indexes for fragmentation, extra storage needed on disks and backups, and more data that can possibly be held in the cache, and this stuff really starts to get expensive.

How many indexes should I have?

There’s no magic number of indexes you should have on any single table. Too few makes it hard to find the data, and too many makes a maintenance nightmare with data modifications taking longer. What indexes and how many will vary wildly, but there are methods to figure out what’s appropriate for each table. Chances are, many of your tables will benefit from both dropping indexes and creating them.

Consolidate what you have

First, look at the indexes themselves. If you have indexes that have the same first key field as the clustered index then you can consider removing those indexes with little to no detrimental effects. This is regardless as to if these indexes are being used or not.

Now, look at what indexes are never used. This can be done by looking at sys.dm_db_index_usage_stats to get you up to a month of data (assuming a monthly reboot for patching), or, my preferred method, by querying your tables you have set up that monitor this DMV. If an index is never being used it means it’s never helping you, but it is extra weight that’s adding up and slowing your database down. Just make sure you have enough data to say that the index really isn’t needed before relying on this alone.

Also, look for patterns in the indexes themselves by comparing the first couple key fields. If the couple key fields are the same, these are candidates for having the indexes combined into a single index. Depending on how quickly those key fields make the data semi-unique, you may be talking one to three key fields matching before I’m itching to reduce the number of indexes. You can do this by taking the columns that are in the second index that aren’t in the first and adding them to the first index as included columns, then drop the second index. Queries that used to hit either one of these indexes should run slower by an unnoticeable amount, yet you’re only storing data, doing updates, and maintaining a single index instead of two.

When you’re going through this process you just have to be careful that you understand the implementations of dropping a unique index. It is rarely a good idea to get rid of a unique index unless that uniqueness is enforced by other means as well.

Take this example.  I have a heap with a unique index on Col1 that gets scanned, but no seeks.  Also, there’s a NC index on Col2 that has a lot of seeks then is obviously doing a lot of lookups to get more information from the heap.  In this case I happen to know that the “unique” part of the index on Col1 is because that’s an identity column, so I’m not too concerned about maintaining the unique constraint here.  After seeing the usage stats over a month I’m able to restructure this table.  Both nonclustered indexes were dropped, a new unique clustered index was created on Col4 and Col1.  User Lookups were eliminated since it’s all just a clustered index now, we saved about 700 MB on disk, and the compressed equivalent on backups as well.

Heap_With_Unique_NC_Identity

Take another example where you can see that an index is duplicated.  No, none of these are 100% the same, but the second index is the unique (is_unique field), clustered (* for included columns, it’s just my way of doing it) has a single key column on Col1, which is the same as the first nonclustered index.  Seeing the usage like this, I’m going to drop the first index, saving 435 MB of space in my database.  This will have a negative impact in that it will move 626 index scans done throughout the month to an index that’s twice the size, and that’s a problem I’ll revisit later when I’m tuning queries that have high physical reads.

Notice you can also see two issues on this table that I’ll need to look into.  First, I have a nonclustered index that’s larger than my clustered index, something is wrong here.  Perhaps I should have put the fill factor on here to show that wasn’t the issue.  The real answer is that I’m working on a table that somehow got missed on the index maintenance, and I now know that this is possible and needs to be fixed.  Another issue is that there is a filtered index on here that has about 0.1% fewer rows than the unfiltered indexes, which means that you’re getting no benefits while alienating queries that may have used this otherwise.

Unused NC2

And one last example.  The 2nd and 3rd indexes both have a single key field on Col2 and just vary on the included columns.  This is a simple task of combining the two indexes by making one of them include Col1, Col2, Col6, Col7, Col8, and Col9, then dropping the other.  That saves us about 200 MB on disk, and avoids us having duplicated data in the cache.  Ask a SAN or backup admin and they’ll tell you the disk savings is great, but ask an unrealistically informed user and you’ll realize that cache savings is what makes a difference dropping your physical reads, upping your PLE, and making your app runs faster.

Then, there’s another simple task on this table of dropping the final index, which is nothing more than 112 MB of dead weight that was being updated an average of 30 times a day.  Not much to say here, they obviously didn’t use this as much as they thought they would.

Finally, you’re left with the info that User_Lookups is almost as high as the total number of seeks and scans on the nonclustered indexes.  This means that almost every query that hits the nonclustered indexes need to do a key lookup on the clustered index.  Knowing that, if I was going to get to that level of tuning, I would be looking into the queries that hit this table and determine if I want to add more included columns to actually cover the query or admit that I’ll never cover it efficiently and remove some included columns.  My answer here will vary partially depending on how many records are involved – key lookups are ok for 10 rows, but not so much for 1,000,000 rows.

Duplicated Index

None of these are hypothetical situations I forced AdventureWorks to do.  This is a real, production database, and I wrote two queries against it.  One to help me find these problems so I can put a change request in to fix them, and a second query that does the exact same thing while obfuscating the names to protect my employment.  So when you see the duplicated indexes or the sizes I’m working with, you can’t say that would never happen in production….it just did.

You reduced the load, now add to it

Now that you have fewer indexes in your database, it’s time to add to them. There are several ways to figure out what indexes you want to add, and I prefer to do everything manually. That means there’s no DTA (Database Tuning Advisor) here, it’s all scripts with proof that what I’m doing is making a difference.

Where are you finding the queries to tune

There are a couple approaches to finding the queries in most need of attention. Query Stats will give you the queries that take the most resources during your peak times, which will reduce the load on your server the quickest. Traces on duration will give you the queries that take the longest along with sample calls with all the parameters you need. Blocking monitors will give you the queries other queries are waiting on the most. Then, one of my favorites, job shadowing gives you what the pain points are for the users.

The first thing I look at is blocking, since that silently sits in the background making queries that could run extremely fast sit there and wait for a single slow query. If there is a lot of consistent blocking then take the head blockers and start tuning them. This won’t eliminate the fact that they’re grabbing locks, but it will cause those locks to be released quicker. Locks by themselves are not an issue and shouldn’t be eliminated, but the impact the locks have on other queries are an issue that is best to be handled this way. If you don’t have many locks that go too long, you can skip this and revisit it later. I say “too long” instead of giving you a number because it varies a lot; 5 seconds is too long for a web app while a minute isn’t much for a data warehouse.

Once that is out of the way, I’ll start diving into the most expensive queries on the server. They may not be the ones with the longest average run-time, but they add up to the most on the server. If you can cut the reads from 100,000 to 1,000 on a proc called every second then you’re doing more for the server than you would by tuning an hourly query that does 10,000,000 reads. Also, focusing on physical reads will help raise your PLE if that’s a concern on your server, which it usually is. Don’t stick on this for too long, you’re just trying to get the easy tuning out of the way and hopefully change the look of your Top-10 most expensive queries list a bit.

Now I move on to looking at the traces to see if any huge queries are running that are easy tuning opportunities. Again, skim through it, pick up the low hanging fruit and move on. Right now you’re still in the phase of reducing the load on the server, but you may not have hit what the users see yet.

Another place to look is in the cache to find out what indexes are taking up the most space there. If you know your data well enough then you should know what to expect from your cache. What you see there will probably surprise you, and you’ll probably see an Index_id of 0 or 1 (Heap or Clustered Index) sitting there taking 10 times what you would expect it to do. These are cases where there are clustered index scans taking place that you need to find. I was recently able to clear up 12 GB of cache on a server doing a full index scan on a 12 GB table where a 10 MB filtered index was able to do the job, which made PLE almost double.

So, now that you have the server running a little (or a lot) better in general, I get into the work that never ends. Get a hold of the people using the app, either by yourself or asking customer service to arrange it, and find out what they hate the most about the app’s performance. What you need here is for someone to write a list of when anything took a while, what screen they were on, and approximately how long it took. Don’t go requesting anyone to use stopwatches, you’re looking for “a couple seconds”, “about a minute”, or “I click OK then go get a cup of coffee.” Your goal here is to be able to correlate their pain points to something captured in your traces, giving you the procs they’re using along with real-life example calls to dive into.

Something you may find odd is that I’m diving into the part that affects the end users the most last, but I do it on purpose. The other pieces are things I skim through and do what I can in a week or so, but they’ll make the bigger impact on the server in general along with resolving some of the customer’s complaints in the process. In the process, you’re learning more about the server and how everything goes together so that when you start talking to your customers you can make a more immediate impact on their complaints.

After getting through list once I let my ADD take over, and man do I have ADD. The customers have an idea of what a DBA can do for them now to the point that they may even be contacting you directly when something annoys them, developers realize a SQL Server specialist can be a good person to lean on, and your server’s biggest pain point hurts a lot less. Once you get to this point, jump around on this list on your servers and take up a query or two when you feel like it.

New Job Checklist

I started a new job a month ago, but didn’t want to just figure things out as they came to the surface. Instead, I walked in with a list of questions I wanted to know about their environment. For both my side and theirs, this went over very well.

For those used to my blog, you’d expect all the performance monitoring to be a part of it, but that’s not all there was to it. In fact, the most important questions had to do with backups, recoverability, and nonprod environments. Lets go through it all in the order that I feel it’s important. This is already a long enough list with just saying what to watch for without getting into the details, and I’ll do deep dives on each section either over time or upon request.

Staying Organized

Do we know what SQL Servers are out there and which ones are prod? Also, who do we contact if it either went down on its own or if we’re going to intentionally take it down for maintenance? For when you do take it down intentionally, what’s the typical maintenance window for each server? Also, what builds of SQL Server and Windows are you running, including SQL Server build number and architecture for both Window and SQL Server? There are a lot of things we should know about the servers we’re working on, right up to basic hardware stats with current and max RAM and the number of cores for licensing. Know what you’re working with and make sure you have all the access you need on all the servers you’re supporting.

Backups and Recoverability

When was the last time each Tier-1 system was recovered from offsite backups? I’m looking for two things here. First, that they are being backed up offsite. Second, that we know those backups are working by testing them regularly. I don’t care if you use tape or disk, and I don’t care if the “test” was an emergency restore to get data back that someone accidentally deleted a week ago. Do they exist and do they work? In my case, I found a bit of a surprise here and ended up getting in the middle of a business continuity plan as seen on last week’s post,Disaster Recovery and Business Continuity

Continuing on with Business Continuity, I wanted to get the typical misconceptions out of the way. So I continued with the questions of how long it would take to recovery from different types of disasters and what does the business think those expectations are. Also, how much potential data loss does the business think is possible compared to the current recoverability? I know from my last blog post that some people feel that I’m overstepping the boundaries of a DBA with some of this, and I’m ok with that. My immediate boss is the VP of Infrastructure, and I think that people would do much better trying to view things from their boss’s point of view instead of their current position’s point of view.

Security

Now that I have basic recoverability covered, I moved on to security. First, most companies have too many people with access to prod along with sensitive data in all environments. This is easy to overlook. First thing’s first, protect prod at all costs. Who has what permissions, specifically, who has sysadmin? It used to be BUILTIN/Administrators, and your old boxes will still have that default, often unchanged. Don’t trust them to tell you who, query it yourself.

The next concern is the sensitive information that needs to be in prod and gets carried down to nonprod. Information such as SSNs don’t need to be accurate in nonprod, and there doesn’t need to be a way to recreate them from the nonprod data. However, you have to be careful when making this data useless since many apps join on this field, making you have a consistent algorithm to obfuscate the data. Hopefully you have another field such as EmployeeID or ClientID that you can substitute consistently and the developers would never know the difference.

Server Builds

Is there any consistency to the servers? Is the Min/Max server memory set up properly, are they using Instant File Initialization where possible, how big and how many files are there for TempDB, what is MaxDOP set as, what accounts does SQL Server run off of and is it a different account for each server? You need to have consistency if you’re going to consistently manage these servers. Yes, the values will vary, but it’s good to know that they’re set up right before you start getting into the issues. Jonathan Kehayias and Brent Ozar both have good server build docs out there, and I think most DBAs could benefit from reviewing what they have.

Depending on how they do it, the DBA will have some role in setting up failure notifications for SQL Agent as well. I’ve seen this as Database Mail where the DBA does almost everything and as an external program that reads the error logs where the DBA just makes sure the jobs are set to write to the app log on failure. Either way, you have to query the system tables on all the servers to make sure the setting are what you expect them to be.

You also can’t forget about the maintenance jobs. For which I believe everyone can benefit from Ola Hellengren’s Maintenance Solution. You may not use all of it, but the parts you do use you won’t write this thoroughly. Ola did something amazing here, put it out there for free, and you’d be crazy not to at least look at his generous offerings.

Staying Up-To-Date

It’s best to find out up-front what the plans are to move forward with getting everything to SQL 2012 or 2014. Even below that level, it’s good to know the process of getting all servers to the latest service pack and what the plan is with hotfixes and cumulative updates. The quicker you know this information the quicker you’ll know what’s on your plate for the next couple months.

How are we doing?

How do other departments view the DBA group? How do our customers (end users of any type) view our overall performance? How will you ever know if you don’t ask? The answers to questions like these are going to make a big impact on where your focus is. I would love to have questionnaires go out to Sr. Management, Mid-Level Management, and to End Users to find out their thoughts on a lot of different things, with a mix of open-ended questions such as their 3 favorite and 3 least favorite things on different areas and simple ratings questions with a rating from 1 to 4 (there is no average, there’s horrible, bad, good, and great). I want to know what they hope the app can do differently a year from now, how they view performance, and what their biggest headaches are.

In my case I made a mistake here when starting my new job because I came in knowing they had severe performance issues. When I looked at the different servers I saw a 2 TB OLTP with PLE dipping below 1,000 at times and a 100 GB OLTP with PLE staying above 10,000. I KNEW which one they were complaining about performance on and starting making some huge improvements. I was wrong, and that 100 GB database needed a lot more help then it was letting on. That’s not to say that the improvements I made weren’t needed or appreciated, it’s that I was fixing stuff that internal users were complaining about while waiting for another manager to ask me why I wasn’t focusing on making life better for the people who were paying our bills. It flew under my initial radar because it was smaller waits and coding issues while it was a constant struggle for the users that could pick one of our competitors over us.

I made a couple large changes that the SAN admin should notice on the 2 TB database and I got a couple “good job” comments out of it. Then I made a smaller change on the 100 GB database and got genuine gratitude and appreciation from multiple people. It wasn’t that I was working harder, it’s that my effort was focused in the right place after knowing where we stood and where our pain points actually were.

Finally, Monitoring

Once you know what you have, if it’s being backed up properly, it has security, is built consistently, there are plans to stay up-to-date, and you know where the databases stand from others’ point-of-view, you are finally ready to start looking into monitoring your servers. Look at my Monitoring and Baselines Presentation to give you an idea of what I’m looking to watch on each server. To be fair, we have Idera at my new job and I think that works out rather well, so there are things that I don’t have custom code watching. What I do is make sure that there is something accessible that watches everything I need and most of the things I want without duplicating too much work.

I do have complaints about Idera, and perhaps I’ll get into the pros and cons about the product in a later post. As far as this type of product, it is by far the one I have the most pros to talk about, so if any of the folks at Idera are reading this then you shouldn’t be sweating too much.

Wrap it up already!

This is a quick overview of the things that I’d prefer to talk about in my first two weeks on a job. Nothing was hit in-depth, and I could definitely write a whole post on each subject. However, don’t count on that happening. If you want a dive into a subject, ask.

I’d love to hear your comments on this, with the warning that I will stick to my guns completely if asked about wandering outside of the DBA role. I used to stick closer to my job description and requested duties, but didn’t feel like I fully bloomed until I tried to see everything from a higher view, most often from my boss’s view. Everyone organization can benefit from more leadership, and leaders have nothing to do with position in the company. I truly hope that this post encourages at least one or two people into taking the lead and getting the full picture under control, no matter whose responsibility it was.

Disaster Recovery and Business Continuity

Fires, SAN failures, tornados, and car accidents all came up today, yet it was a very good day at work.  I just started my job two weeks ago and made a list of questions to go through with them. After asking the one about when they last pulled an offsite backup and restored each Tier-1 database, I ended up in the middle of a project to restructure the disaster recovery and business continuity plan.

As a DBA I thought my job was to ensure that a database was recoverable from any disaster and that if you focused completely on the worst disasters possible then you would be safe from anything.  In the short-sighted version of things, perhaps I’m still right.  However, in the realistic version I had much to learn, and still do.

What is my job?

First, my job isn’t to make sure the databases are recoverable, it’s to make sure that database recoverability is part of a full business continuity plan.  That goes well beyond restoring a simple database to some server that I have no idea where it’s at or how it got there connecting to an unknown network without an app server.  There is so much more to a DBA’s responsibility in this aspect than I once believed.

If the app isn’t restored with the database then there’s no purpose for the database.  If there’s nowhere for the employees to sit and use the app then there’s no purpose for the app.  Finally, if the you restore too far from your current location then no one except the most senior and dedicated employees will travel to work.  All of this has to fit together for the mere responsibilities of a DBA to matter at all.

Second, the worst of disasters isn’t the most common, and the multiple day turnaround time is unreasonable for something as simple as a SAN failing or a building burning down.  Save the week long turnaround times for the disasters that remove your ability to work in the city you’re in.  Small disasters shouldn’t cost more than a couple days of data and productivity loss.

You need to look at disasters in different levels, and you’ll realize that as you go up the levels that they are increasingly less common.  Therefore you can’t have a simple business continuity plan and expect it to cover it all.  Also, if you don’t have anything then you can’t wait until you can figure it all out before you implement the first levels.

Missing Data

At the most common, you have a processing mistake.  The server and infrastructure are perfectly fine, someone just did something they shouldn’t have done.  Forget about the details, it doesn’t matter to you at this point if it was a developer mixing up < and > (haha, brings back memories of lost weekends), a user deleting the wrong records, or a DBA making a simple mistake.  No matter what, your challenge at this point is the same, you’re missing data and the business wants life back to normal as quick as possible.  For this the business needs to define what level of recoverability you need to have and help ensure you have the budget to do it.  If they never want to lose more than 2 minutes of data, you need log backups.  If an hour is ok then you can get by with Volume Shadow Snapshots (VSS) on VMs or possibly differentials on physical boxes.  If a whole day of data loss is acceptable then you’re home free with daily fulls or differentials.  No matter what you’re talking about here, this should be easily accessible by the DBA without outside help anytime day or night.

Small Hardware Failures

Now lets move on to local hardware failures.  A single drive shouldn’t be a concern because there’s no excuse not to have a minimum of RAID-5 on absolutely every file on a server.  Not saying you don’t have to double-check that the server “Prod-012” isn’t a laptop sitting under a stack of papers on an engineer’s desk, I’m just saying there’s no excuse.  Anyways, motherboards don’t last forever, RAM fails, and if your RAID-5 array is too large then it’s no safer than a stand-alone drive.  If you lose something, how long will it take to have something else take over?  If you lose a couple drives do you have enough spare space to get you by until you can replace them?  As for servers, Active/Passive clusters or VMs are great, one loss and you’re up and running in 5 minutes without too many complaints.  Active/Active clusters are also great if they’re set up right, as long as you’re not double-dipping on the RAM with two instances each set up to use 50 GB of RAM on a box with 64 GB of RAM where the OS will lose the fight quickly.  Standalones, however, are out of luck and need to have a plan in place.  How long does is the business able to deal without that server?  Did the business give you that answer in writing, and is the person in the business responsible for that server currently aware of that potential downtime as of your last annual review of disaster recovery and business continuity?  Wait, do you do an annual review with the business of your disaster recovery and business continuity???  Many companies don’t, and it’s your responsibility to start that conversation.

Shared Hardware Failures

Shared hardware and services also fail, no matter how fault-tolerant you think it may be. Power goes out, do you have a generator?  Networks go down, do you have two connections that can handle the traffic at an acceptable speed coming in through separate hardware?  SANs spontaneously combust, were your data files and backup files on that same rack?  Walk through your server room one day with a system admin and point to each device, discuss what happens when it stops working.  Telephone polls and transformers take time to replace, were your fault-tolerant connections coming in on the same pole?  Now we’re at the point that I’m not expecting you to have a spare on hand, but you still need a plan.  SANs are expensive, so you at least need to know where the money is going to come from if you need a new one overnight.  Again, know your risks, know your recovery time, and make sure the business has agreed to that in writing.

Building Loss

Now for what people tend to have in mind, fires, tornados, floods, blah, blah, blah, they all happen.  I’m in Pittsburgh, so hurricanes are covered in floods and there is no specific earthquake plan.  Is your server room in an area that would get flooded more than once a century?  If so, can it be moved?  If the building your server room in is no longer there, what do you have to recover with?  Offsite backups of databases aren’t enough.  You aren’t restoring databases at this point.  If that’s what you were thinking, you’re in a heap of trouble here.

First, you’re panicked; everyone’s panicked. Where is your plan, it wasn’t in that building, was it?  Who knows where to find the plan since you’re not in a disaster where you can count on everyone being available anymore.  Can everything be restored to a functioning state with acceptable data loss from data stored offsite, and was it tested to prove that it could be?  Remember, if you’ve only restored the database then you’ve restored a lot of valuable, yet useless, information.  Are the app servers in that same plan and up-to-date so the users can use the data?  Where are those users going to sit?  I know, this is on a DBA’s blog, so mostly DBA’s are reading this, we don’t tend to think about a user’s cube, or if their temporary cube is further than an hourly employee would even consider traveling to work.  However, right now our job is business continuity, not strictly databases.

So, say you do have an offsite, easily accessible plan that is written in a way that panicked and stressed employees mixed in with temps to make up for the employees that had to tend to their families instead of showing up for work could understand and implement flawlessly, but what does the business expect?  I keep coming back to this because the answer is usually that they think we have a million-dollar plan while giving us a small budget.  Realistically it may take us a week to have it all together, and we’d never know the full picture without testing it.  However, the business may be there saying that if it’s down for more than four days then we’re out of business, which would really suck after putting all that work in to this awesome, yet lower budget plan.  Make sure everyone knows the expectation and have that offsite plan not only signed off by the business, but the proof that they signed off on it stored offsite as well.  The IT department isn’t the only department stressed out when the data center burns to the ground, and stressed out people like to blame others for everything no matter who is at fault.

Oh no, I never thought that would happen

Finally, the largest of disasters.  What if the entire city where your data center in is inaccessible.  I know, how often does that really happen?  Hurricanes took out New Orleans for months and parts of New York City for weeks, earthquakes threaten to do the same to parts of the country, terrorist attacks are possible, plagues can happen (although I don’t foresee the zombie apocalypse), and I now know where the nearest nuclear power plant is.  Do you have recent backups stored outside of the region that can recover from these types of disasters, and has the business signed off on this?  For some companies it is acceptable for the answer to be that they don’t exist and it’s acceptable to go out of business for these disasters, and that’s ok.  Again, get it in writing, and store that piece of information in your personal offsite storage (yay, Google drive).  If this is the case, you’re protecting your rehirability when a company goes out of business saying their IT department failed them in a disaster.

I provide advice, not business continuity plans

Now it’s time for my disclaimer.  This is by no means all inclusive, and nothing you get from ANYONE’s blog should be taken as such.  Some companies have someone who has the job title VP of Business Continuity, and some don’t.  No matter what the case, it’s your responsibility as a DBA to make sure the DBA team has a part in this process and that the process is implementable for all levels of disaster with restorability that the business is aware of.  Unless you’re the only person in the IT department you shouldn’t have to do this alone, but it is your responsibility to start it if no one else has taken that responsibility.

Monitoring and Baselines Presentation

Here’s my full presentation for SQL Saturday #250 in Pittsburgh this past Saturday along with some notes on what I got out of it.

My Lessons

I learned a lot doing this, and I hope people learned a lot from it. For the people on the other side of the podium from me, the main lessons were in what I put in the abstract; they learned mostly about SQL Server. On my side of the podium though, the lessons were far from the same. Public speaking, creating presentation, prepping for unknown audiences, and seeing the gratitude of the SQL Server community in person. Anything that was on the abstract I learned slightly better, but that didn’t compare to what wasn’t on it.

Public speaking was always a fear of mine, and I tend to be someone who avoids my fears. However, throwing myself out there at the public as a whole was never my thing either, and I’ve been doing half decent, at least in my own mind. That being said, I decided to go all out and push myself a little further. After all, what are the chances of them picking me to present with all those professional teachers, consultants, and MVPs out there throwing up their abstracts. Best case I could say I tried, worst case I was going to throw myself out of my comfort zone and hope for the best.

They picked me, which I didn’t let myself expect. Everyone that knew I put in my abstract was also told by me that I only had a 50/50 chance at best, which was more optimistic than I really was, but I don’t like people knowing when I’m being dismissive of myself. It turns I was wrong a lot, and in every case I was wrong I was glad I was wrong.

Being Forced to Get Better

I’m not one to learn things for the sake of learning them, which is why I sucked in school. To really learn things I need a real-life use case, somewhere I’m going to apply it. I didn’t practice speaking ever before because I didn’t speak in front of large groups. I didn’t know how to put together a presentation properly because that’s not my thing…well, wasn’t is probably a better word now. Just like a couple months ago I wasn’t quite sure how to blog, but I’ve made it past that point.

Like everything in my life, blogging went from unknown to addiction quick. I’m not sure that I can say the same thing about speaking, but I can see the possibility. SQL Saturday only comes to Pittsburgh once a year and the local user group, which I’ll be joining soon, only meets once a month and has a single speaker. However, I can’t say that I didn’t look up when the SQL Saturdays in Cleveland and Washington DC were. Also, I took my own thoughts on my presentation and the feedback I received in and immediately starting thinking about how I could make that presentation better.

What’s below is the original, not touched up at all in the last 4 days. In part to show where I went wrong and how I’m going to fix it, and in part because Monday was my first day at a new job. By the way, starting a new job when between jobs is the best and worst thing you could ever do to yourself. You don’t have time to be too nervous about anything because you’re too overworked, but you’re also too overworked. Eh, you can’t win them all.

Well, on to the presentation. What you have here is my script that I talked through and taught myself before the presentation. However, if you were there, you’ll notice this isn’t exactly what I said. I didn’t read it, I presented it. There were no demos (the biggest complaint from the crowd) to avoid me from stumping myself, and the PowerPoint was just a whimsical picture for the start of each paragraph to keep me on track because I knew I’d be nervous and lose my way without a paragraph-by-paragraph guide helping me along.

I must have done at least descent because my reviews came back with two 3’s and the rest 4’s and 5’s out of 5. You’re just asking me to get off topic and start talking about why I hate ratings that are odd numbers, but no, I’m not getting off topic, I’m getting on to my presentation!!!

The Presentation

Download slide deck here.

I’ve been working on databases for over a decade, and most of that without having good standards or monitoring in place. My job was rough and I was much less effective. Now I know what I’m doing and I want to share that knowledge. In this presentation I am going to tell you what I watch and why I watch it. A presentation is a poor format to go through how to do this; having it in writing is much more beneficial to you. That is why I’m pretty much skipping that portion here and diverting you to my blog at SimpleSQLServer.com. Not so much to promote a blog that I lose money on, but instead to give you the resources in the best format.

As a DBA you will hit problems with the performance of your databases. It doesn’t matter if it’s one process or across the board, or if it’s just today or it has always been that way. No matter what it is, the more you know about your servers the easier it’s going to be to fix them.

It’s not easy, there’s no single spot to watch, no single solution to all of your problems, and there’s no “normal” values for these counters you can apply across every server you manage (Note: Thank you, Mike John, you stressed this point a lot). I watch several things on all the important servers, and most of these on every server. The important part is that you watch them continuously, even when you aren’t expecting to use the data. Some parts are cumulative and you can’t tell what was during an incident or what was from overnight maintenance. Other parts are snapshots and there’s no looking back.

(Note: Now I’m on to the stuff Brian Castle taught me. He’s the best you could hope for in a manager, and last I checked he was still hiring at EDMC in Robinson Township near Pittsburgh, PA)

To me, monitoring and baselining is the same thing. I know others will do a specific baselining process on a specific day and keep that. I feel you lose too much doing that and I watch enough to say that I have a continuous baseline going back for 13 months in most instances. I’m not all that worked up about lucky numbers or anything, it’s just nice to be able to say “that annual process we ran last week ran like this last year”. If you’re superstitious or have OCD, 400 days makes me just as happy.

Traces – Snapshot
Wait Stats – Cumulative – Resets on restart
Blocking – Snapshot
Query Stats – Cumulative – Resets on recompile
Index Stats – Cumulative – Resets on restart
OS Perf Counters – Varies
Database Sizes – Snapshot
Table Sizes – Snapshot

Traces – What long running processes have completed on the server?

You have to be careful, this one can kill your server. I have, on more than one occasion (sorry, Brian), caused a large server to reboot in the middle of the day because I filled the drives writing 40 GB of traces in 5 minutes. That being said, there is a safer way to trace, just don’t do an unfiltered trace catching everything.

I personally feel that absolutely every server should have a trace capturing RPC:Completed and SQL Batch:Completed that ran over X seconds duration. What is X? Well, that depends on the server. I’ve seen the best number for X be as low as 100 ms and as high as 10 seconds. Start high, and work your way down. You can add or do a separate trace for the Statement Completed for each of these if you want more detail.

You can get all of this except for SQL Batch:Completed with the text data from Extended Events, and that is a very good alternative. I haven’t made that jump for several reasons, although I would recommend it. First, this level of tracing hasn’t hurt me. Second, I still support several important servers that are still on SQL 2005 and I want to be as consistent as possible to provide as consistent as possible support.

What you’re trying to solve here is answering what ran long, have some hints as to why it ran long, and be able to say how often it has been running that way in recent history. A long duration doing very little work was waiting on something else, and other monitoring will help solve that problem. A lot of CPU, reads, or writes shows that you may need to look into tuning or statistics. Just keep in mind that reads can be reads of work tables, and writes can be writes to tempdb.

If you do it right you should have 4 or more days worth of history, and by doing it right I mean having 5 to 10 rollover files sized a reasonable size you can send off if needed (100 MB at most, they zip well) and capturing over the best duration threshold for your environment.

Wait Stats – What’s slowing you down?

The traces tell you what work was done, and the waist statistics tell you what went on when the query couldn’t actively do its work. This is cumulative over the server, so you can’t get details down to an individual process, however, you can see how much time is wasted and where it’s wasted at.

There are almost 500 distinct wait types in 2008 R2, but you don’t need to worry about that. The important thing is to know what are your biggest waits are, and you can look them up if you don’t recognize them. In the last slide I have links to a free eBook written by Jonathan Kehayias and Tom Kruger for the Accidental DBA that does an amazing job documenting what you can ignore and the meaning of the big ones you’re most likely to see.

If you’re looking for overall server health then you would want to look at waits over a long period of time, specifically the times of day you’d like to see better performance. For incidents, you want to look at what your wait stats are now and compare them to your running baseline. For me, the typical baseline I use is the same timeframe yesterday and 7 days ago. These are actual relevant days that have all but the changes you put in over the last couple of days, and it’s a true baseline for this specific server. If this server never waits on PageIOLatch, but it’s in your top 3 today then you instantly have a direction your heading in troubleshooting.

Blocking – Who’s in your way?

Locking occurs when one query is using data, and blocking occurs when another query needs to do something that is incompatible with that lock. If you don’t keep an eye on it then you’re looking back at a trace and seeing that a query took 1 second of CPU, did 1,000 reads, and no writes, yet it figured out a way to take 5 minutes in duration. The users complain the app is slow or unusable, and you’re giving them the answer that you’ll try to figure it out next time it happens and THEN try to find out the root cause.

That’s a bad idea. There are three types of users – those who don’t know you because things are running smooth, those that like you because you can say “I see exactly what happened and I can start working on avoiding reoccurrences”, and those users that you keep telling that you’re not sure what happened and you’ll try to get a better idea the next time it happens. I wish I could tell you that I could have more users not know your name, but watching blocking is a very easy way to get the users that like you to outnumber the users that hate you.

So, what can you do to tell them you know what just happened? My answer is to capture everything that’s blocking on the server every minute. Sure, there’s going to be a lot you capture that is actually only blocking for 10ms, and there are going to be things that were blocking for 59 seconds that were timed perfectly so you never saw it happen. The trick is to know that there is no perfect solution, and capture what you can. The DMVs are there to provide all the information, and being able to capture that on the fly along with in a proactive monitoring setup will make you look amazing to the users. If something blocked for 5 minutes, you have no excuse not to say “This query blocked this query, and this was the head blocker”. Sure you may have an excuse to say “It’s a vendor database and I passed it on to their support”, but at least you can tell the users something, which is always better than nothing.

Query Stats

The DMV dm_exec_query_stats is used by many DBAs, even if all we realized was that we were running one of the “Top 10 Most Expensive Query” scripts off the internet. The problem is that this is using it wrong. This DMV holds everything from when the query first went into cache and loses that information when the query goes out of the cache, even just for being recompiled. So the big process you ran overnight could still be in cache, leading you down a path that’s actually low priority for you. However, that relatively large process that runs every 5 minutes that just recompiled a couple minutes ago due to stats auto updating, it’s not even in there. This is too volatile to say that you’re getting good numbers out of it.

The answer isn’t to find somewhere else to get this information, it’s to capture this information more often and more intelligently. Capture it once and you have a snapshot of what it looked like, capture it again an hour later and you know what work has been done in that hour. Repeat the process over and over again, constantly adding another hour’s worth of data and saving off the latest snapshot in the process and you have some real information to go off of. This isn’t so much of what happened to be in cache at the time, it’s a pretty good summary of what ran and when it ran. Now you can query in detail, specifically saying “I want to know what queries are doing the most physical reads between 9:00 AM and 5:00 PM on weekdays so I can have the biggest impact when I tune this database.”

Index Stats

The DMV dm_exec_index_usage_stats isn’t too different from dm_exec_query_stats except that it’s cumulative since the time SQL services were started. Looking at a snapshot still leaves you half blind, unable to see what happened before the last time you rebooted the server and unable to tell when that index was used. Personally, seeing when it was used is less important to me than how much it was used long-term. Tracking this can do amazing things for you if you’re really fine-tuning a database.

This DMV is really under-used in my experience. As time goes on with a database the data changes and you add more indexes to make it run faster, but you don’t know if or when the index it used to use was being used by anything else so it stays there. Over time that means you have more and more unused indexes, and there’s no solid proof that they aren’t being used without you doing the work to collect that proof. This, in addition to duplicated indexes, adds to the workload for inserts and updates, adds to the workload of index maintenance, adds to the database size, and put additional strain on your cache and thus the PLE.

OS Performance Counters

This is one of the most misunderstood DMVs while also being one of the most useful. It’s giving you critical counters that give you an idea of how SQL Server is interacting with the hardware, but it’s doing it with several different types of counters that have to be measured differently. There are a couple great posts out on the internet to understanding how each one needs to be measured. As for right now, we’re focusing on why you want to watch it and what it will do for you. And wouldn’t you know that I saved the hardest one to describe for last, because each value is different and there will be controversy on which ones you should be watching and which ones are just a waste of your time monitoring.

PLE is one that this is no controversy about if you should watch it. This is the average age of the pages in cache, and a good measurement of when you’re doing too many physical reads. Peeking in on this from time to time is good, you can make sure it’s high enough. However, watching this will let you know when it’s dropping and help you dive into why it’s dropping. If you’re also running a trace you can see what did a lot of reads in that timeframe, and if you’re capturing Query Stats then you can find out which one of your large read queries is doing all the physical reads that destroy your PLE.

Server Target and Total memory are also great to watch even though they rarely change after the server is back up to speed following a restart of SQL services. The target memory is how much SQL Server would like to have, and Total memory is how much it actually does have. If target memory drops then there’s OS pressure you need to worry about. Also, as Total memory is increasing then it means the server is still filling the cache and PLE can pretty much be ignored. Of course it’s low, everything in there is new.

Page Lookups, Page Reads, Page Writes and Lazy Writes give you a better idea of how SQL Server is interacting with your cache and disks. If you want a real picture of what SQL Server is doing behind the scenes, this is much more useful than glancing at cache hit ratio.

Deadlocks and Memory Grants Pending are two things you’d like to always see at zero. That may be out of the question with deadlocks, but if you’re getting above zero on memory grants than you need to find out when that happened and everything that was running at that time.

SQL Compilations and Recompilations are hidden CPU hogs. They’re never going to be at zero because very few servers have stuff stay in cache forever, ad-hoc code runs, and several other reasons. However, when one of these counters jump up, you may have a hidden cause. For instance, if a piece of code that runs every minute or more was written in a way that it can’t be stored in cache, these numbers will be noticeably higher. When this happens, your CPU will be noticeably more stressed with no other indicators as to why, and this query won’t even show up in your Query Stats as it relies on showing the stats of everything that’s currently in cache.

Database Sizes

This seems simple, and it is. Most people know how to find the sizes of their databases, but do you track it? If not, you should. Some questions you should be able to answer readily are how fast your databases are growing, when will you run out of space, and is the steady decline in PLE justified by the growth of the database. Getting into more details, you can touch on when does the database grow such as an accounting app jumping in size every April, or answering if the data is growing steadily or exponentially. These are important things to know when it comes time to budget for new servers and disk space.

On the other side of things, you also have information to push back on application teams on how much space they’re using and if that’s really necessary. A lot of times you’ll find that they’re surprised by their growth to the point that you have to show them the numbers for them to believe it. It’s not uncommon for the response from this to be a cleanup project that helps keep the databases smaller and running faster, which is a big goal being accomplished.

Table Sizes

This is just building off of watching your database sizes in more detail. When a database starts filling up, and it’s filling up quick, it’s good to tell an application team what table is doing it. In my current environment it makes sense to watch every table that is over .5% of the database size AND over 100 MB. This means that some databases we don’t watch anything and other databases we watch about the 20 biggest tables. The biggest point is that we’re not trying to watch everything while watching anything big enough to make a difference to us.

If a database is filling a lot faster than normal then a lot of the time there’s a process that isn’t running like it should. The biggest table in the database may be a rather static value, but the third biggest table in the database wasn’t even on your radar two weeks ago. The app teams love it when you can tell them that not only is the database growing out of control, but we also see the growth in table X which has been growing at 200 MB per day starting on the first Saturday of last month. By the way, wasn’t last Saturday the day you changed a couple procs around or implemented an upgrade?

Now you just graduated from “Hey, there’s a problem” to also include “and here’s a huge lead to finding the root cause.” It’s rare to use this information, but it’s lightweight to capture for something that gives you a heavyweight appearance when you reference this knowledge to other teams.

My Critiques

I’d love to hear what you think about this presentation. Before we get to that, here are the complaints from the most harsh person in the room when I was presenting…

I didn’t do enough to say what each thing was. Although I marked the presentation as intermediate, you don’t know who’s going to be there. In this case, I knew a couple people in the crowd, and they ranged from data analyst to database manager. It should have been presented in more of “If you’re not ready for intermediate then you have everything you need, but it will take some effort to keep up”.

My slides were lacking. I stick with the idea that there should be few words on the screen, but I took it too far. I admitted above that the slides were there more to keep me on track then it was to help the audience, which I probably needed for my first public speaking venue. However, next time I’ll be less nervous and more prepared to make it geared for the audience better. Instead of random pictures, I should have more of graphs and data that I can talk through to give the audience visuals.

Nothing was given to the audience, and there should have been something. To be fair, none of the presentations that I know of handed anything to the audience. However, I set my own bar, and I feel that people would like a sheet of paper with the outline on it with links to online sources. My blog, of course, because it shows how to grab everything. Also, an outside link for each item. This gives a physical reminder to turn this knowledge into action.

There was no demo, and that was the audience’s biggest complaint. While I’m still not sure that I would do a live demo on my second go around, I’m going to hold myself on the remark above about my slides having graphs and data. That would also drag me away from the podium to talk through the slides, making it a more dynamic and interesting presentation; that would really help the audience take things in.

Your Turn

Now I need you to do two things. Put this information to work in your environment is what will help you the most. What will help me the most is you commenting on this post both before and after you implement any of this so you can help me become a better presenter.

Thank you for taking the time to read this!

Monitoring Database and Table Sizes

Trending database and table sizes helps give you an idea of what to expect, and, sometimes, points out problems and their root causes. I even go so far as to monitor the disk space, file growth rates, and have a report going out telling me that “according to your current growth trends, you’ll run out of space on this day”. That last part will have to be another post since getting disk space can be kinda tricky depending on how your disks are set up; you need to go outside of SQL Server to grab the size of mountpoints if you use them. Don’t worry about mountpoints right now, if you don’t know the term that usually means that you can swap out the word “drive” and have everything I say be accurate for your environment.

First, this is lightweight stuff, there’s really no reason to capture it more than once a day, and it will only add up to several MB for a year’s worth of data. You’re not going to kill yourself doing this, you’re not going to stare at the data every day taking up your time, but you’re going to look like a heavyweight when you put this into action and reference this data to others.

Second, I do have other similar posts doing snapshots of this same info for Table and File sizes. It’s basically the same thing just different in how you’re using it and if you’ll have a historical view of it.

Database sizes

Lets start with the databases because, well, they’re bigger. I capture it all, the database name, logical file name, file type (row, logs, etc), filegroup, allocated space, used space, max size, growth rate, if it’s percent growth, and the drive/mountpoint it’s on. Other than dividing the appropriate numbers by 1024 twice to store all my sizes in MB, everything gets stored in native format. The reason for that is you have raw data in a format easy for you to verify and easy for any outsider (consultant, new employee, etc) to understand everything you have. I may make some of you cringe because I don’t normalize this; it’s small data and not worth the effort, especially with page compression (see Brad McGehee’s post on compression) turned on.

What I like doing is having automated reports with logic behind it saying “if drive size is X, file growth is Y, then you can grow to a potential size of Z”, follow that up with “the used space in this database has grown X in the last 90 days and Y in the last 30 days, so according to those trends I’ll hit the potential size in Z days”. One of my favorite reports takes that last Z as a parameter and tells me anything that will run out of space before a specific number of days. I get that report daily for anything that will run out of space in the next two weeks and a separate instance of that report monthly telling me what will run out of space in the next 100 days. For me to run out of space without having a couple automated emails sitting in my inbox yelling at me saying “See, I told you that would happen!!!”, a database has to do something crazy and out of character. The only part of that last sentence that doesn’t really happen is the automated email yelling, those voices don’t exist outside of my head (do they?).

Even without monitoring disk space, you can predict autogrowths. This is useful for trying preventing autogrowth all together if you’re into that, but with instant file initialization this has never been a spot I’ve chosen to spend my time on. However, I do care about fragmentation, and I have a report that will tell me if we continue growing according to our 30 or 90 day trends then the data files will have to grow X times. If X is higher than 4 and filegrowth is below 2 GB chunks, it shows up on my report and I change the filegrowth to make it more reasonable. If you allow percent filegrowths in your environment (just say no), or even if one sneaks through, things get complex and you have to stroll through financial websites to find the calculations for compounding interest to figure out how much the files will grow.

Cast(CEILING(LOG((Cast((ProjectedGrowthInAMonthMB - FileGroupFreeMB) as Dec(20,4)) + Cast(FileSizeMB as Dec(20,4)))/(Cast(FileSizeMB as Dec(20,4)))) /LOG(1+(Cast(Growth as Dec(20,4))/100))) AS Int)

Something inside that database is growing, but what is it?

Now if a database is growing, the first thing someone will ask you is “why is it growing”. That’s an answer I can’t give you, but “where is it growing” I can do. This is why I’m also tracking the table sizes in a database. Be fair to yourself here and admit it up front that you don’t care if a table changes size radically if the largest size is still insignificant to that database. So make up rules on “I want to watch tables that are bigger than…” and only store the stats for those tables to keep your little collections from being the source of your size issues. For me, that means that a table has to be both over 100MB in total size including indexes and over 0.5% of the size of the database. I’ve used these sizes for a while now and have never had an issue on either side; no one needed info I didn’t have for space trending issues and the collection time and space I used never caused any complaints. This isn’t saying that someone won’t ask “did that table have 10 or 11 records this time last month”, but I’m not willing to collect enough to answer every little question.

If a database is growing at an alarming rate it’s easy to assume the largest table is causing it. That’s not always the case, and it’s not too uncommon for a database change to either create a new table that grows quick intentionally or accidentally cause an old table to stop cleaning itself up. No matter what the reason is, if you want to see how a database is growing, look at your table stats documenting the growth. Developers always listen to you better when you tell them this is what happened and here’s how I know while the words “I think” are viewed as instant discredidation.

How am I supposed to do this stuff?

You’ll need somewhere to store all the info. I like to store two weeks of info on the local server and pull everything to a central server nightly where data is held for a full 13 months. However, to keep me under control, lets just focus on getting everything to the local server in the following tables.

CREATE TABLE [Drives](
    [DateAdded] [smalldatetime] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
    [CapacityMB] [int] NULL,
    [FreeSpaceMB] [int] NULL,
 CONSTRAINT [DBInven_Drives_DriveLetter_MountPoint_DateAdded] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DriveLetter] ASC,
    --[MountPoint] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Files](
    [DateAdded] [smalldatetime] NOT NULL,
    [DatabaseName] [sysname] NOT NULL,
    [FileName] [nvarchar](128) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [FileGroup] [sysname] NULL,
    [SizeMB] [int] NULL,
    [UsedMB] [int] NULL,
    [FreeMB] [int] NULL,
    [MaxSizeMB] [int] NULL,
    [Growth] [int] NULL,
    [IsPercentGrowth] [bit] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
 CONSTRAINT [DBInven_Files_DatabaseID_FileID_DateAdded_U_C_IX] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [FileName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Tables](
    [DateAdded] [datetime] NOT NULL,
    [DatabaseName] [nvarchar](128) NOT NULL,
    [SchemaName] [nvarchar](128) NOT NULL,
    [TableName] [nvarchar](128) NOT NULL,
    [RowCounts] [bigint] NULL,
    [AllocatedMB] [int] NULL,
    [DataSizeMB] [int] NULL,
    [IndexSizeMB] [int] NULL,
    [PercentOfDB] [decimal](5, 2) NULL,
 CONSTRAINT [PK_Tables] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [SchemaName] ASC,
    [TableName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If you use mountpoints, and if you call them “mountpoint” followed by an incrementing digit, and you never have more than can fit into a tinyint, uncommenting those pieces of the tables will do wonders for you. In case you didn’t guess, they’re commented out because the people who do use them probably don’t use them that way.

DECLARE @RunTime SmallDateTime
DECLARE @SQL NVarChar(max)

SET @RunTime = (SELECT Max(DateAdded) FROM Perf.DBInven.Drives)

SET @SQL = N'use [?]; 
    INSERT INTO Perf..Files 
    select ''' + cast(@RunTime as nvarchar(50)) + ''' 
        , DatabaseName = db_name()
        , FileName = f.Name
        , TypeDesc = f.Type_Desc
        , FileGroup = fg.Name
        , SizeMB = ((f.size*8/1024)) 
        , UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
        , FreeMB = ((f.size-FileProperty(f.name, ''SpaceUsed''))*8/1024) 
        , MaxSizeMB = Case f.Max_Size
            WHEN -1 THEN -1
            ELSE f.Max_Size / 128
            END
        , f.Growth
        , f.Is_Percent_Growth
        , DriveLetter = left(f.physical_name,1) 
        /*, MountPoint = CASE substring(f.physical_name,4,10)
            WHEN ''MountPoint'' THEN 
                Case isnumeric(substring(f.physical_name,14,4)) --MountPoint >= 1000
                WHEN 1 THEN substring(f.physical_name,14,4)
                ELSE Case isnumeric(substring(f.physical_name,14,3)) --MountPoint >= 100
                    WHEN 1 THEN substring(f.physical_name,14,3)
                    ELSE Case isnumeric(substring(f.physical_name,14,2)) --MountPoint >= 10
                        WHEN 1 THEN substring(f.physical_name,14,2)
                        ELSE substring(f.physical_name,14,1) -- MountPoint single digit
                        END
                    END
                END
            ELSE Cast(0 as varchar(10)) --Root drive is MountPoint 0 
            END*/
    from sys.database_files f
        LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id
    '

exec sp_MSforeachdb @SQL

And now to grab the tables. Keep in mind that I specifically only want to grab stuff over 0.5% of the database size AND over 100 MB. You’ll see that towards the end of the code.

DECLARE @SQL NVarChar(max)
DECLARE @DateAdded SmallDateTime

SELECT @DateAdded = MAX(DateAdded) FROM Perf.DBInven.Drives 

SET @SQL = ' Use [?];
    DECLARE @DatabaseSize Float

    IF OBJECT_ID(''tempdb..#TableSizes'') IS NOT NULL BEGIN
          DROP TABLE #TableSizes
    END

    SELECT DatabaseName = DB_NAME()
    , TableName = a2.name
    , SchemaName = a3.name
    , RowCounts = a1.rows
    , TableSizeMB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , DataSizeMB = a1.data / 128
    , IndexSizeMB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
INTO #TableSizes
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> ''S'' and a2.type <> ''IT''

    SELECT @DatabaseSize = Sum(TableSizeMB) FROM #TableSizes

    INSERT INTO Perf..[Tables]
    SELECT ''' + Cast(@DateAdded as VarChar(50)) + '''
        , DatabaseName
        , SchemaName
        , TableName
        , RowCounts
        , TableSizeMB
        , DataSizeMB
        , IndexSizeMB
        , (TableSizeMB/@DatabaseSize) * 100 
    FROM #TableSizes
    WHERE TableSizeMB > .005 * @DatabaseSize 
        AND TableSizeMB > 100

    DROP TABLE #TableSizes'

exec sp_MSforeachdb @SQL

So this leaves us with the size of the drives. This isn’t an easy subject. You can get everything you need through SQL Server if you don’t use mountpoints, but you’re left with either Powershell or VBScript to capture it with mountpoints so you can query WMI. Another post, another day. I’d actually prefer to rewrite my VBScript as PowerShell to both learn PowerShell and put a better solution out here for you.

This was my last post I needed to prep for my Monitoring and Baselines Presentation for SQL Saturday where I talk about what you want to watch and why. Since the presentation didn’t get into how to watch everything, I made sure that everything was on my blog ahead of time.

Why Blog?

The main reason I started blogging was a community thing. I leeched from the community to get me to where I am in my career, and it was to the point that the community was owed something in return. Everyone can give back in their own way, some answer forum questions, some blog, and some people find their own special way. For me, trolling for forum questions that I could answer that weren’t already answered adequately wasn’t for me; nothing against those folks, it’s just my preference. So I decided that I was going to put things in my own words.

My words were on the tip of my tongue, but there was a road block. Please, don’t laugh at me too much, but I didn’t know how to get my words out there. All these people had cool blogs where they programmed ways for people to leave comments, they programmed this odd little word cloud thing for links to keywords, WOW…they programmed ways to have keywords. I’m out of my league, I’m a SQL nerd, I don’t do this stuff, I can’t do this stuff. When I’m not too busy being bitter, it’s actually quite humorous how I took the long way around to discover WordPress. Those “geniuses” out there knew something I didn’t, and the thing they knew that I didn’t wasn’t how to be a master programmer of cool toys. Man, I feel like a fool!

I’m not selling WordPress here, it’s what I use. Other platforms are comparable; I know that now. The important thing to know is that if you can write an email, you can write a blog. The only trick is copy/pasting code from your editor to WordPress, which I found a cool tool to help you do this. There, my secrets are out, I just told you how to bypass all the big bumps in the road that looked like brick walls to me.

WAIT!!! It’s not that simple. I started and it wasn’t what I expected; same as everything in life, I guess. Blogging wasn’t a small little thing you do on the side and carry on with your life. It becomes a part of your life. 10 people viewed this stuff, and this is stuff that would have revolutionized my career 10 years ago if I would have read it back then. How can I get more viewers? I’m obsessed with view counts, I want this message to go EVERYWHERE!!! How can I write it better? I want the message to mean something and be something people want to read. How can I get myself to stop writing in spurts with 4 posts this week, no posts that month?

I hunted down my favorite blogger and begged for help even though he didn’t even know my name. He was quick to give advice and search his site doing the following searches on Google “site:brentozar.com blogging”, “site:brentozar.com writing”, “site:brentozar.com presenting”. AMAZING and SCARY all at the same time. You’ll get information ranging from starting blogs to who your target audience really is (and shooting yourself) to the 51 steps to make an awesome presentation. Some people show you the rabbit hole and let you go as deep as you want. I felt like I hit the bottom of the bottomless rabbit hole with a solid thud. Brent, I’m not sure if you’re reading this or not, but you’re the only person out there without a college degree who deserves a doctorate in communications on top of your Microsoft MVP and MCM titles you have earned. This is a cross between a “Thank You” and a “Why Me???”. More of a thank you, though. You taught me a lot.

Twitter_BrentO_BlogAdvice

Don’t go thinking that you’re going to hear Brent’s point of view restated here though. I enjoy what he has out there, but my point of view is my own. For instance, I just received an automated email from his group saying “But if you want to be a professional DBA, you should leave the developing to the developers.” in reference to creating tools to monitor the servers, leaving me almost with the impression of not reinventing the wheel, while my blog is largely comprised of a blueprint for a wheel. He continues with “I’m not saying you shouldn’t learn the dynamic management views (DMVs), how to dig through them to find the evidence you need, and how to turn that data into actions. Hopefully, I’m going to teach you how to do a lot of that over the course of the next six months.”, so maybe I’m mistaken. You know what, I’m off subject now and you really just need to got to his free email-based accidental DBA class and make up your own mind.

Back on topic now… The blog is now syndicated, so it’s showing up on SQLServerCentral.com and ToadWorld.com. More over, about 1/4 of my posts are highlighted by SQLServerCentral.com’s emails that go out to thousands of people and make my views go through the roof. At this point, on September 3rd, I have already had more views on my blog for September than I did for the entire month of July! I’m also writing ahead of time, scheduling posts to go out at 4:00 AM my time (ET) on the second business day of each work week (Wednesday for this post), giving me time to review my work while also providing some consistency and a scheduled commitment for me to keep myself inline. So many things have changed, all for the better, and I’ve only been at this for about 4 months by the time this is scheduled to go public.

It’s still not that simple, it’s not what I expected on so many other levels as well. I was just trying to give back to a community that gave to me. That’s part of it still, look at how that small part just blends into the full picture:

  • Community
    • I took a lot, giving a little is the least I can do
  • Expertise
    • I’m learning the little details to be able to explain it to others without misleading them.
    • If you can’t explain it simply, you don’t know it well enough. – Albert Einstein
      • This is my blog’s namesake, in case you were wondering.
  • Confidence
    • What I’m writing is being read by and used by the community. People look up to me, even if that’s only in my own mind.
    • Learning it well enough to teach makes you really learn it and be more confident.
  • Credibility
    • It is on my resume and employers LOVE it.
      • I’m excited about the technology I use at work enough to spend my free time on it.
      • There are examples of my work beyond the one-liner on my resume saying “I know how to watch for performance issues.”
      • You can feel confident giving me any employee and know that I can and will teach them.
      • Careful here, this means you have to write even more betterer to not make fool you out of. 🙂
  • Warm Fuzzies
    • I know if I had this early in my career, I would have been down the rabbit hole of system tables and DMVs years ago. It would have changed things for me, and I can only imagine that I’ve had that effect on one or two people.
  • Vengeance
    • Once you know about the DMVs, there is no going back. You took the wrong pill and you’ll never wake up knowing that the world was what you once thought it was. Welcome to the Matrix!
  • Obsession
    • Ok, it couldn’t all be good. I want more read counts, I want the message out there more, I keep looking for read counts, I keep looking to better myself, I keep looking to better my posts, aaaahhhhhh, I’m trapped, help me!!!!

Also, without this, I wouldn’t have the confidence and background to say that I’m ready to be a presenter. However, here I am putting the last details into my part of SQL Saturday #250! See an outline of my presentation right here on my blog.

SQL Server Central highlighted my posts on weeks 20, 21, 25, 26, and 34.  Steady growth otherwise.

SQL Server Central highlighted my posts on weeks 20, 21, 25, 26, and 34. Steady growth otherwise.

So, here I am, 4 months and 10,000 reads (WordPress + SQLServerCentral Syndication) later, and I’ve learned a thing or two.  By no means am I perfect; this is a continual learning experience.  For all those who have something to say out there, stop saying you have something to say and start saying it.  Syndicate it if you want others to read it.  The readers will come with time, just keep writing and strive for something worth a syndicator highlighting in their newsletters.  Who knows, maybe I’ll even make it to the SQL Rockstar Rankings someday. Read about writing to improve yourself.  Then, finally, be ready for a better you.

If you do blog, what has it meant to you? If you don’t, why not? I’d love to hear more people’s take on this as everyone has a unique experience and thought process.

Query Stats

DBAs rarely use the full potential of sys.dm_exec_query_stats.  It’s common to see the queries for looking at the most expensive queries according to any of the stats contained within the current cache, which is great to see.  However, if you grab snapshots of this information the proper way then you can get more detail that will really make a difference in your job.  Capturing this every 4 hours is probably enough, and here’s what it’s going to do for you.

First, using it the way most DBAs do and look at the most expensive queries according to a specific metric.  It works the same for them all, so we’ll just say we’re interested in physical reads in this case because our PLE isn’t what we’re hoping it could be.  You’ve probably seen the queries that directly query sys.dm_exec_query_stats to give you this similar to the one on MSDN for this DMV.  That will give you everything that’s in the cache right now, which could contain queries that recompile every hour in combination with other queries that have remained in the cache since the server was rebooted two weeks ago.  This means that when you’re querying to see what you can do to improve your PLE during business hours you’re getting anything that happens to be there now, not what ran in the time window you’re looking to improve.

Now look at it from the point of view that you capture this information every 4 hours.  You can get the total number of physical reads grouped by sql_handle where the interval_end time is in the last month, on weekdays, between 9:00 AM and 5:00 PM.  This is both a more complete and more accurate picture of what you care about, and you’re more likely to be focusing on the queries you should be tuning from the start.

Once you do start tuning, you want to know what effect you really had, and not just rely on the couple sets of test parameters you used in your NonProd environments.  A couple days after your changes made it to production, it’s a good idea to make sure you had the desired effect.  Query the data you have saved off for the last three days and for the same days last week.

What I went over so far is just realizing how this takes the traditional use of this to a new level, but there’s more functionality here now that it’s being saved off.  Now this is extremely useful in incident response.  Before you could see if anything was hogging resources as seen in my Recently Recompiled Resource Hogs post, but now you take it a step further by being able to verify if what’s running long now has always run that way or if something is different.  Also, with having the plan_handles saved off you can even see if it’s getting a different execution plan, which will help you determine if this was a change in data or if it’s a bad execution plan.

Not to shoot myself in the foot, but this isn’t perfect.  If you capture this data every 4 hours starting at 12:00 and it recompiles at 3:59 then you’ll only capture one minute for that interval.  Also, if you’re dealing with a query that can’t be cached then it won’t be found here at all; this is why you’re also running traces to find the big queries and watching sys.dm_os_performance_counters to know when there are a lot of compiles and recompiles.

To start capturing this, here are the tables you’ll need to capture everything.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

/*
IF OBJECT_ID('QueryStats') IS NOT NULL BEGIN
    DROP TABLE QueryStats
END

IF OBJECT_ID('QueryStatsLast') IS NOT NULL BEGIN
    DROP TABLE QueryStatsLast
END

IF OBJECT_ID('QueryText') IS NOT NULL BEGIN
    DROP TABLE QueryText
END
--*/

CREATE TABLE [dbo].[QueryStats](
    [interval_start] [smalldatetime] NOT NULL,
    [interval_end] [smalldatetime] NOT NULL,
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [execution_count] [int] NOT NULL,
    [total_elapsed_time_ms] [int] NOT NULL,
    [min_elapsed_time_ms] [int] NOT NULL,
    [max_elapsed_time_ms] [int] NOT NULL,
    [total_worker_time_ms] [int] NOT NULL,
    [min_worker_time_ms] [int] NOT NULL,
    [max_worker_time_ms] [int] NOT NULL,
    [total_logical_reads] [int] NOT NULL,
    [min_logical_reads] [int] NOT NULL,
    [max_logical_reads] [int] NOT NULL,
    [total_physical_reads] [int] NOT NULL,
    [min_physical_reads] [int] NOT NULL,
    [max_physical_reads] [int] NOT NULL,
    [total_logical_writes] [int] NOT NULL,
    [min_logical_writes] [int] NOT NULL,
    [max_logical_writes] [int] NOT NULL,
    [creation_time] [smalldatetime] NOT NULL,
    [last_execution_time] [smalldatetime] NOT NULL,
) ON [PRIMARY]
GO

--Wide is bad, if you have a lot of nonclustered indexes.  
--This is a high-writes, low-reads table, so a single clustered index is probably all I want.
--With Interval_End being the first key field, they're inserted in order and eliminate mass fragmentation
  -- as well as bad write performance.
CREATE UNIQUE CLUSTERED INDEX IX_QueryStats_intervalend_sqlhandle_statementstartoffset_planhandle_U_C ON QueryStats
(
    interval_end
    , sql_handle
    , statement_start_offset
    , plan_handle
)

CREATE TABLE [dbo].[QueryStatsLast](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryStatsLast_sqlhandle_planhandle_statementstartoffset_U_C] ON [dbo].[QueryStatsLast] 
(
    [sql_handle] ASC,
    [plan_handle] ASC,
    [statement_start_offset] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

CREATE TABLE [dbo].[QueryText](
    [sql_handle] [varbinary](64) NOT NULL,
    [QueryText] [nvarchar](max) NOT NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [objtype] [nvarchar](20) NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryText_sqlhandle_U_C] ON [dbo].[QueryText] 
(
    [sql_handle] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Now that the tables are created, here’s how I populate them.  Set this up to run once every 4 hours to keep the data small during typical running, and kick it up to once an hour or more if needed to pinpoint an issue if you need it.

DECLARE @interval_start smalldatetime
    , @interval_end smalldatetime

IF OBJECT_ID('tempdb..#QS') IS NOT NULL BEGIN
    DROP TABLE #QS 
END 

CREATE TABLE [dbo].[#QS](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) 
/*
--This sounded like a great idea, but it just slowed it down several seconds.
CREATE UNIQUE CLUSTERED INDEX TempQS_Cluster ON #QS  
(
    sql_handle
    , plan_handle
    , statement_start_offset
)
*/

--The Cached Plans Object Type is in here in case you want to treat ad-hoc or prepared statements differently
INSERT INTO #QS 
SELECT qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , cp.objtype
    , qs.execution_count 
    , total_elapsed_time_ms = qs.total_elapsed_time/1000
    , min_elapsed_time_ms = qs.min_elapsed_time/1000
    , max_elapsed_time_ms = qs.max_elapsed_time/1000 
    , total_worker_time_ms = qs.total_worker_time/1000 
    , min_worker_time_ms = qs.min_worker_time/1000
    , max_worker_time_ms = qs.max_worker_time/1000
    , qs.total_logical_reads
    , qs.min_logical_reads
    , qs.max_logical_reads
    , qs.total_physical_reads
    , qs.min_physical_reads 
    , qs.max_physical_reads
    , qs.total_logical_writes 
    , qs.min_logical_writes 
    , qs.max_logical_writes
    , qs.creation_time
    , qs.last_execution_time
    , DateAdded = getDate()
FROM sys.dm_exec_query_stats AS qs
    INNER JOIN sys.dm_exec_cached_plans cp on qs.plan_handle = cp.plan_handle
--WHERE cp.objtype NOT IN ('Adhoc')

INSERT INTO QueryText (sql_handle, QueryText, DatabaseName, objtype)
SELECT QS.sql_handle
    , QueryText = qt.text
    , DatabaseName = DB_NAME(max(qt.dbid))
    , max(QS.objtype)
FROM (SELECT #QS.sql_handle
        , #QS.objtype  
        FROM #QS
            LEFT JOIN QueryText QST ON #QS.sql_handle = QST.sql_handle
        WHERE QST.sql_handle IS NULL) QS
    CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) qt
GROUP BY QS.sql_handle
    , qt.text

--All the values are the same in each table
SELECT TOP 1 @interval_start = dateadded FROM QueryStatsLast
SELECT TOP 1 @interval_end = dateadded FROM #QS

IF @interval_start IS NULL BEGIN
    SELECT @interval_start = create_date
    FROM sys.databases
    WHERE name = 'tempdb'
END

INSERT INTO QueryStats (interval_start
    , interval_end
    , sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time)
SELECT @interval_start
    , @interval_end
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.execution_count - ISNULL(qsl.execution_count, 0)
    , qs.total_elapsed_time_ms - ISNULL(qsl.total_elapsed_time_ms, 0)
    , qs.min_elapsed_time_ms 
    , qs.max_elapsed_time_ms 
    , qs.total_worker_time_ms - ISNULL(qsl.total_worker_time_ms, 0)
    , qs.min_worker_time_ms
    , qs.max_worker_time_ms 
    , qs.total_logical_reads - ISNULL(qsl.total_logical_reads, 0)
    , qs.min_logical_reads 
    , qs.max_logical_reads 
    , qs.total_physical_reads - ISNULL(qsl.total_physical_reads, 0)
    , qs.min_physical_reads 
    , qs.max_physical_reads 
    , qs.total_logical_writes - ISNULL(qsl.total_logical_writes, 0)
    , qs.min_logical_writes 
    , qs.max_logical_writes 
    , qs.creation_time
    , qs.last_execution_time
FROM #QS qs
    LEFT OUTER JOIN QueryStatsLast qsl ON qs.sql_handle = qsl.sql_handle
        AND qs.plan_handle = qsl.plan_handle
        AND qs.statement_start_offset = qsl.statement_start_offset
        AND qs.creation_time = qsl.creation_time 
WHERE qs.execution_count - ISNULL(qsl.execution_count, 0) > 0 --Only keep records for when it was executed in that interval

TRUNCATE TABLE QueryStatsLast

INSERT INTO QueryStatsLast
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , objtype
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time
    , DateAdded
FROM #QS

SQL Saturday 250 – Pittsburgh – Advice for My Presentation on Monitoring and Baselines

I’ll be presenting at SQL Saturday 250 in Pittsburgh on Baselining and Monitoring. This is my first public presentation, and I could use some advice to make sure my presentation is the best it can possibly be for both myself and the audience.

First, my presentation. Anyone who’s been following my blog won’t be surprised by anything they see. I’m going to hit all of the the statistics I watch and why I watch them. The whole thing is a pep rally for what, why, and how it will change your job. How to do it will be completely skipped for the reasons below.

  • Brent Ozar told me to in point 13.
  • Trying to take notes on the how would be too intense for any audience member.
  • What and why are more exicting, life’s too short for boring!!!
  • I already did the how, it’s on my blog, and I’ll tell the audience all of the hows are already listed here in both my opening and closing statements.

The points I’m hitting are:

  • Traces (with mention of Extended Events)
  • Wait Stats (sys.dm_os_wait_stats)
  • Blocking (sys.dm_os_waiting_tasks with LCK% type)
  • Query Stats (sys.dm_exec_query_stats)
  • Index Stats (sys.dm_db_index_usage_stats)
  • OS Perf Counters (sys.dm_os_performance_counters)

In all of these areas I’ll go through why I watch these and how monitoring them over looking at the values you can get at any point-in-time has changed my life as a DBA. After going through each section I would like to ask the audience for questions in the middle of the presentation so the topic of the questions is fresh in everyone’s minds and more likely to be taken to heart. How you capture these will be completely documented on my blog to the point that you can practically copy/paste the code to develop a customized monitoring setup on your servers from one source.

Baselining is built directly into my monitoring and how I use my results. Instead of having a static baseline that I took on a certain day, I’m a huge fan of comparing today’s incidents to yesterday and last week with a rolling baseline. I find it to be an impressive way to handle things.

The slide deck still needs a bit more work, but it’s literally just going to be a couple words (Steve Jobs style!!!) with the information coming from what I say instead of the audience’s attention being divided between me and the screen behind me. They’ll see the topic on the screen with a picture of what it means to me while hearing my views.

This presentation is almost complete on my end, but I’m flexible enough to say I’m wrong on anything and make changes. Some questions I have for your are:

  • Is this a presentation you would enjoy seeing? Why or why not?
  • Do you feel this is a complete list of items to monitor? What would you change?
  • Are questions in the middle of a presentation a good idea?
  • How do I make this public, both assisting the audience as much as possible while keeping in mind my rights and possible future use of this presentation? I would consider posting the full thing on the internet, even a recording of me doing it, but I don’t want to think about it a year from now and think that I just shot myself in the foot doing it. I don’t know what to expect, thus the call for help.
  • Anything else you have to add will be greatly appreciated. Adding that information before September 14th would raise the level of my appreciation! 🙂

I’m extremely motivated to do this, and I want to absolutely kill this presentation! The only way to hit my expectations is to prep for it, and the best way to prep for it is to take in as much positive and negative criticism before it goes live as I possibly can.

Thank you for taking the time to read this!
Steve

Fixing Page Life Expectancy (PLE)

What is Page Life Expectancy (PLE), what makes it drop, and how can I manage memory better? Abusing disks slows many database servers, and it’s often something you can fix with tuning and not spending extra money on better disks or more memory. It takes a very long post to get through all of that, but if you stick with me through this then you’ll be looking at your servers from new angles with an effort that will be noticed by the Sys Admins, SAN Admin, the users, and your boss.

Before we being, there are some ground rules we need to get out of the way defining PLE and understanding there are external memory factors. There’s no shame in skipping that and going straight to the focus of this post. Personally, I find the next two sections a little dry, but I’m also my harshest critic!

Define PLE

Before we get too deep into it, lets make sure we’re on the same page on a couple things.

Page Life Expectancy is the number of seconds the average page of data has been in the buffer pool.  Keeping the data in memory gives SQL Server quicker access to it instead of making the long, slow trip to disk.  While none of the counters in SQL Server means everything on its own, this one can open your eyes and lead you towards issues that can be resolved.

Keep in mind that SQL reads the data pages from the buffer pool, always from the buffer pool. If the data you need isn’t there then SQL Server does a physical read to put it there. After that’s done it will do a logical read to use the page that’s now in memory. If you want to dive into the detail you can do it here, specifically with the reading pages and writing pages links on that page.

That physical read is going to disk, the slowest part of your server, to read the page(s) from disk, be processed by the CPU, possibly decrypting it if you’re using TDE, then placing it into memory. This makes PLE critical because, even if you ignore the extra load you just placed on the I/O system and CPU, you’re waiting for an additional, slower action to take place.

The speed of your disk does matter, but it also doesn’t really matter.  This is an EXTRA step with SQL Server doing a physical read THEN a logical read, not instead of a logical read.  Also, your persisted storage is the slowest part of your server, be it spinning disks, SSD, flash, or anything else.  The expensive stuff just isn’t AS slow.

You can find your page life expectancy in sys.dm_os_performance_counters using my post on OS Perf Counters. That post will also help you realize how much load you take off of your disks by raising PLE, then you’re really going to start to understand the burning drive picture you find there.  While it’s great that I have the code out there to get this yourself, the tracking of this counter should be coming from your monitoring software.

The problem is that many people see the 300 value for Page Life Expectancy you can still find documented is wrong, very wrong. If you have a server with 64 GB of memory with 56 GB allocated to SQL Server, that means you’re reading about 56 GB of data from disk every 300 seconds. If you look at Page 36 of Troubleshooting SQL Server – A Guide for the Accidental DBA by Jonathan Kehayias and Ted Krueger you’ll see an actual scalable version of this; PLE should be 300 for every 4 GB of RAM on your server. That means for 64 GB of memory you should be looking at closer to 4,800 as what you should view as a critical point.

The reason you see 300 written everywhere is because we were limited by 32-bit servers for so long while at the same time SQL Server documentation was really being developed well.  Those servers could only have 4 GB of memory, thus making this number make sense as well as the Buffer Pool / 4 GB * 300 formula.  I’ll go so far as to say that you should be cautious of anything that doesn’t have a formula to it because our servers keep getting faster, our databases keep getting bigger, and static values…don’t.

Jonathan Kehayias also put this in his blog post Finding What Queries in the Plan Cache Use a Specific Index, and that had a great series of comments going back and forth between him and Brent Ozar. Brent is discussing not putting so much faith in PLE and watching wait stats instead as different speed I/O systems can greatly sway the impact of low PLE or even make it so a high PLE just isn’t high enough. They’re both right, and if you can understand the intent of each one then you’ll be in a much more comfortable place when tuning your servers.

If you force me to take a side on that post, Jonathan wins out over Brent here. This is because Jonathan is talking about having a higher PLE, which in turn reduces load on the disks and typically reduces the amount of work that needs to be done to execute a query. Brent is basically implying that money may have already be thrown at the issue, which negates the current symptoms on what could be a less scalable solution. However, you can’t discount Brent’s logic here as the things that are limiting your server’s performance at this time lie in the Wait Stats, not this counter.

Quick Rant: We’re Not Alone

So, you want higher PLE to show that you’re using your resources better, and the only way PLE goes up is by leaving data in the data cache once it’s there. The problem is, there’s a lot going on that wants to work the other way. Even if you have 64 GB of memory dedicated to SQL Server to host a single 64 GB database it won’t all fit in memory, that’s because other things want to play as well.

First, your memory isn’t completely dedicated to the buffer pool. Memory pressure from the OS or VMs changing size can drop the size of the buffer pool has unless you’re using “lock pages in memory”, which is not recommended in most situations especially on VMs. If you’re not on a completely dedicated box, which means a physical box with no apps, no GUI, no RDP sessions, no SQL Server, no network card driver, etc., then this can happen. Wait a minute…you can’t uninstall SQL Server to make sure SQL Server’s buffer pool remains untouched, yet SQL Server uses memory outside of this area for any purpose that isn’t specifically assigned to the buffer pool and that memory usage can get out of control. If there’s memory pressure then some pages are forced out of data cache to drop the size of the buffer pool, leaving you open to more contention and a lower PLE.

Second, your buffer pool isn’t dedicated to your data cache as it also houses your proc cache, which are execution plans stored to be run again. These plans take a lot of CPU to create and reusing them saves you a lot of resources, so you can’t complain that you’re sharing space here. SQL Server determines the amount of the buffer pool to assign to each, and it’s possible for either side to have unnecessary data in it. I’m going to focus on the data cache here, so I’ll defer the abuse of proc cache conversation to Kimberly Tripp. There’s more to it than just her post, but this is a great place to start understanding what’s there and not in use.

Focus, Focus, Focus

Ok, so we know what PLE is and that external factors can affect it, but that’s not really why you’re here, is it? Are you here because you have your data cache and you want to manage it to keep your PLE going up…maybe a little. My guess is that you’re here because you heard that PLE is critical to your server’s performance and you want to increase performance by raising this counter the right way. You do that in two ways.

First, make sure you have as much room as possible. If your server can handle more memory it’s often cheaper to buy that much memory than it is to intervene on the database side. Tuning and purging scales well, will help a lot more than just memory management, and is the best thing you can do given the time budgeted to it, but it just isn’t as quick and easy as a memory upgrade in many cases. Even if you buy memory now and tune later, it will continue to help. 64 GB of memory (it’s 128 GB max on Standard Edition now), as Brent Ozar pointed out here, is only a couple hundred dollars here. FYI, I absolutely hate throwing money at issues, and I still listed this first.  Partially because it’s an easy fix to an often neglected issue, and partially because every fix is throwing money at the problem because your time isn’t worthless.

Second, keep as little in memory as possible to make efficient use of the space you have. Yeah, it’s that easy. Well, at least it’s that easy to say, but doing it takes a little more work. There are a list of ways you can trim down on the space you need.

  • Drop Unused Indexes
  • Merge Duplicate Indexes
  • Use Your Indexes – SARGability
  • Watch for Big Queries
  • Look in Your Proc Cache for Opportunities
  • Know What’s in Your Buffer Pool
  • Index Maintenance – Defrag
  • Index Maintenance – Statistics
  • Purge Your Data
  • Other

Through all of this, please remember that your initial thought may be to raise a counter, but is that your real goal? I’m betting you want your server to run faster, and PLE is merely one counter that helps put a value on performance. Servers with high PLE can run horrible, they just don’t need to keep running back to disk. Servers with low PLE can run great, they just need to go back to disk to get what they need.

A little bit ago I mentioned buying more memory, this will raise your PLE without making any queries really run more efficiently. Sure, they’re more likely to have their data in cache which is great, but they’re doing just as many logical reads and using just as much CPU to do their joins. The rest of this post isn’t like that, the rest is making queries more efficient in a way that just happens to make your PLE go up.

Drop Unused Indexes

This part’s tricky. SQL 2005-2008 R2 told you how much indexes were used since the last time SQL Server was started in the DMV dm_db_index_usage_stats. It’s still there in SQL 2012 and beyond, but these statistics get reset when you rebuild an index now. That means that best case scenario on a server that’s patched monthly, you have a month’s worth of data to go off of and you can’t even rely on that being the whole picture anymore. I get into more about how to keep track of this over time in my post Indexes – Unused and Duplicates.

This has a minor implication on PLE because as the pages of your unused indexes are updated SQL Server has to read part of the B+tree into memory to find the page, then it has to read the page to be updated into memory, it updates it in memory and marks it as a dirty page, then eventually writes it back to disk. The key here being that pieces are read into memory, AKA the buffer pool, the part of SQL Server you’re trying to clean up.

Merge Duplicated Indexes

This one’s not as tricky, but the performance issues are much worse. The problem is that you have two indexes, we’ll call them ix_A and ix_B, that are very similar. They have the same first two key fields, and stray a little after that. The differences mean that ix_A will be better for one query and ix_B will be better for another, and that means SQL Server will be dragging them both into cache. Sure, if you take the columns that are in ix_B that aren’t in ix_A and include them in ix_A then drop ix_B then the queries that used to use ix_B will probably need to do a little more work, but you’ll end up with a more efficient use of the cache. Your typical testing will show you the worst case scenario here, because you’ll run it each way multiple times and see what it’s like when the indexes are completely in cache. However, the real world is more friendly than that (that’s a first), and the query that used to use ix_B may actually run faster because now it’s using an index that’s more likely to be in cache, cutting down on physical reads. Going to the same blog post, Indexes – Unused and Duplicates, you can see details on how to help relieve this problem.

There’s a bit of an art to this at first, but you’ll get it down to a science with practice.

Use Your Indexes – SARGability

SARGability (Search ARGument capable) is it’s own topic, and it deserves at least one post of its own. You need to understand how indexes work, as stated in my post Indexing Fundamentals, and make sure that your queries can take advantage of indexes. If you look in a phone book for people with the last name like ‘%ood’ it’s no good because you need the first letter of the last name to take advantage of that index. If you look up people with the first name ‘Steve’ then that’s useless because the first key field was last name. Doing functions, implicit conversions, and many other techniques can cause you to need to scan an entire index.

In terms of PLE, that means you read the entire index into memory instead of just the pieces you need. That can add up to a lot of data being kicked out of cache to make room for a lot of physical reads from disk.

This isn’t an easy topic, and I’m not sure of any resources that say everything that needs to be said on the topic. While I have plans to write a post on it and will update this one when I do, it’s still an unwritten post. Until that time comes, Google “SQL Server SARGable” and see what pops up.

If you know how to read execution plans, there are two sections on a seek, scan, or lookup that you need to know here. Predicate and Seek Predicate. Predicate is what it had to scan for, while a seek predicate is what it could find efficiently using the fact that the key fields are sorted.

Watch for Big Queries

It doesn’t matter if you use Extended Events or Profiler Traces, you need to know when large queries are being run on your servers. If you want to watch this at a statement level then you’ll want to make the jump to Extended Events, but I’ve never seen it hurt to watch SQL Batch Completed and RPC Completed filtered only by duration over 10 seconds.

The point of this when it comes to PLE is that the queries that aren’t SARGable or don’t have an index to take advantage of will be doing a lot of scans. The bigger the scan the longer the query will take to run, and if you care about PLE then you’ll know what queries those are. Sure, some are legitimately doing enough work where they’ll always take that long to run, but most of the queries that take a while can be tuned. Be it changes to the query itself (should be your first instinct) or indexing changes, there’s often a way to have a query require less data in cache.

This is an extremely useful technique to use in correlation with watching for drops in PLE. If it drops then check to see what was running at the time. This shows you what queries were running, which would be most of your issues, but it could be outside of the query itself and actually be something like a large auto update statistics task being kicked off that you won’t see here.

Know What’s in Your Buffer Pool

Very few people do this, yet it’s so obvious once you start. You want to raise PLE, right? PLE is a measurement of how long stuff stays in your buffer pool, and drops because something else needed to be put in there. So, what’s in there that’s taking up all your space and how’d it get there? I felt like a fool that it took me years to come up with that questions, but then I realized that most people never ask it at all.

It was actually one of my biggest tuning revelations since I read Grant Fritchey’s Execution Plans book. And you always know it’s a great revelation when you find yourself yelling at yourself for not realizing it earlier…years earlier.

So, how do you do this? Query your cache to find the indexes that take up the most space there, pick one that is an excessive amount of space or doesn’t look like it belongs there, query the proc cache to find out where that index is used, and tune that query either through code changes or indexing changes. The scripts to do this and a more detailed description of the process can be found in my post Cleaning up the Buffer Pool to Increase PLE, and a newer post of mine, Query the Buffer Pool, has an even better script to see what’s in cache across all databases on the instance at once in a more efficient query.

Note that if your PLE is low then what’s in your buffer pool will be changing quickly.  That does NOT mean the scripts in the posts I just mentioned are useless, it means you’ll come up with new opportunities each time you run them.

Look In Your Proc Cache for Opportunities

Lets start by saying this isn’t perfect. These numbers get reset throughout the day and some queries never make it in here at all. That’s saying this isn’t an all-inclusive, one-stop shop. It is in no way saying that you can’t make amazing improvements on the queries you find here.

There are two way you can use the cache. Most people know that you can get your most expensive queries such as on the MSDN sys.dm_exec_query_stats page, and you can use your imagination to sort this by any of the counters available here. That will give you an idea of what needs tuning, if you needed somewhere to start. I love taking this a step further and monitoring that exact information so I have a historical view instead of just what happens to be in cache now. You can read all about that in my post Query Stats.

The other way is parsing through the XML in the plans to find specific items. I heard that grunt when I mentioned XML, and you’re not alone. Not many of us are good at XML; Jonathan Kehayias is an extreme exception here. He gives you his queries so you can run with it, do what he does, and alter it to go even further. This is exactly what I did in my post Cleaning up the Buffer Pool to Increase PLE that I just mentioned in the last section.

Index Maintenance – Defrag

Many people only thing about defragmenting their indexes to help get contiguous reads on their disk, but that’s only half the story. In fact, the better you do everything in this post the less you’ll have to worry about your disks.

However, you also have to keep in mind how full each page of your indexes is. If you have an index page that’s 100% full then you have a full 8kb of data there. Add another row and you now have 8.1kb of data that’s split between two 8kb pages, so you’re wasting almost 50%. 50% isn’t even your worst case scenario because pages are never automatically removed or merged if records are deleted unless you delete every row stored in that page. Paul Randal’s post Performance Issues From Wasted Buffer Pool Memory takes a deeper look at this problem, and Ola Hallengren’s scripts can help you with a solution with trusted scripts to clean up fragmentation.

Index Maintenance – Statistics

It may seem obvious to do your index defrag job off hours. While I’ve seen servers without the job, I’ve never seen one scheduled for mid-day. That’s a good thing, because a lot of data needs to be pulled into cache to play around with indexes like that.

However, do you have auto update statistics turned on? Do NOT turn it off because of this, but understand that it updates the statistics by reading about 1% of the index into cache. If that’s a 100 GB index then you’re reading 1 GB of data which is a descent portion of your cache. To help avoid this, you should be updating your stats off hours, and Ola Hallengren’s scripts can help with this, too.

You need to update your stats because as more and more updates are made to a table the less accurate your stats are, making your execution plans less accurate. Once you update about 20% of the rows of a table since the last time your statistics were updated then they’ll automatically be updated again if you leave the default option turned on. Again, don’t go turning it off because of this post alone. Instead, schedule your stats to be updated off-hours and only the tables that have 20% of their rows updated throughout the day will get their stats updated automatically, and that will stop most auto-updates on the larger tables that would cause issues.

Purge Your Data

We talked a lot about what to do with the data you have, but do you need all of your data? Step away from SQL Server once and go talk to an accountant. Ask them where they have paper work from last month and they’ll point to a filing cabinet. Now, knowing they’re legally obligated to keep a lot of stuff for 7 years, ask them where a 5-year-old document is. It’s not right in front of them, but they know where to look for it. It’s not taking up valuable space in their office (read: not on the main production server), often not in the same building they’re in (read: not online), and it will take a bit to get to it. They no longer have an active business use for it, and they can get to it in the rare case that it’s needed.

Ask them where something is that’s 8 years old, and they’ll find a friendly way to tell you that they bought an OCD pyromaniac a pair of scissors and a book of matches. We’re more eloquent with deleting things in batches, but their way sounds like a lot more fun.

If they kept everything in one room would you call that room a cluttered, unmanageable mess? If they kept historical records forever, would you think they were being obsessive and wasteful with purchasing storage?

So, can you pull up a detailed sales report for February 29th, 2000 from Prod for me?

Shrink TempDB

This one’s an eye opener, and one that I believe is a bug enough where I opened a connect item on the issue.  The larger your TempDB data files the more space you’ll find in your buffer pool used by unallocated pages in TempDB.

I go into details in my post TempDB Memory Leak?, but here are the basics.  TempDB can use memory up to the size of the data files, not just the size of the used space in the data files.  The only workaround I know of right now is to make the size of the data files smaller.

I’m not talking about making TempDB tiny and letting it grow, there’s no excuse for that.  Look at your monitoring software to see how large TempDB has to be to avoid hitting autogrowth outside of accidental issues, then size TempDB about 20% larger than that.  I specifically avoid the best practice of presizing TempDB to fill a dedicated drive for this reason alone.

Other…Am I Giving Up?

There are too many things to list in one blog post. You could write a book on this subject. I listed what I feel will help you the most, but I also wanted to take time to let you know that my list isn’t some magical, all-inclusive, everything you can do to make your servers run better. It’s a start. Hopefully a good one you found to be productive, but a mere start no matter how you look at it.

Speaking of “you could write a book on the subject”, well, they did. A lot of them did. Some even made them free PDFs. Not bootleg bit-torrent copies, but actual real, legal, free PDFs made possible by Red Gate in their book selection on SQL Server Central

Some books you have to buy, if that’s your thing:

Also, I’m not the only person blogging about this topic:

Results

If you did all of this and there’s something running on your servers to trend performance, you’ll notice:

  • Page Life Expectancy: Raised significantly, but you saw that coming
  • Page Reads/sec: (Physical Reads) Dropped because we’re not cycling data in and out of cache as much
  • % Processor Time: Dropped due to lower I/O, more efficient queries, less pressure on procedure cache, etc.
  • Critical query execution time: Typically less due to less chance of waiting for physical reads and lower CPU stress.
  • PageIOLatch wait types: Dropped due to fewer physical reads leading to fewer waits on physical reads.
  • DBA Pay Rate: If this isn’t on the list, try using a chart of the above counters

Keep Reading

The largest part of memory management in SQL Server is indexing.  This can be changing the indexes themselves or how your queries interact with them.  Doing either requires a great understanding of what indexes are and how they work.  I’ve written several posts on the topic and recently added them to my Indexing page to help you browse them easier.

Sys.dm_os_performance_counters Demystified

The dmv sys.dm_os_performance_counters is awesome, if you can understand it. This is how I make it easy to read and use. Here are the values I watch and why I watch them. My list isn’t going to be perfect and you’re not going to agree with it 100%, and I’m ok with that. First, there is no perfect. Second, if we agree 100% then one of us is just mindlessly following the other which isn’t good.

  • Cache Hit Ratio – I ignore this value, but I still monitor it. I will, ideally, never be the only DBA on a team again, and everyone seems to think this value is cool.
  • Page Life Exp – My favorite! When you read a page from disk into memory how many seconds will it stay there? Just don’t use the outdated “300” rule or your disks will catch on fire!!!

    Burning Drive!!!

    PLE = 300

  • Page Lookups/Sec – How many pages are read from memory.
  • Page Reads/Sec – How many pages are read from disk.
  • Page Writes/Sec – How many pages are written to disk.
  • Lazy Writes/sec – How many pages are written to disk outside of a checkpoint due to memory pressure.
  • Batch Requests/sec – How busy is the server?
  • Trans/sec – How busy is the server?
  • Total Server Memory – How much memory SQL Server is currently using. Typically ramps up to Target value and PLE is low as it ramps up since new pages are in memory dropping the average.
  • Target Server Memory – How much memory SQL Server is allowed to use. Should be the same as the max memory setting, but memory pressure can cause this to decrease.
  • Memory Grants Pending – How many processes aren’t able to get enough memory to run. Should be 0, always 0, if not then find out why.
  • Deadlocks – How many deadlocks are we getting. Most apps handle deadlocks gracefully, but they still lose time doing it. If this number starts going up, start looking into it.
  • SQL Compilations/sec – This is a hidden performance killer! Some queries can’t be cached so they’re compiled every time they’re run. I’ve seen this with a query being run once a second and a big server was running slower than my laptop. It’s normal for things to compile throughout the day, it’s not normal for this number to be 10x higher than before that last upgrade.
  • SQL Re-Compliations/sec – Same goes here. The counters aren’t that much different.

If you know a little about this DMV then you know these values are cryptic. There’s several ways this data is stored and it has to be retrieved differently for each type to be useful. Then many of these are cumulative since the server was restarted, which isn’t going to help too much. Even worse, MSDN failed us on this one and figuring out this DMV required help outside of that site. Now for the good news, the script below will take care of all of that for you and leave you with some easy reading with values you can filter to the time periods you care about. If you want to add some of your own counters then just follow my lead on one that has the same cntr_type, or you can go to Rabin’s blog post that I learned from.

IF object_id('tempdb..#OSPC') IS NOT NULL BEGIN
    DROP TABLE #OSPC
END

DECLARE @FirstCollectionTime DateTime
    , @SecondCollectionTime DateTime
    , @NumberOfSeconds Int
    , @BatchRequests Float
    , @LazyWrites Float
    , @Deadlocks BigInt
    , @PageLookups Float
    , @PageReads Float
    , @PageWrites Float
    , @SQLCompilations Float
    , @SQLRecompilations Float
    , @Transactions Float

DECLARE @CounterPrefix NVARCHAR(30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
                            THEN 'SQLServer:'
                        ELSE 'MSSQL$' + @@SERVICENAME + ':'
                        END

--Grab the current values from dm_os_performance_counters
--Doesn't do anything by instance or database because this is good enough and works unaltered in all envirornments
SELECT counter_name, cntr_value--, cntr_type --I considered dynamically doing each counter type, but decided manual was better in this case
INTO #OSPC 
FROM sys.dm_os_performance_counters 
WHERE object_name like @CounterPrefix + '%'
    AND instance_name IN ('', '_Total')
    AND counter_name IN ( N'Batch Requests/sec'
                        , N'Buffer cache hit ratio'
                        , N'Buffer cache hit ratio base'
                        , N'Free Pages'
                        , N'Lazy Writes/sec'
                        , N'Memory Grants Pending'
                        , N'Number of Deadlocks/sec'
                        , N'Page life expectancy'
                        , N'Page Lookups/Sec'
                        , N'Page Reads/Sec'
                        , N'Page Writes/Sec'
                        , N'SQL Compilations/sec'
                        , N'SQL Re-Compilations/sec'
                        , N'Target Server Memory (KB)'
                        , N'Total Server Memory (KB)'
                        , N'Transactions/sec')

--Just collected the second batch in the query above
SELECT @SecondCollectionTime = GetDate()

--Grab the most recent values, if they are appropriate (no reboot since grabbing them last)
SELECT @FirstCollectionTime = DateAdded
    , @BatchRequests = BatchRequests
    , @LazyWrites = LazyWrites
    , @Deadlocks = Deadlocks
    , @PageLookups = PageLookups
    , @PageReads = PageReads
    , @PageWrites = PageWrites
    , @SQLCompilations = SQLCompilations
    , @SQLRecompilations = SQLRecompilations
    , @Transactions = Transactions
FROM OSPerfCountersLast 
WHERE DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'TempDB')

--If there was a reboot then all these values would have been 0 at the time the server came online (AKA: TempDB's create date)
SELECT @FirstCollectionTime = ISNULL(@FirstCollectionTime, (SELECT create_date FROM sys.databases WHERE name = 'TempDB'))
    , @BatchRequests = ISNULL(@BatchRequests, 0)
    , @LazyWrites = ISNULL(@LazyWrites, 0)
    , @Deadlocks = ISNULL(@Deadlocks, 0)
    , @PageLookups = ISNULL(@PageLookups, 0)
    , @PageReads = ISNULL(@PageReads, 0)
    , @PageWrites = ISNULL(@PageWrites, 0)
    , @SQLCompilations = ISNULL(@SQLCompilations, 0)
    , @SQLRecompilations = ISNULL(@SQLRecompilations, 0)
    , @Transactions = ISNULL(@Transactions, 0)

SELECT @NumberOfSeconds = DATEDIFF(ss, @FirstCollectionTime, @SecondCollectionTime)

--I put these in alphabetical order by counter_name, not column name.  It looks a bit odd, but makes sense to me
--Deadlocks are odd here.  I keep track of the number of deadlocks in the time period, not average number of deadlocks per second.
  --AKA, I keep track of things the way I would refer to them when I talk to someone.  "We had 2 deadlocks in the last 5 minutes", not "We averaged .00002 deadlocks per second there"
INSERT INTO OSPerfCounters (DateAdded, Batch_Requests_Sec, Cache_Hit_Ratio, Free_Pages, Lazy_Writes_Sec, Memory_Grants_Pending
    , Deadlocks, Page_Life_Exp, Page_Lookups_Sec, Page_Reads_Sec, Page_Writes_Sec, SQL_Compilations_Sec, SQL_Recompilations_Sec
    , ServerMemoryTarget_KB, ServerMemoryTotal_KB, Transactions_Sec)
SELECT @SecondCollectionTime
    , Batch_Request_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec') - @BatchRequests) / @NumberOfSeconds
    , Cache_Hit_Ratio        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio')/(SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base')
    , Free_Pages            = (SELECT cntr_value FROM #OSPC WHERE counter_name =N'Free pages')
    , Lazy_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec') - @LazyWrites) / @NumberOfSeconds
    , Memory_Grants_Pending    = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Memory Grants Pending')
    , Deadlocks                = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec') - @Deadlocks) 
    , Page_Life_Exp         = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page life expectancy')
    , Page_Lookups_Sec      = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec') - @PageLookups) / @NumberOfSeconds
    , Page_Reads_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
    , Page_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec') - @PageWrites) / @NumberOfSeconds
    , SQL_Compilations_Sec  = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec') - @SQLCompilations) / @NumberOfSeconds
    , SQL_Recompilations_Sec= ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec') - @SQLRecompilations) / @NumberOfSeconds
    , ServerMemoryTarget_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Target Server Memory (KB)')
    , ServerMemoryTotal_KB  = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Total Server Memory (KB)')
    , Transactions_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec') - @Transactions) / @NumberOfSeconds

TRUNCATE TABLE OSPerfCountersLast

--Note, only saving the last value for ones that are done per second.
INSERT INTO OSPerfCountersLast(DateAdded, BatchRequests, LazyWrites, Deadlocks, PageLookups, PageReads
    , PageWrites, SQLCompilations, SQLRecompilations, Transactions)
SELECT DateAdded            = @SecondCollectionTime
    , BatchRequests            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec')
    , LazyWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec')
    , Deadlocks             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec')
    , PageLookups            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec')
    , PageReads                = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec')
    , PageWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec')
    , SQLCompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec')
    , SQLRecompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec')
    , Transactions             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec')

DROP TABLE #OSPC

Throw that code above here in a proc, schedule it to run every so often (I like 5 minutes) and it’ll….fail. It kinda relies on a couple tables you should create first. Here ya go.

CREATE TABLE OSPerfCounters(
    DateAdded datetime NOT NULL
    , Batch_Requests_Sec int NOT NULL
    , Cache_Hit_Ratio float NOT NULL
    , Free_Pages int NOT NULL
    , Lazy_Writes_Sec int NOT NULL
    , Memory_Grants_Pending int NOT NULL
    , Deadlocks int NOT NULL
    , Page_Life_Exp int NOT NULL
    , Page_Lookups_Sec int NOT NULL
    , Page_Reads_Sec int NOT NULL
    , Page_Writes_Sec int NOT NULL
    , SQL_Compilations_Sec int NOT NULL
    , SQL_Recompilations_Sec int NOT NULL
    , ServerMemoryTarget_KB int NOT NULL
    , ServerMemoryTotal_KB int NOT NULL
    , Transactions_Sec int NOT NULL
)

--You'll typically only query this by one value, which is added sequentually.  No page splits!!!
CREATE UNIQUE CLUSTERED INDEX IX_OSPerfCounters_DateAdded_U_C ON OSPerfCounters
(
    DateAdded
) WITH (FillFactor = 100)

--Only holds one value at a time, indexes are a waste
CREATE TABLE OSPerfCountersLast(
    DateAdded datetime NOT NULL
    , BatchRequests bigint NOT NULL
    , LazyWrites bigint NOT NULL
    , Deadlocks bigint NOT NULL
    , PageLookups bigint NOT NULL
    , PageReads bigint NOT NULL
    , PageWrites bigint NOT NULL
    , SQLCompilations bigint NOT NULL
    , SQLRecompilations bigint NOT NULL
    , Transactions bigint NOT NULL
)

The important part of all this is how you use it. It’s tempting to just look at the last 7 records and say that you know what’s going on, that makes me want to slap you. Every server is different, every server has different loads and baselines, and you’re either underworked or you don’t know what those baselines are for every server you manage. I do simple baselines every time I look at an incident and look at the last hour, the same time yesterday, and the same time a week ago. That gives you a chance to see what’s normal for this server and what’s different right now. This query is so simple you’ll wonder why I even posted it, but it’s effective which is why it’s here. The 7 records per day thing, that’s because 21 records show up on my screen without me scrolling, it is NOT a magic number!

SELECT 'Today', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
ORDER BY dateadded DESC
) X

UNION
SELECT 'Yesterday', * FROM (SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-1
ORDER BY dateadded DESC
) Y

UNION
SELECT 'Last Week', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-7
ORDER BY dateadded DESC) Z

ORDER BY dateadded DESC

And, well, something I’ve been skipping on my posts and telling people to handle cleanup on their own…. Here’s step 2 of my jobs that populate my monitoring tables to keep your data from being the ever-growing data you’re struggling with in every other app. I delete in batches according to the clustered index. It’s overkill for something deleting one row at a time, or, even if you put this in a separate daily job, 288 rows if the process is scheduled every 5 minutes. So, why the batches? Because I copy/paste my own code everywhere, batches is reusable, and this is how I chop off the tail end of EVERYTHING!

SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM OSPerfCounters
    where dateadded < (GetDate() - 400)
END 

In the beginning I mentioned that if you agreed with me 100% then one of us is a mindless monkey. Look, I put this out there first, so I’m obviously not the mindless monkey here, am I? There’s a box below that gives you a chance to show that you’re not a mindless monkey either! Tell me I’m wrong, how I can do better, and how everyone else reading this can benefit from it even more! I’ll promote you from mindless monkey to talking monkey!

Deadlock – A quick, easy view

Deadlocks are rough to work with. Here are the scripts I use to capture deadlocks, find which ones are reoccurring, and view them along with a couple free eBooks to resolve them.

You could have thousands of deadlocks and it would take you forever to find out which ones are reoccurring or which ones are some freak accident with an annual process. I used to read through deadlock graphs one-by-one to see what was reoccurring, and I used to be less satisfied with my job as well. If you’re going to bang your head on your desk making sure a deadlock never happens again, you’ll want to make sure it wasn’t a one-time event that wouldn’t have happened again anyways.

I’m not going to get into how to figure out deadlocks, that’s a chapter in a book more than it’s a blog post. Chapter 7 to be exact in the FREE eBook or $25 physical book for Accidental DBAs by Jonathan Kehayias and Ted Krueger. Also, a whole book on blocking by Kalen Delaney in a FREE eBook or $22 physical book. I have to admit that Ted Krueger isn’t a name I recognize. However, Jonathan and Kalen are easily on my list of top 5 best MVPs out there.

Anyways, back to me. Here’s what I use to look at deadlock traces. The base code was found on the internet years ago, and you can find it in several places now. Because of that I have no idea who the original author is, but I’d be glad to throw up credit here for anyone who can show me a site dated older than when I started using it.

There are two versions here, almost identical code in each. The first one summarizes the deadlocks and gives you a count of the occurrences. The second one shows each SPID involved in the deadlock separately along with the XML for the deadlock.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%')
--SET @Path = 'D:\Perflogs\PerfTraces\DeadlockTrace.trc'

;with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-06-17 00:00' AND '2013-06-30 00:00'
)
SELECT [Procedure], Inputbuffer, dMonth = DatePart(Month, DeadLockTime), dDay = DatePart(Day, DeadLockTime), DeadlockCount = Count(*), dMax = max(DeadlockTime), dMin = min(DeadlockTime)
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = Deadlock.Process.value('@clientapp', 'varchar(100)'),
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [HostPID] = Deadlock.Process.value('@hostpid', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
) X
--WHERE Victim = 0 
--WHERE TimeoutID IN (SELECT RowID FROM CTE WHERE [PROCEDURE] = 'master.dbo.FakeProcName')
--WHERE inputbuffer like '%select top 10%'
GROUP BY [procedure], Inputbuffer, DatePart(Month, DeadLockTime), DatePart(Day, DeadLockTime)
--ORDER BY TimeoutID 

Then there’s this to actually let me look at the XML. Although you can easily see in the script below which deadlocks reoccur back-to-back, it’s not so easy to see reoccurrence of deadlocks that happen once every morning between 2 and 3 AM. That’s where the first script shines, and you should use and abuse it.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%');

with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@Path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-03-08 02:00' AND '2013-03-08 04:00'
    AND StartTime > DATEADD(Hour, -2, GetDate())
)
SELECT *
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = CASE LEFT(Deadlock.Process.value('@clientapp', 'varchar(100)'), 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(Deadlock.Process.value('@clientapp', 'varchar(100)'),32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(Deadlock.Process.value('@clientapp', 'varchar(100)'), 67, len(Deadlock.Process.value('@clientapp', 'varchar(100)'))-67)
                    ELSE Deadlock.Process.value('@clientapp', 'varchar(100)')
                    END ,
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
-- ORDER BY [DeadlockObject], [ClientApp], [Procedure], [Code]
) X
--WHERE Victim = 0 
--WHERE [Procedure] = 'master.dbo.FakeProcedureName'
ORDER BY TimeoutID DESC

Finally, to make all of this happen you’ll need a deadlock trace. Don’t go thinking that I open profiler, make a deadlock trace, script it out, and run it every time I need one. Reusable code is awesome, and this is reused every time I see an alert come through that we are having too many deadlocks. I have this proc in our Perf database, which is on every SQL Server we manage. So you don’t have to dig through the numbers, it grabs every event for the deadlock graph and nothing else.

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_ServerSideTrace_Deadlock]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ServerSideTrace_Deadlock]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE procedure [dbo].[usp_ServerSideTrace_Deadlock]
    @FilePath nvarchar(1000) = N'D:\Perflogs\PerfTraces\DeadlockTrace',                                            
    @maxfilesize bigint = 25,
    @maxfiles int = 10,
    @TraceDBID bit=0,
    @DBID int=0
AS

declare @rc int
declare @TraceID int

--Create Trace
exec @rc = sp_trace_create @TraceID output, 2, @FilePath, @maxfilesize, NULL, @maxfiles
if (@rc != 0) 
Begin
    select ErrorCode=@rc    
    return
End

-- Set the RPC Completed and SQL Batch Completed events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 148, 11, @on
exec sp_trace_setevent @TraceID, 148, 12, @on
exec sp_trace_setevent @TraceID, 148, 14, @on
exec sp_trace_setevent @TraceID, 148, 1, @on
exec sp_trace_setevent @TraceID, 25, 15, @on
exec sp_trace_setevent @TraceID, 25, 8, @on
exec sp_trace_setevent @TraceID, 25, 32, @on
exec sp_trace_setevent @TraceID, 25, 56, @on
exec sp_trace_setevent @TraceID, 25, 1, @on
exec sp_trace_setevent @TraceID, 25, 9, @on
exec sp_trace_setevent @TraceID, 25, 57, @on
exec sp_trace_setevent @TraceID, 25, 2, @on
exec sp_trace_setevent @TraceID, 25, 10, @on
exec sp_trace_setevent @TraceID, 25, 3, @on
exec sp_trace_setevent @TraceID, 25, 11, @on
exec sp_trace_setevent @TraceID, 25, 12, @on
exec sp_trace_setevent @TraceID, 25, 13, @on
exec sp_trace_setevent @TraceID, 25, 6, @on
exec sp_trace_setevent @TraceID, 25, 14, @on
exec sp_trace_setevent @TraceID, 25, 22, @on
exec sp_trace_setevent @TraceID, 59, 32, @on
exec sp_trace_setevent @TraceID, 59, 56, @on
exec sp_trace_setevent @TraceID, 59, 1, @on
exec sp_trace_setevent @TraceID, 59, 57, @on
exec sp_trace_setevent @TraceID, 59, 2, @on
exec sp_trace_setevent @TraceID, 59, 14, @on
exec sp_trace_setevent @TraceID, 59, 22, @on
exec sp_trace_setevent @TraceID, 59, 3, @on
exec sp_trace_setevent @TraceID, 59, 12, @on

-- Set the Filters
--Default to exclude the trace from Tracing Itself
exec sp_trace_setfilter @TraceID, 10, 0, 7, N'SQL Server Profiler'

--Add other filters as specified by parameters
If @TraceDBID=1
    exec sp_trace_setfilter @TraceID, 3, 0, 0, @DBID

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
return @TraceID

GO

The basics of a deadlock is that process 1 gets a lock on data A while process 2 gets a lock on data B. Then process 1 says it needs a lock on data B to continue while process 2 needs a lock on data A to continue. It’s a simple basic concept, but it gets complex…too complex for me to cover it here and say I did it justice. Sometimes a process deadlocks on itself, doing a parallelism deadlock as demonstrated by Jason Strate. Solutions also vary from order of operations (lets make processes 1 and 2 both get data A first, then they can’t deadlock), tuning (if the locks are released quicker then there’s less time for deadlocks), reducing the number of transactions (the lock on data A was grabbed earlier in this transaction, and there’s no reason these need to be in a explicit transaction), scheduling (these are both processes that have to run between 10 PM and 6 AM, why do they both start at midnight?), and many other approaches. There isn’t one right answer, and anyone promising a single right answer in one short blog post is selling you short. That’s either a very long blog post or it’s an incomplete answer. I go with option 3, and here’s my post on the best way to view them along with a couple book chapters to get you started on fixing what you found.

Related Content:

SQL Server Concurrency: Locking, Blocking and Row Versioning By Kalen Delaney
FREE eBook or $22 physical book

Troubleshooting SQL Server: A Guide for the Accidental DBA By Jonathan Kehayias and Ted Krueger
FREE eBook or $25 physical book

SQL Server Running Slow

A user calls to say the app or server is slow today.  Here’s a quick summary of how I get started.  It depends on monitoring where you can compare baselines to recent activity, and I include links to all of that code I use.

The initial question is too vague.  You can’t blame this individual user because this is how 95% of issues make it to you.   The initial follow-up will be the following questions I ask the user along with the corresponding question your boss is asking you:

  • Can anyone else duplicate your issue?
    • Is this the user’s machine or a server issue?  Not a 100% conclusive answer, but it’s a start.
  • Is everything running slow or just one process or screen?
    • Is the whole server running slow or just one stored procedure?
  • Were any changes made to the system recently?  Depending on the user, this may be a question for someone else.
    • Did we cause it by making a change or did this just come up?

However, this is enough for me to do a very quick overall health check while I wait for those answers.  It seems like a lot at first glance, but it’s practically automated with the scripts I have on hand.  This is only opening 5 pre-written scripts, running them, and peeking at the results.

Quick Health:

  • Look at historical wait stats for the last two hours and the same time window both 1 and 7 days ago.  Documented in the Wait Stats post.
  • Look at memory and CPU contention for the same times as wait stats.  You’re specifically looking for high CPU usage, PLE drops, and changes to target memory. Documented in the OS Performance Counters post.
  • Look for blocking in the last hour, even though it would typically show up in wait stats.  You can compare this to previous days if needed, Documented in the Blocking post
  • Look at the large queries that completed in the last two hours along with the queries currently running to see if anything pops out.  This is just a quick glance looking for glaring issues, and you can look for details after getting answers to the follow-up questions.  Unfortunately, this information isn’t as cut-and-dry as the rest, and it’s going to rely on you being a little familiar with what normal is on this server.  Completed queries are documented in the Tracing Creation and Reading posts, and currently running queries are documented in the Running Processes post.
  • Look at the queries recompiled recently to see if a new execution plan is taking excessive resources.  Documented in the Recently Recompiled Resource Hogs post.

So, where do you go from here?  Yeah, you get the typical “it depends” answer from me.  The information you just looked at will normally comes up with something that doesn’t look right and needs to be looked into further.  However, there are so many directions this can go from here that a single document can’t cover it.  The purpose of this post is simply to answer the question “Where do I start looking with the most common question a user throws at me”, nothing more.  Also, I wanted a way to tie all of the random posts together so you could see them working together as the full solution.

Recently Recompiled Resource Hogs

It’s not too uncommon for a query to get a new execution plan that performs a lot worse than it could, and sometimes it’s bad enough to drag the whole server down to a halt.  When it’s something obvious such as a query going from 2 seconds duration to 30 seconds you’ll pick it up on a standard monitoring trace.  That’s not always the case, and the query that hits your server twice a second going from 100ms to 900ms can fly under the radar except for users saying the server isn’t as responsive as it was yesterday.

Before you go all out and throw up an intrusive trace that can slow down your server capturing an unfiltered sample of RPC:Completed and SQL:BatchCompleted to run aggregates on, see what the DMVs can tell you.  If you look in sys.dm_exec_query_stats you can get the total amount of resources used since the last recompilation, which is going to give you pretty much everything you need as long as you’re looking for a query that can stay in the cache.  The only trick is to look at everything in the same context.  Something that did 10,000,000 reads since it was compiled yesterday is nothing compared to something that did 1,000,000 reads since it was compiled 10 minutes ago.  Here’s my answer to this:

SELECT top 10 --WorkerTime_Sec = cast(round(total_worker_time / 1000 / 1000.0, 0) as Int)
    WorkerTimeSec_PerSec =  cast(cast(round(total_worker_time / 1000 / 1000.0, 0) as Int) / (Cast(DateDiff(second, Creation_Time, GetDate()) as DEC(20,2)) + .00000001) as DEC(20,2))
    --, LogicalReads_K = total_logical_reads / 1000
    , LogicalReads_K_PerSec = cast(total_logical_reads / 1000 / (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    --, ElapsedTime_Sec = cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)
    , ElapsedTimeSec_PerSec = cast(cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)/ (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    , Seconds_Since_Recompile = DateDiff(second, Creation_Time, GetDate())
    , Creation_Time
    , qs.execution_count
    , avg_reads = total_logical_reads / qs.execution_count
    , avg_elapsed_ms = total_elapsed_time / 1000 / qs.execution_count
    , avg_worker_ms = total_worker_time / 1000 / qs.execution_count
    , Offset = statement_start_offset
    , ObjectName = OBJECT_NAME(qt.objectid, qt.dbid)
    , qt.text
    , StatementText = SUBSTRING (qt.text, qs.statement_start_offset/2 + 1,
        abs(CASE WHEN qs.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2 
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1)
    --, qt.*
    --, qs.*
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE Creation_Time > DateAdd(Hour, -24, GetDate())
    AND (creation_time < DATEADD(SECOND, -60, GETDATE()) or execution_count > 5) --Has to be a minute old or executed 6 times before I consider the data stable
    --AND (total_worker_time > 20000 or execution_count > 15000)
ORDER BY 1 DESC 

I comment out what I don’t typically use, same as always with me.  Those pieces don’t usually change.  The only thing that does typically change is the date filter on the creation time; if someone says there weren’t any problems this time yesterday then it stays at the default value, if they say it worked fine before lunch then it drops down to 4 hours.

There are four things to keep in mind with this DMV.  Everything is in microseconds which I don’t use, so I “fixed” the problem in my query.  It will only hold data while the query is in cache, which means it’s never tracked here if it’s never in cache.  Each row is for a single plan for a single statement, although  you could easily aggregate it if you wanted to.  Also, it’s possible for the worker time or duration per second to be over 1 due to multiple simultaneous runs or, with worker time, parallelism can cause this.  I typically try not to restate what you should be reading on MSDN or BOL, but these aren’t things everyone expects.

For me to run this, I typically have to already have a complaint that a server is running slower than expected. Check out my Server Running Slow post to see everything else that runs along with this.

Reading Traces

Capturing information is useless unless you know how to use it. Here’s what I use to get the most out of my basic trace of anything taking over so many seconds. This is also my template for more invasive traces, and it gets tweaked on a case-by-case basis.

It’s my typical style with half of it commented out where less run portions can be highlighted and run and portions of my WHERE clause don’t have to be rewritten to be added in. The first line gets highlighted as soon as I open this script every time so I can run it, copy the path of the trace I’m looking for, and paste it into my main script. As for the WHERE clause, call me lazy or whatever it is you call people who can’t remember the right word, but I don’t always remember the exact wording of each column name that I want to filter by. I simply uncomment them as I need them.

The only thing I can say is really impressive here is finding the job names, which is a piece of this script I will never forget figuring out. Other than that, just little things to help readability when both running this script and when playing with the WHERE clause. I can’t tell you why duration is microseconds and CPU is milliseconds, but I can say the SELECT and WHERE clauses are set up to use the same units I would use when I talk to someone.

--SELECT * FROM sys.traces

/*
--sp_trace_setstatus id, status
exec sp_trace_setstatus 3,0 --stop
exec sp_trace_setstatus 3,2 --close
--0 stop
--1 start
--2 close
*/

--query running trace
SELECT TOP 10000 Database_Name = DB_Name(DatabaseID)
    , DatabaseName
    , starttime
    , endtime
    , textdata = cast(textdata as varchar(4000)) 
    , Duration_Sec = cast(duration/1000/1000.0 as Dec(10,1)) 
    , CPU_Sec = cast(cpu/1000.0 as Dec(10,1)) 
    , Reads_K = cast(reads/1000.0 as Dec(10,0)) 
    , Writes_K = Cast(writes/1000.0 as Dec(10,1)) 
    , hostname  
    , LoginName 
    , NTUserName 
    , ApplicationName = CASE LEFT(ApplicationName, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(ApplicationName,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(ApplicationName, 67, len(ApplicationName)-67)
                    ELSE ApplicationName
                    END 
    , SPID
    , TE.*
FROM fn_trace_gettable('D:\Traces\DurationOver10sec_110.trc',default) T
    LEFT JOIN sys.trace_events TE ON  T.EventClass = TE.trace_event_id
WHERE endtime > DATEADD(Hour, -3, GetDate()) 
    --AND Duration > 1000000 * 15 --in seconds
    --AND CPU > 1000 * 10 --in seconds
    --AND Writes > 10000
    --AND Reads > 1000000
    --AND te.category_id = 2 --Auto Grow/Shrink found in default trace
    --AND t.databaseid = DB_ID('master')
    --AND t.DatabaseName = 'tempdb'
    --AND t.LoginName <> 'shood'
    --AND t.textdata like '%%'
ORDER BY endTime DESC, starttime DESC

So, for those of you not too familiar with tracing, what does all of this mean and what am I looking for? It depends on what’s wrong. The day after an app’s upgrade I’m looking for procs that snuck through testing and are killing the server. When a user complains about performance you might see a new proc showing up because it’s getting a bad execution plan. If tempdb or the logs fill up I can tell you what had 100,000 writes and just happened to end at that exact moment. Sometimes I’ll glance to see if anything is hitting or approaching the app’s 30 second timeout.

A general trace such as the RPC:Completed and SQL:BatchCompleted that take over X seconds won’t answer all of your questions, but it will be a good start. The most important things are to always have something running (could be extended events instead of a trace) to tell you the big things that are running and to have your monitoring limited enough where you aren’t the one slowing down the server. Make sure it’s customized to each server since a 1 second duration in a web app is a big deal, but other servers you would get flooded with results unless you bump it up to 10 seconds. You can run more invasive traces when you need them.

There are two hits against extended events where I’m even using traces for this. First, I still have too many SQL 2005 servers around and like to do things as consistent as reasonably possible across all servers to make it simple. Second, I’m still not too happy about extended events skipping SQL:BatchCompleted in SQL 2008 as that misses out on finding large loops of small statements. Extended events are much more efficient, can do a lot that traces can’t do, but aren’t a 100% replacement for traces.

Comment on my posts, let me know what you think. Every post I have at Simple SQL Server is written to myself 5 years ago, and is exactly what I wish I had back then in my own mind. The key words being “in my own mind”, and I’d like to hear what’s on your mind. Be it constructive criticism on this post or a completely separate SQL Server subject you wish you knew more about either now or 5 years ago that you’d like to see thrown up for everyone to use, I’ll never know until you tell me.

Tracing Introduction

SQL Server tracing is essential for troubleshooting performance issues, yet it can put loads on your server that would cause noticeable slowness, capture so much information that it fills the drive, and I’ve even seen it cause a cluster failover more than once.  With a tool this powerful you just need to know how to use it before you use it.  It’s almost like the idea where companies want their entry-level people to have a year of experience.

There is one trace I’d like to see on every server, preferably as a startup proc or a job that starts automatically when SQL Agent starts.  It doesn’t matter how you do it, if you want me to be happy you’ll tell me this was running for at least 24 hours before you had a performance incident.  All I’m looking for is RPC:Completed and SQL:BatchCompleted with a reasonable filter of duration over 10 seconds.

If you’ve never used Profiler before, here’s a quick overview to make the server-side trace I’m using.  Leave the first screen at the defaults, with the possible exception of changing the template to blank.  Select RPC:Completed and SQL:BatchCompleted and set the Duration filter to greater than or equal to 10,000 ms which you can get to using the “Column Filters…” button.  You can collect all columns if you want to, it won’t hurt with these events although there are some columns that won’t help.

Create trace with duration filter

Run the trace, then you can export it to a script.  I’m not happy that you have to run it first, but in the end this is either something that doesn’t affect the server at all or something I run on a dev server that’s at the same patch level.

Export trace to file

There, you have the basic script to start the server-side trace.  You still need to tweak it a bit before you can use it, but the hard part is done.  Everything highlighted in yellow was changed by me.  The default max file size was changed from 5 MB to 25 MB because I view 25 to 50 MB to be the range that’s not annoyingly small or too big to move around or send to someone easily.  The second parameter of sp_trace_create is set to 2 to enable rollover files, the third parameter is set to my output file sans an extension, and the last parameter is added to say how many output files there are.

/****************************************************/
/* Created by: SQL Server 2008 R2 Profiler          */
/* Date: 06/16/2013  06:10:02 AM         */
/****************************************************/

-- Create a Queue
declare @rc int
declare @TraceID int
declare @maxfilesize bigint
set @maxfilesize = 25

-- Please replace the text InsertFileNameHere, with an appropriate
-- filename prefixed by a path, e.g., c:\MyFolder\MyTrace. The .trc extension
-- will be appended to the filename automatically. If you are writing from
-- remote server to local drive, please use UNC path and make sure server has
-- write access to your network share

exec @rc = sp_trace_create @TraceID output, 2, N'D:\Traces\DurationOver10Sec', @maxfilesize, NULL, 5
if (@rc != 0) goto error

-- Client side File and Table cannot be scripted

-- Set the events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 10, 15, @on
exec sp_trace_setevent @TraceID, 10, 31, @on
exec sp_trace_setevent @TraceID, 10, 8, @on
exec sp_trace_setevent @TraceID, 10, 16, @on
exec sp_trace_setevent @TraceID, 10, 48, @on
exec sp_trace_setevent @TraceID, 10, 1, @on
exec sp_trace_setevent @TraceID, 10, 9, @on
exec sp_trace_setevent @TraceID, 10, 17, @on
exec sp_trace_setevent @TraceID, 10, 10, @on
exec sp_trace_setevent @TraceID, 10, 18, @on
exec sp_trace_setevent @TraceID, 10, 34, @on
exec sp_trace_setevent @TraceID, 10, 3, @on
exec sp_trace_setevent @TraceID, 10, 11, @on
exec sp_trace_setevent @TraceID, 10, 51, @on
exec sp_trace_setevent @TraceID, 10, 12, @on
exec sp_trace_setevent @TraceID, 10, 13, @on
exec sp_trace_setevent @TraceID, 10, 14, @on
exec sp_trace_setevent @TraceID, 12, 15, @on
exec sp_trace_setevent @TraceID, 12, 31, @on
exec sp_trace_setevent @TraceID, 12, 8, @on
exec sp_trace_setevent @TraceID, 12, 16, @on
exec sp_trace_setevent @TraceID, 12, 48, @on
exec sp_trace_setevent @TraceID, 12, 1, @on
exec sp_trace_setevent @TraceID, 12, 9, @on
exec sp_trace_setevent @TraceID, 12, 17, @on
exec sp_trace_setevent @TraceID, 12, 10, @on
exec sp_trace_setevent @TraceID, 12, 14, @on
exec sp_trace_setevent @TraceID, 12, 18, @on
exec sp_trace_setevent @TraceID, 12, 3, @on
exec sp_trace_setevent @TraceID, 12, 11, @on
exec sp_trace_setevent @TraceID, 12, 51, @on
exec sp_trace_setevent @TraceID, 12, 12, @on
exec sp_trace_setevent @TraceID, 12, 13, @on

-- Set the Filters
declare @intfilter int
declare @bigintfilter bigint

set @bigintfilter = 10000000
exec sp_trace_setfilter @TraceID, 13, 0, 4, @bigintfilter

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
select TraceID=@TraceID
goto finish

error: 
select ErrorCode=@rc

finish: 
go

This is done as a server-side trace for a couple reasons. Profiler needs to stay open to keep collecting data, SQL Server has to send the data to Profiler which is usually across the network, then the trace data is held in memory. None of these are a problem for a properly filtered short-term trace. This is properly filtered, but I’m really hoping everyone will run this long-term.

I owe you another blog post on querying traces, then we can move in to putting it all together with where to start when a user makes their favorite statement to you…”The server is running slow”.

White papers, DMVs and Monitoring Concepts

According to Jason Strate (SQL Server MVP) there are 5 white papers ever SQL Server DBA should read. Considering my blogging is focused on how to monitor and tune previously installed servers, we’ll focus on the first two of these white papers: Troubleshooting Performance Problems in SQL Server 2008 and SQL Server 2005 Waits and Queues. Ignore the age of these papers, after all you’re running mostly the same code to do mostly the same tasks in SQL Server 6.5 and SQL Server 2012…well, except for the lack of DMVs in 6.5.

I don’t agree with them 100%, but not to the point that I don’t agree 100% that everyone should read them. The sources of the information, the uses of the DMVs, all of it is what you’ll need to know to progress as a DBA. Ask any DBA III or higher and they’ll be able to tell you most of what’s said in these papers, even if they went the hard route and made it to where they are without reading them. Where they fall short is by giving static values that aren’t worded as just starting points, and the trending they do is the trending you can do as a Microsoft Support Engineer where nothing is saved or long-term.

My differences range from minor complaints such as me feeling that a blanket statement saying Page Life Expectancy under 300 seconds is when you should start to worry. First, it’s a blanket statement, and I’m a firm believer in that the word always means you’re almost always wrong. Second, servers have more memory now and my SAN team would kill me if I told them I would have to read 500 GB of data every 300 seconds on some of my bigger servers. I do agree that there are times you should worry about PLE, but you have to keep in mind two things; how much memory are you cycling through in that time and what’s normal for this specific server.  I just hope that regular drops down to 3,000 are closer to the point that grabs your attention.

Lets build off of that last big there…What’s normal for this specific server? The queries they’re giving you are snapshots of what info the server has for you right now. Some of that is what’s happening now, some is what has happened since the last restart of SQL services, and other pieces fall somewhere in the middle. You can’t tell me that I’ve done X amount of work since the server was rebooted two weeks ago and expect me to tell you if that’s an issue. You also can’t tell me that “This proc has used more CPU than any other by far since that reboot two weeks ago” and expect me to tell you if that’s a critical issue or an intense, yet normal and expected, off-hours maintenance task. You can, however, tell me that on a typical Monday between 2:00 PM and 5:00 PM we do X, but this Monday we’re doing 100 times that while the business has no reason to say this Monday should be any different. Now we’re on to something…we know there’s an issue, and we just found our starting point to solving it. Chances are if you’re reading this then either you or your immediate manager would be able to create a database to keep this trending information if it doesn’t exist already, and you’re making it harder on yourself if you don’t.

These two white papers are almost 200 pages long together, so I’m not going to pretend to cover them in one post. Actually, I’m hoping my next 50 posts might cover most of them. The points I’m going for are that these white papers are telling you the perfect stats to watch, and these stats are going to be more meaningful when you monitor them. I’ve started this in some of my posts such as looking at Wait Stats and Blocking, but there’s a long way to go. More than these two DMVs need to be watched like this, but they are good examples of wait stats capturing cumulative data and blocking capturing point-in-time data from DMVs.

If you see a DMV in these papers, you’ll see it in my posts. If it’s not there yet, it will be. Some like dm_db_exec_sessions will never have a dedicated post unless you count Running Processes which links it to several DMVs. Others like dm_os_performance_counters might not fit comfortably in a single post. The point here is that all the DMVs, especially the ones mentioned here, are worth looking into more.

Read the papers. Always go beyond the question of “what should I do”, get to the “why should I do it”, and strive for “how can I make this mean more and be more useful”. Never take my scripts or anyone else’s as-is. Tinker with them, really learn the DMVs, and it will all start to come together.