Optional Parameters Causing Index Scans


Optional parameters in a stored procedure often lead to scans in the execution plan, reading through the entire table, even when it’s obvious that the right index is in place. This is extremely common in procs used behind search screens where, for example, you could search for a person by their first name, last name, city, state, zip code, gender, etc.

To take a simplified look at this, make a quick and easy table consisting of an indexed identity column for us to search by and a fluff column to make the row size bigger.

CREATE TABLE Test 
(
    ID Int NOT NULL Identity(1,1) PRIMARY KEY
    , Value CHAR(500) NOT NULL
)
GO

INSERT INTO Test (Value)
SELECT '';
GO 250

INSERT INTO Test (Value)
SELECT TOP (1000) Value 
FROM Test 
GO 100

Now we’ll create a proc to get data. In this proc we’ll get the data several different ways, showing the advantages and disadvantages of each method.  You’ll notice that I put my comments as PRINT statements, which will make it easier for you to read through the messages tab and see what’s going on.

CREATE PROC proc_Testing 
    @ID int = 125
    , @MinID int = 100
    , @MaxID int = 150
AS 

SET STATISTICS IO ON
SET NOCOUNT ON

PRINT 'Clustered index scan, because the query optimizer has to be prepared for both null and non-null values'

SELECT *
FROM Test 
WHERE @ID = Test.ID OR @ID IS NULL

PRINT ''
PRINT 'It''s a small mess, but the query optimizer knows exactly what to expect for each query.'

IF @ID IS NOT NULL BEGIN
    SELECT *
    FROM Test 
    WHERE @ID = Test.ID 
END ELSE BEGIN
    SELECT * 
    FROM Test
END 

PRINT ''
PRINT 'Expand that to two possible parameters and it still looks simple right now, when it''s doing a scan'
SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
    
PRINT ''
PRINT 'Make it optimized with IF statements and it quickly becomes a big mess.'  
PRINT 'Many "search screen" queries will have 20 or more optional parameters, making this unreasonable'

IF @MinID IS NOT NULL BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID BETWEEN @MinID and @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
        WHERE ID >= @MinID 
    END
END ELSE BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID <= @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
    END
END

PRINT ''
PRINT 'However, the query optimizer can get around that if it''s making a one-time use plan'

SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
OPTION (RECOMPILE)

PRINT ''
PRINT 'And again with the single parameter'

SELECT * 
FROM Test
WHERE ID = @ID OR @ID IS NULL 
OPTION (RECOMPILE)

PRINT ''
PRINT 'However, this leaves two nasty side effects.'
PRINT 'First, you create a new plan each time, using CPU to do so.  Practically unnoticed in this example, but it could be in the real world.'
PRINT 'Second, you don''t have stats in your cache saying how indexes are used or what your most expensive queries are.'
PRINT ''
PRINT 'Another option is dynamic SQL, which still has most of the first flaw, and can also bloat the cache'

DECLARE @Cmd NVarChar(4000)

SELECT @Cmd = 'SELECT * FROM Test WHERE 1=1'

IF @ID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID = ' + CAST(@ID AS VarChar(100))
END

IF @MinID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID >= ' + CAST(@MinID AS VarChar(100))
END

IF @MaxID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID <= ' + CAST(@MaxID AS VarChar(100))
END

EXEC (@Cmd) 

PRINT ''
PRINT 'Yes, you can do sp_executesql which can be parameterized, but it gets more difficult.'
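
PRINT ''
PRINT 'A rough sketch of the sp_executesql route: build the WHERE clause conditionally, but keep the values as parameters'
PRINT 'so each combination of filters gets its own reusable, parameterized plan'
-- The parameter definition string and variable name below are just for this sketch; @Cmd is reused from above

DECLARE @ParamDefs NVarChar(500) = N'@ID Int, @MinID Int, @MaxID Int'

SELECT @Cmd = N'SELECT * FROM Test WHERE 1=1'

IF @ID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + N' AND ID = @ID'
END

IF @MinID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + N' AND ID >= @MinID'
END

IF @MaxID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + N' AND ID <= @MaxID'
END

EXEC sp_executesql @Cmd, @ParamDefs, @ID = @ID, @MinID = @MinID, @MaxID = @MaxID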

Now that we have the data and proc, just execute it. It will be best if you include the actual execution plans, which you can turn on from the menu under Query / Include Actual Execution Plan. If you don’t know how to read execution plans then this is even better! No experience is necessary with plans this simple, and you’ll start to feel more comfortable around them.

EXEC Proc_Testing

Even with the default values for the parameters, the query optimizer knows that you could just as easily have passed in NULL for the values, and it has to plan for everything. Part of that planning for everything is allowing for @ID IS NULL to be the condition that’s satisfied instead of ID = @ID, so the query optimizer can’t say for sure that it can just perform a seek on that column. Let’s step through the results to see what this all means.

In the messages tab we have the number of reads done by each statement, courtesy of me adding SET STATISTICS IO ON in the proc itself (bad idea in prod, by the way). For the first one we have 7,272 logical reads, meaning we read 7,272 pages of 8 KB each, so we had to ask the CPU to churn through over 56 MB of data to find the records that matched.

Looking at the execution plan we can see why that happened. You can think of the Predicate as the “Scan Predicate”, since it’s saying what you couldn’t seek on and had to scan for instead. In this case it shows the entire “@ID = Test.ID OR @ID IS NULL” in there, because it compiled a reusable plan that might be used with a NULL.

Execution Plan of Query 1

In the next statement we were able to do a seek, doing only 3 reads. This is because the B+tree on this index is three levels deep, so it read the root page, then an intermediate-level page, and finally the single leaf-level page that contained the data we wanted.
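
If you want to double-check that depth yourself, sys.dm_db_index_physical_stats will show it. Here’s a quick sketch; keep in mind the DETAILED mode reads the entire index, so think twice before pointing it at a big production table.

SELECT index_id
    , index_level
    , index_depth
    , page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Test'), NULL, NULL, 'DETAILED')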

Take a look at the execution plan here to see the difference with not only the Seek itself, but the Seek Predicate in the details. You’ll see that it uses @ID in there still showing that it’s a reusable plan ready for any ID to be passed in, but the WHERE clause is telling the query optimizer that it can always use a seek to find this data.

Execution Plan of Query 2

Now that I know you can read through a couple yourself, I’ll skip ahead to the 6th query, where we used the same “@ID = Test.ID OR @ID IS NULL” that caused us trouble before, but added OPTION (RECOMPILE) at the end of the statement. Again it did the same 3 reads I just mentioned, still using a seek, but the details of the execution plan look different. The seek predicate used to say @ID so the plan could be reused, but the query optimizer knows nothing with OPTION (RECOMPILE) will get reused. Knowing that, it simply converted @ID into a constant value of 125 and was able to eliminate the possibility of 125 being NULL.

Execution Plan of Query 6

If you want to get more into the details, you can look at the XML version of the execution plan by right-clicking on the plan and selecting “Show Execution Plan XML…”. Here’s what you’ll find for the parameter’s value in the first two queries. It shows that the value that was sniffed when it was compiled was 125, and also that it was run this time with a value of 125. If you ran this again passing 130 in then you’d see the sniffed value staying at 125 because it’s not going to use CPU resources to recompile it, but the runtime value will be 130 for that run.

<ParameterList>
  <ColumnReference Column="@ID" ParameterCompiledValue="(125)" ParameterRuntimeValue="(125)" />
</ParameterList>

The OPTION (RECOMPILE) versions don’t look like this though; the parameter is changed to a constant value even at this level.

<RangeColumns>
  <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[Test]" Column="ID" />
</RangeColumns>
<RangeExpressions>
  <ScalarOperator ScalarString="(125)">
    <Const ConstValue="(125)" />
  </ScalarOperator>
</RangeExpressions>

I hope I left you with a better understanding of what happens when you write code for parameters that could be null. We’ve all done this, and many of us (me especially) have been frustrated with the performance. This is a change I put into production not too long ago that ended scans on a rather large table, and the results were noticed in query execution time, in the page life expectancy counter since the entire table was no longer being read into memory, and in I/O activity levels since that index, and the indexes it used to kick out of memory, didn’t have to be read off of disk nearly as often.

Let me know what you think, and especially if you have a better way of handling this issue.

Why worry about CXPACKET


Many people see CXPACKET at the top of their waits and start trying to fix it. I know this topic’s been done before, but people ask me about it often enough that it should be done again, and again. In fact, Brent Ozar wrote an excellent article about it, and Jonathan Kehayias wrote an article about tweaking it using the cost threshold for parallelism. Even with those, it took Grant Fritchey warning us to be careful about giving guidance to lead me to Bob Ward’s take on it. After all of these people and more saying CXPACKET isn’t what you think it is, enough people still have it all wrong that I felt this was the biggest piece of public opinion to speak out against for Michael Swart’s T-SQL Tuesday #052.

BOL still doesn’t state it completely clearly when it defines it as “Occurs with parallel query plans when trying to synchronize the query processor exchange iterator. If waiting is excessive and cannot be reduced by tuning the query (such as adding indexes), consider adjusting the cost threshold for parallelism or lowering the degree of parallelism.” The true meaning of this wait is much simpler, with Bob Ward’s explanation of “This wait type only means a parallel query is executing.”

Yes, that’s it, it’s not really all that complex. If you have a MAXDOP of 4 then you could end up with 5 threads. Think of it as having one manager who can delegate work out to 4 subordinates. The subordinates do their pieces of the job and the manager marks down CXPACKET on his time sheet while he waits for them. When they finish, the manager compiles their work (repartitions the streams if you want). There was a little extra work done here because the manager wouldn’t have to combine all the work if he did it all on his own (MAXDOP = 1), and it’s possible the subordinates did more work than they had to. However, in most cases, it does get done faster, and duration is what all your users talk about.

The advice in BOL above saying “consider adjusting the cost threshold for parallelism or lowering the degree of parallelism” is incomplete at best. Upping the cost threshold is exactly what it sounds like, making it so only more expensive tasks can be assigned to those subordinates. Then lowering the degree of parallelism is just dropping how many subordinates you can use.

Cost Threshold for Parallelism is something Brent Ozar makes fun of a bit, because its default of 5 is measured in cost units that haven’t meant anything concrete in years. Jeremiah Peschka recommends setting this to 50 and tweaking it from there. I’d recommend 25 for most OLTP because I like parallelism and my procs aren’t stressed. However, the point is that you should only tweak what you understand, and now I hope you have a decent understanding of this. If you have a lot of moderately large queries you wish could break out the work to get it done faster, drop it. If all of your CPUs are running hot and you wish less work was being done repartitioning streams or compiling more complex plans, raise it.

Max Degree of Parallelism is what people really like to tweak, and it’s usually a bad idea. My personal recommendation is to set it to 4 in OLTP (higher in OLAP, very dependent on your load) and forget it unless you have a really, really good reason to change it. The default of 0 is absolutely horrible for OLTP on larger servers, where this could mean that you’re breaking out a task and assigning it to 80 or more subordinates, then trying to put all of their work back together, and killing yourself doing it. This shouldn’t be above 8 unless you really like to be overworked putting your subordinates’ work back together and dealing with those performance problems. However, it’s great to have subordinates who can help you out when needed, so please don’t drop this below 4 unless you have an amazing reason to do so, and only on the query level if you only have one amazing reason.
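
Both of these are normal sp_configure settings if you do decide to adjust them. Here’s a quick sketch using the numbers I just mentioned; treat them as starting points for your own workload, not magic values.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'cost threshold for parallelism', 25;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;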

So, what do you do when CXPACKET is taking over? The trick is to think of it in terms of how a business would work. You have a ton of work to do, enough where your managers are going crazy trying to find subordinates to do it for them. Do you tell your managers to do it themselves OPTION(MAXDOP=1), do you tell them to go out and find as many subordinates as they can OPTION(MAXDOP=0), or do you tell them you’ll see if you can help them find a more efficient way to do their jobs?

If you have the Cost Threshold for Parallelism set to 25 or 50 you have a decent limit before you consider spreading the load around. So, what takes that long for your server to handle? A good portion of the time you’ll find this when you’re scanning a large index or table, or even doing a very expensive seek. If you look at your execution plan, is it doing a scan? If it’s a seek that’s starting your parallelism, does it have a small “Seek Predicate” and a lot to do in the “Predicate” in the properties for that operator? It could be that you could use better indexes, it could be that you could find a more SARGable approach, or a variety of other performance tuning opportunities.

The thing is that parallelism isn’t bad and CXPACKET is saying you’re using a good thing. A high CXPACKET could mean you’re trying to use too much of a good thing because you’re “overworking your managers”, but the workload is typically more of an issue than the fact that you let your manager delegate their work when there’s enough to go around. A CXPACKET a lot higher than your baseline could mean that you have poor statistics or parameter sniffing that’s causing a bad execution plan, or it could mean you have a new query that could use some attention.  When you’re talking about parallelism it’s also possible that the statistics were off and one thread (subordinate) ends up doing 99% of the work, making CXPACKET jump up, and that can be found in the actual execution plans; read more about that in Grant Fritchey’s book.
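
If you want a quick look at where CXPACKET sits relative to your other waits, sys.dm_os_wait_stats is the place to start. This is a minimal sketch with only a handful of benign waits filtered out; a real baseline process would track these numbers over time and filter a longer list.

SELECT TOP (10) wait_type
    , waiting_tasks_count
    , wait_time_ms
    , signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH'
    , N'REQUEST_FOR_DEADLOCK_SEARCH', N'CHECKPOINT_QUEUE', N'XE_TIMER_EVENT', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC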

If you have a MAXDOP around 4 (between 2 and 8, reading these recommendations) and a Cost Threshold of 25 or higher then I’d be looking into the workload more than the fact that you’re using parallelism. Sure, some tasks just don’t take well to parallelism and there are reasons to use a MAXDOP=1 hint, but it’s rare to need to take it down to 2 on the server level and even more rare to turn it off by taking it down to 1.

Whatever you do, when you have “a big query using parallelism”, try to focus on “a big query” first; it’s usually the problematic part of that statement.


Free SQL Server Training (not a sales pitch)


My company cut the training budget this year, and that’s not uncommon anymore. However, I’m still getting my week of training in, and more, without paying much.

Blogs of people who talk too much – FREE

There’s so much for free online, but you’re reading my blog so I don’t have to dive too deep here. Follow a couple blogs in an RSS reader, and make sure you sign up for freebies they may offer like Paul Randal giving away training!

Personally, I try to read a majority of everything written by:

You can include me on this list if you’re feeling generous. However, they’re all smarter than me, and I’ll understand if you don’t.

Local User Groups – FREE

I know most DBAs don’t go to the local user groups. Some never go, some go when the topic is right, but you can benefit from these every time.

Even the presentations that are outside of your normal job duties are valuable. If someone’s teaching SSIS and you don’t use it, knowing what you can learn from an hour-long presentation will put you in a situation to have more intelligent conversations when these topics come up. Read over job descriptions, they expect DBAs to know everything, and here’s your chance to know a little about it.

Also, you can learn from every presentation. I don’t care if you could give that presentation, people in a field known for introverts could use practice and examples of how to present ideas to others. You’ll see good, you’ll see bad, and you’ll be better because of it. Also, one or two small details you might pick up on a topic you know extremely well will make a big difference.

Speaking of our field being known for introverts, this is your opportunity to step out of your comfort zone and present. That’s what I did. Some present because they were born to do it, and I started presenting because I was born to blend into a cube farm! I’ll be presenting at my local user group in April, have already done a SQL Saturday this year, and would like to do at least two more SQL Saturdays, which drags me (kicking and screaming) into improving my soft skills, where a typical DBA, like me, is behind the curve.

Our last meeting featured an excellent presentation by Bill Wolf on avoiding common coding mistakes. It wouldn’t have been too much of a stretch for me to make a similar presentation, but I still learned a couple things and met some good people. The next meeting on March 25th will be about the interdependency between SQL Server and .NET, which would be amazing for any DBA or .NET developer to go to, even if Dell wasn’t giving away free laptops as door prizes!

Finally, the presentations aren’t the real reason you’re going there, are they? You’re getting to know the other DBAs in the area, and you’re sitting there talking person-to-person to the ones that want to share everything they know with you. You get to know them, they get to know you, and when one of you wants a new job or new coworker then you have a better chance at finding a good fit that you can be comfortable with from the start.

If you’re not sure where your local user group is, or if you even have one, start by looking at the chapter listing on the PASS site. Most groups are affiliated with PASS, and that means they’ll be listed there.

SQL Saturday – $10 Lunch

SQL Saturday is a cross between the local user group and the PASS Summit. In fact, the last one I was at has been referred to as a mini PASS Summit because of the number of high-quality speakers there, including multiple names from the list of bloggers I mentioned above.

The cities that host these events typically do so as an annual event, and there’s probably one not too far from you each year. I live in Pittsburgh, PA, which hosts one just a couple minutes from my house. I’m also trying to make it to the ones reasonably close to me, such as Cleveland, OH a couple weeks ago, which is about 2 hours away, and Philadelphia, PA coming up in June, which is a 5 hour drive to a place where I have family I can stay with. I think I’ll skip Columbus, OH because it’s the week after Philly and I have three small kids. Although I won’t make it this year, I may go to Richmond, VA, which is a 6 hour drive where I also have family to stay with.

It’s been a cold winter here in Pittsburgh, and now I’m starting to think about how much I miss my relatives in Texas, California, Florida and South Carolina. It knocks this idea out of the free or low cost training blog post, but that doesn’t mean it’s a bad idea.

While my company did cut the budget, they still paid mileage and gave me Friday off as a travel day. Even without a budget, still ask to see what you can get.

Virtual PASS Chapters – FREE

PASS also does virtual online events. I know I risk making this sound like a PASS commercial, covering the local PASS chapters, the PASS SQL Saturdays, and now the Virtual PASS chapters, but it’s hard to talk about free or low cost SQL Server training without talking about these people. So, why would I avoid making this sound like a commercial for them? They deserve to be known, and with all this free junk they can’t have much of an advertising budget!

Anyways, the virtual chapters each have their own focus, and I’m a member of the PowerShell, Performance, and Database Administration chapters, and I think a couple others. They do presentations about once a month, and I try to watch them when I have time. Worst case, they usually make the recordings available later.

There are also a couple “24 hours of PASS” events throughout the year where there are 24 back-to-back sessions. The next one on the schedule is the Russian version, which I’ll pass on.

BrentOzar.com – FREE

In addition to their blogs, the Brent Ozar Unlimited team (Brent’s just a member, not their leader!!!!) does weekly webinars that are usually very useful to me. I’d say I watch 3 of them a month, missing one either due to my schedule or because it’s a rare one that I don’t feel bad about missing. They put up the recordings later, although I have to admit I usually don’t watch them if they aren’t live. I know I don’t make any sense, but you’ll get used to it.

Twitter – FREE

Seriously, if you can fit your question and #sqlhelp into a single tweet, you’ll have someone pointing you in the right direction within minutes. They’ll point you to the posts you couldn’t find because you didn’t know what keywords to Google, but keep in mind that they’re doing this for free. Always be grateful, and understand when they say that digging into the issue would mean a trip to the consultant.

The End – Sad, but still free

So, yeah, my company cut my training budget. That means that so far this year (March 3rd here) I had to stick to learning from Steve Jones, Grant Fritchey, Erin Stellato, Stacia Misner, Tom LaRock, and others in-person. Then I watched free videos by Jes Borland, Jeremiah Peschka, Brent Ozar, Kendra Little, and Doug Lane. I have a free month of the SQL Skills Pluralsight classes with a code that Paul Randal personally emailed me, with which I’m in the middle of a class by Jonathan Kehayias right now. I’m planning on taking full advantage of this trial with classes by Joe Sack, Erin Stellato, Paul Randal, Bob Beauchemin, and Kimberly Tripp, although I’ll end up running out of time and buying a subscription for $30 / month.

So, is a lack of a training budget stopping you? You can say it is, but it’s not.

I gave you a lot and I missed a lot. This is not all-inclusive, and a quick look around the internet will prove that point. There are still Free e-Books and many other resources out there. I have to start skipping stuff or this post will never make it out to you, and what good would that be?

Hey, look!!! There’s a section right below here marked “Comments” that’s perfect for your addition to the list. Maybe it can be all-inclusive!!!

The Approachable DBA


DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate. It’s common to think of DBAs and developers being polar opposites that don’t speak to each other. This is wrong! It’s not one of the database-themed answers of “it depends”, this is just flat wrong.

DBA with a bat

I may look calm, but in my head I’ve killed you 3 times.

The issue with this is that when you take a polarizing position you get a polarized result. Take Grant Fritchey’s use of his Hickory Stick of Education for example. He goes on a rant about databases needing to be backed up, which, as protector of the data, is the proper view to have. However, the harsh approach many DBAs take to this role, of which this is an exaggerated extreme, alienates their coworkers.

His coworkers on the other end of this conversation, we’ll call them developers to go with the theme here, will take one of two initial views. “I better do what he says, or he’ll beat me”, or “I don’t need this, I don’t need him, I have my reasons, and I’ll do it my way without discussing the details with him.” Unfortunately, both of these answers are wrong, and the ensuing conversation with HR may also be unpleasant.

Although they’re opposite responses, they’re both wrong for the same reason. Neither developer is going to discuss the details. The first developer is going to start backing up the database at any cost, but is a full backup and restore the best recovery option? The second developer isn’t going to back anything up, but are pieces of this database unique and in need of recovery in a disaster? After all, your responsibility as a DBA isn’t making sure you’re backing everything up, it’s making sure you can put it back to where it was in accordance with the written agreement you have with the business owners. In this case it could be data that is pulled from 30 different locations and the best option for recovery is to pull that data again, or the database could be 90% static and the best option is to have a static backup of one filegroup and nightly backups of another, or…talk to the developer.

The number of details they tell you is directly proportional to how approachable you are. If you use a polarizing approach that sets your approachability to 0 then you’ll get 0 details, even to the extreme that people will try to spin up production servers without your knowledge. I’ve seen it happen. I’ve also seen those same developers come back later and ask us to restore something that was never backed up because we never knew it was production. You can’t allow the recovery conversation to be the first time the developers talk to you because that will end as well as trying to do a restore without a backup. It’s your responsibility to make sure there is communication going on.

So, would I hire a DBA with a reputation for not working well with developers? Maybe, after a long conversation about how to work well with others, not as my only DBA, and it would take a lot more in other areas to convince me I was making the right choice. I need to have a DBA who will talk to the developers, go out to lunch with them, and have the developers accept someone protecting our data as one of their own. When they get to that point the developers start asking questions that help them use the database as smoothly as possible, and the DBAs start asking questions to find out why the bad practices are so tempting in the first place.

Stop and think about this. Would I hire you? I would hire Grant because I’ve met him outside of an interview and feel he is much more approachable than his nicknames and that blog post suggest. Just look at his post directly before that one, where he acknowledges the issue, saying “I’m pretty sure most people don’t talk to their Dev teams if they can help it, and it does seem like most Dev teams seem to be on less than a perfectly chatty basis with their DBAs”, and the post after it, where he mentions the struggle to resolve this issue, stating “I spent years at my previous job trying to do a better job interacting with the development teams.” The thing is, Grant’s an odd one; he’s a SQL Server MVP, which means I can find more than one piece of information about him. Does your future manager know other details about you, or is it possible that one rant is the only piece of your reputation they’ve heard? Chances are, if I ever heard pieces of this about you then it’d be an uphill struggle to get to an interview, let alone a job offer. Even worse, IT is a small world, and if you live in the same city as me then I can find someone who has worked with you before who’s not on your list of references.

Now think about it from the side of your current job. Yeah, you just kinda won the fight over making sure this database is backed up, but the developers are starting another project as well. This time they’re making a new database from scratch, didn’t involve you, and it’s an ORM reading and writing directly to tables. Everything’s a clustered index scan or table scan, but it works great with 1,000 rows of sample data. There are no cases where they’re reading from a view, no writes are being done through a proc, and they’ve built a lot of stuff on this foundation before you knew any details of the project. You know that if you had been involved from the start it would have been done better from the start, but, still licking their wounds from their last conversation with you, the developers avoided you as long as possible. There’s a big mess, it’s your own fault, and a lot of people are going to be paying for it for a long time.

I’m not asking for you to be a pushover. Instead the point is to discuss the situations, find all of the details, give them proof of your logic when you differ in opinion, educate them as you accept their invitations to educate you, and go out of your way to be viewed as approachable. The result you’re going for is the same, which is recoverability of data, and I’m not advocating letting off until you achieve that goal. However, that’s not the only interaction with the developers you’ll ever have, and this approach will help you “win” this interaction quicker at the expense of not being able to work together through every other interaction. The only thing I’m going for is that DBAs as a whole need to develop the proper soft-skills to work with developers. This is going to help a lot of people in the long run, including you, your family, the developers, and me.

You: Your job will be easier. Things that would be issues will be handled before they’re big issues, and a lot more things will be set up right from the first place. Your reputation will be where you want it, allowing you to enjoy the job you have and get the job you want.

Your Family: You’ll be coming home from work less stressed, actually enjoying your interactions with developers. Beyond that, you’ll be more and more desirable, getting better and better jobs, bringing home a better income to help provide for your family.

The Developers: Problems are avoided, they have a mentor on the database side, and going to work is more enjoyable.

Me: I get really tired of this whole “Us vs. Them” mentality, hearing DBAs talk and only half jokingly say “and maybe even, ‘gasp’, talk to your developers”, and developers, rightfully so, assuming DBAs are their enemy from the start. If for no other reason, all of the DBAs in the world should be more approachable to make my life easier!!! The only thing you bullheaded DBAs do is give developers ways to slow down! Why do I always have to be here to fix your problems? The whole profession is nothing more than a bunch of socially awkward nerds on a power trip going out and telling developers “NO” with a smile on their face every chance they get. Who needs you?!?!

You know what? The developers that don’t talk to DBAs are usually right in their logic. Why ask someone a question if you know the answer is no, you just don’t know how you’ll get belittled on your way to that answer? In many cases developers aren’t talking to their current DBAs because of experiences at past jobs, but it’s still your problem to fix.

I’ll end this post the way I started it. DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate.

ApproachableDBA.com


Cleaning up the Buffer Pool to Increase PLE


Chances are you have extra information in the buffer pool for a bad query and it’s dragging down your PLE, causing SQL Server to run slower because it’s reading more from disk. Although this approach is taking a 180 from my post Fixing Page Life Expectancy it has the same effect, with that post focusing on fixing your worst queries and this one focused on fixing your most misused indexes.  One approach doesn’t replace the other, it’s more like burning the candle at both ends, except that you end up with a better running database instead of getting burnt out.

With this approach we start with what’s in the cache.  You’ll see the same types of issues in almost any database, and this just happens to be a production database I’m going through today.


The results were found with this query:

SELECT count(1)/128 AS cached_MB 
    , name 
    , index_id 
FROM sys.dm_os_buffer_descriptors AS bd with (NOLOCK) 
    INNER JOIN 
    (
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id)
            --name = 'dbo.' + cast(object_id as varchar(100))
            , index_id 
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK) 
                ON au.container_id = p.hobt_id 
                    AND (au.type = 1 OR au.type = 3)
        UNION ALL
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id) 
            --name = 'dbo.' + cast(object_id as varchar(100))   
            , index_id
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK)
                ON au.container_id = p.partition_id 
                    AND au.type = 2
    ) AS obj 
        ON bd.allocation_unit_id = obj.allocation_unit_id
WHERE database_id = db_id()
GROUP BY name, index_id 
HAVING Count(*) > 128
ORDER BY 1 DESC;

Service Broker Errors Table

First, there’s an errors table here, Service_Broker_Errors, and that should never be in the top 10. What are we doing wrong? The index ID of 1 tells me right away that this is a clustered index, the fact that the table is at least 1.5 GB tells me that we’re probably not purging old data, and 1.5 GB in memory on this table is hinting that we’re probably doing a clustered index scan.

I’ll start by looking at the proc cache to see what’s going on. There’s only one execution plan that used that index and it is, in fact, doing a clustered index scan as I expected.


Predicates or Seek Predicates

In an execution plan you’ll have a seek predicate and just a plain old predicate. The seek predicate is what you were able to do taking advantage of the index being in order, and the predicate is what you had to scan for.


This case is easy because we’re only searching by a single column, but others could have both a seek predicate and a predicate. For instance, if I had an index on my customers table with the key columns Active, First_Name, Last_Name and searched WHERE Active = 1 AND Last_Name = ‘Hood’, it would show up as an index seek with a seek predicate of Active = 1 and a predicate of Last_Name = ‘Hood’. There’s a quick sketch of that hypothetical setup below, and then it’s back on topic with the issue of my Service_Broker_Errors table…
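
This is made-up, minimal code just to illustrate how the predicates split; none of these names come from the production database I’m looking at.

CREATE TABLE Customers
(
    CustomerID Int NOT NULL Identity(1,1) PRIMARY KEY
    , Active Bit NOT NULL
    , First_Name VarChar(50) NOT NULL
    , Last_Name VarChar(50) NOT NULL
)
GO

CREATE NONCLUSTERED INDEX IX_Customers_Active_First_Last 
    ON Customers (Active, First_Name, Last_Name)
GO

--Shows an index seek with a Seek Predicate of Active = 1 and a Predicate of Last_Name = 'Hood'
SELECT CustomerID
FROM Customers
WHERE Active = 1
    AND Last_Name = 'Hood'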

Now this sounds like a reasonable query looking for errors. I’m looking for the errors that occurred in the last so many days. The CONVERT_IMPLICIT(datetime,[@1],0) shows me that someone typed this the lazy way of GetDate()-1, which isn’t as efficient as DateAdd(Day, -1, GetDate()), but you’re getting me off topic again.

Fixing a useless index

Looking at the indexes on this table I realize there is only one, and it has the single key column of ID. For uniqueness you can’t do much better than an ID column, but you have to ask yourself if you’ll ever use it to query by.  In this case the index has never had a seek against it, only scans.  Although the table is rarely queried, with only 4 uses in the last 2 months (I limited my historical data for this query), it’s still pulling 1.5 GB into cache for every use.  After a couple seconds of shaking my head I start to create a change request to add TimeStamp as the first key column in the clustered index.


However, I then remembered that this table is 1.5 GB. Is that right? It’s an error table, so if it’s really 1.5 GB then I should either be cleaning up old data or there are so many problems that there is no reason I should be spending my time tuning. Seeing that it has 0 updates in the last 2 months, I already know it’s old data.  To double-check on this I run a simple query, keeping in mind the ID is still the clustered index, to find the oldest record, and discover that we haven’t removed anything from this table in years.

SELECT timestamp
FROM Service_Broker_Errors
WHERE id = (SELECT Min(ID) FROM Service_Broker_Errors)

So I have to talk to people about how old an error can be before we say we just don’t care. It was determined that we probably don’t care about anything more than a month old. I’m paranoid, it comes with the job title, so I made it three months with my change being this:

DECLARE @ID Int

SELECT @ID = MAX(ID) FROM Service_Broker_Errors WHERE TimeStamp < GETDATE()-90 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (10000)
    FROM Service_Broker_Errors
    WHERE ID <= @ID 
END 

IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Service_Broker_Errors]') AND name = N'PK_Service_Broker_Errors') 
    ALTER TABLE [dbo].[Service_Broker_Errors] 
    DROP CONSTRAINT [PK_Service_Broker_Errors] 
GO 

ALTER TABLE [dbo].[Service_Broker_Errors] 
ADD CONSTRAINT [PK_Service_Broker_Errors] PRIMARY KEY CLUSTERED 
( 
    [TimeStamp] ASC
    , [id] ASC 
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] 
GO 

The reason I did it in this order is that I didn’t want to fool around with the junk values by dropping the clustered index (rebuilding the table), then recreating the clustered index (rebuilding the table again) with all of that information still in there, then deleting it all and wreaking havoc on the new index. Sure, the fragmentation at the page level would be fine since I’m deleting in order, but that’s a lot of changes to the b-tree.

Finally, I add these lines to my nightly cleanup job to keep things trimmed up from now on:

DECLARE @BatchSize Int 

SELECT @BatchSize = 10000 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (@BatchSize)
    FROM Service_Broker_Errors
    WHERE TimeStamp < GETDATE()-90
END

I know, I just got done saying that GetDate()-90 was less efficient than DateAdd(Day, -90, GetDate()), but it’s an implicit conversion that’s done once per call. I admit this is a bad habit of mine, and any time there’s even a chance of this being called once per row I have to write it differently than I normally do. Nobody’s perfect, and the most important thing is that I’m getting rid of all those records using batches, right?
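
If you’d rather not pick up that habit, here’s a sketch of the same batched cleanup written the way I tell other people to write it:

DECLARE @BatchSize Int 

SELECT @BatchSize = 10000 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (@BatchSize)
    FROM Service_Broker_Errors
    WHERE TimeStamp < DATEADD(DAY, -90, GETDATE())
END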

In Conclusion with Service Broker Errors

Now that all of this is done, I took a 1.5 GB table with all of it in cache down to a 20 MB table with only 1 MB in cache. Cleaning up the data had more of an impact than my index change, but it’s usually not that easy to clean stuff up. Even if I wasn’t able to clean up the data, the index change alone would have allowed the cache usage to be decreased by at least 1.4 GB.

On to the next one

That’s not the only issue I’m seeing here, in fact I bet I could take half of these off the top 10 list. However, today I’m picking up low-hanging fruit and moving on to the next task on my list. In this case, External_Messages is the next one that doesn’t look right to me. Once again, it’s a clustered index (index_id = 1) of a table that isn’t part of the primary focus of the database, which is handling orders.

Starting with the same steps, I look in the proc cache to see what put this here. Once again I see a single query in cache referencing the table, but this one is different. It’s the typical X = @X OR @X IS NULL pattern that we’ve all written, and it’s being used as the predicate.

ScreenHunter_01 2014-01-06 16.34.32

I check the indexes on the table and it has the ID as the only key column of the clustered index, so that’s not an issue. Why isn’t it able to use the clustered index with a seek predicate? After all, I’m rather certain that they almost always, if not always, call this with a parameter where it should be filtered down.

Let’s look at the execution plan to make sure. The sniffed parameter value from the first time it was called, as found at the bottom of that statement in the XML version of the execution plan, is, in fact, a non-null value.

ScreenHunter_01 2014-01-06 16.39.02

However, SQL Server can’t guarantee that you’re going to pass it a non-null value, and it has to make an execution plan that can account for either possibility. I’ve seen this before, so I know the basic options for a single optional parameter (there are more options, with increasing complexity). I can either add OPTION (RECOMPILE) to the query or I can rewrite it to be two separate queries.

OPTION (RECOMPILE)

Here’s the option I didn’t choose. This will recompile the statement every run, which isn’t too much of a problem because it’s a quick compile on something that only runs a couple times a day. However, it will make it so this doesn’t show up in the proc cache anymore, which I take advantage of quite a bit (for instance, look at the blog post you’re currently reading). Also, it goes against my rule of using hints as a last resort.

That’s not saying you can’t do it this way, just that I didn’t. The reason it works is because SQL Server knows when it makes this execution plan that it will only be used once, so it doesn’t have to account for the possibility of it being a NULL value next run. In fact, it just makes the execution plan with your parameter as a constant value.
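
For reference, a rough sketch of what that version would have looked like here, using the same obfuscated names as the rewrite below (the proc name is made up for the sketch):

CREATE PROC NotTheRealProcName_Recompile
    @id INT = NULL
AS

SELECT ID
    , Subject
    , Message
    , DateAdded 
FROM ExternalMessage EM
WHERE (EM.ID = @id OR @id IS NULL)
OPTION (RECOMPILE)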


Two queries

Here’s the method I decided to go with. Assuming third-normal form doesn’t apply to query text, you should be good to go here. If it’s simple enough then it will be very obvious to anyone editing this at a later point that they need to make any changes in two places.

CREATE PROC NotTheRealProcName
    @id INT = NULL
AS

IF @id IS NULL BEGIN 
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
END ELSE BEGIN
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
    WHERE ID = @ID
END

This solution isn’t too complex with just a single parameter as it only creates two queries, but if you get just three parameters and try to do this then you’re up to 8 queries. The OPTION (RECOMPILE) method starts to look pretty good as a quick and easy fix before too long. I would still prefer one of the long and drawn out methods, such as getting interim results to a temp table, making it look a little more complex, but getting it to run efficiently with a cached plan.
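
I didn’t go that far here, but a minimal sketch of that temp table idea could look something like this, again using the obfuscated names. The payoff shows up as you add more optional parameters, since each one gets its own simple statement narrowing down the keys while the final query always looks the same to the optimizer.

CREATE PROC NotTheRealProcName_TempTable
    @id INT = NULL
AS

CREATE TABLE #Keys (ID INT NOT NULL PRIMARY KEY)

--Each optional parameter gets its own simple, cacheable statement to narrow down the keys
IF @id IS NULL BEGIN
    INSERT INTO #Keys (ID)
    SELECT ID FROM ExternalMessage
END ELSE BEGIN
    INSERT INTO #Keys (ID)
    SELECT ID FROM ExternalMessage WHERE ID = @id
END

--One final query that always has the same shape and a stable cached plan
SELECT EM.ID
    , EM.Subject
    , EM.Message
    , EM.DateAdded 
FROM ExternalMessage EM
    INNER JOIN #Keys K 
        ON EM.ID = K.ID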

The two-query version worked as I expected, with no one actually getting all of the IDs. I left the code path that returns everything in place so I wouldn’t have to dig through all of the application code to find where it could return everything and make sure it got changed there. The result I was looking for was cleaning up the cache, which I got when this dropped from 1.4 GB down to 25 MB in cache.

The End

Overall I was able to drop about 3 GB out of cache, making room for other data while raising the PLE, in two simple changes that were rather easy to track down just by saying “that doesn’t look like it belongs here”. Even the ones that do look like they belong there probably don’t need to have that large of a presence in the cache.

Although I didn’t actually measure the impact that it would have on the end users in these cases, because I didn’t start with a performance complaint, I would imagine that SQL Server tends to run faster returning the 1 row it found quickly instead of reading through a couple million. So, measured or not, I’ll count that as a success as well.

PS. Jonathan Kehayias helped me

My queries looking into the proc cache are slightly modified versions of Jonathan’s work on his blog. Honestly, it didn’t need to be tweaked or touched for any reason other than the fact that I learn by tinkering. I did help myself by adding in filtering by database, because I have a couple servers that have multiple copies of the same database, but, again, it worked great before I touched it, which is something you should expect from Jonathan’s work.

Here’s my altered version:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @IndexName SYSNAME = '[PK_ExternalMessage]'; 
DECLARE @DatabaseName SYSNAME;

SELECT @DatabaseName = '[' + DB_NAME() + ']';

WITH XMLNAMESPACES
   (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    n.query('.'),
    cp.plan_handle,
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(./IndexScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./IndexScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./IndexScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    i.value('(./IndexScan/Object/@Index)[1]', 'VARCHAR(128)') as IndexName,
    i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./IndexScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS seek_columns,
    RIGHT(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'), len(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)')) - charindex('.', i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'))) as Predicate,
    cp.usecounts,
    query_plan
FROM (  SELECT plan_handle, query_plan
        FROM (  SELECT DISTINCT plan_handle
                FROM sys.dm_exec_query_stats WITH(NOLOCK)) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
      ) as tab (plan_handle, query_plan)
INNER JOIN sys.dm_exec_cached_plans AS cp 
    ON tab.plan_handle = cp.plan_handle
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
CROSS APPLY n.nodes('.//RelOp[IndexScan/Object[@Index=sql:variable("@IndexName") and @Database=sql:variable("@DatabaseName")]]' ) as s(i)
--WHERE i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') = 1
OPTION(RECOMPILE, MAXDOP 1);

Indexing Fundamentals


First, let’s understand what the types of indexes are. If you take a reference book, you have the two types of indexes. The book itself is a clustered index; it contains all the data and has the key field of page number. The index in the back is your nonclustered index; it’s in order by the key field of subject, then has the implied second key field in there for page number, which nonclustered indexes in SQL Server have as well. On top of that, the table of contents at the beginning of the book is just like the b-tree built on every index that helps you get to the right page faster.

The key fields I mentioned are the fields everything is ordered by, just as you would expect a book to be ordered by page number. Nonclustered indexes also have implied key fields, which are all of the clustered index’s key fields that weren’t explicitly made key fields already, which is a great reason to keep your clustered indexes as narrow as possible in most cases, which means they have fewer columns.

If you have a phone book then you have a clustered index with the key fields last_name, first_name, address, phone_number. That means it’s ordered by each one, so any duplicate last names are still in order by first name, and so on. It’s wide for a clustered index, but doesn’t have any nonclustered indexes to worry about, so wide is fine here.

However, the important thing to notice is that if I asked you for people with the last name “Hood”, you’d find them rather quickly. If I asked you whose phone number is 412-555-6789, you’d cringe. The same thing happens in SQL Server where it will do the work you ask of it, but it won’t be cheap, quick and easy. This is why the internet has reverse phone number lookups where they have a nonclustered index on phone number which lets you find the name quickly by just knowing the number.

Indexes can also have included columns, which are stored in the index but have no sorting done on them. These fields don’t even exist in the b-tree levels used for finding the information, only at the leaf level where everything is stored. To simplify the process, this is where you put fields that you have in your SELECT clause but not your WHERE clause, since you don’t need them sorted to search quicker, you just need them to fulfill your query’s needs.
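
To put the phone book in SQL Server terms, here’s a quick sketch with hypothetical names; the nonclustered index is the reverse phone number lookup, and the included columns let it answer the question without ever touching the “book” itself.

CREATE TABLE PhoneBook
(
    Last_Name VarChar(50) NOT NULL
    , First_Name VarChar(50) NOT NULL
    , Address VarChar(200) NOT NULL
    , Phone_Number VarChar(20) NOT NULL
)
GO

--The book itself: everything, ordered the way you'd flip through it
CREATE CLUSTERED INDEX IX_PhoneBook 
    ON PhoneBook (Last_Name, First_Name, Address, Phone_Number)
GO

--The reverse lookup: ordered by phone number, with the rest included so no key lookup is needed
CREATE NONCLUSTERED INDEX IX_PhoneBook_Phone_Number 
    ON PhoneBook (Phone_Number)
    INCLUDE (Last_Name, First_Name, Address)
GO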

How are they used, and what do they cost?

A nonclustered index is used exactly the same way you read a reference book. You look up the key field in the back of the book and see what information is there. If you covered all the information that you needed just by looking at that index then you’re done. However, if you want to know more that wasn’t listed in the index then you take the key field of the clustered index (the page number) and do a key lookup by flipping to those pages and reading more about the subject at hand.

If you noticed, reading all that information in the back of the book seems pretty simple and efficient, especially when all of the information is there so you don’t have to touch the book itself. If you don’t have to touch the book then that’s called a covering index, which can be accomplished by adding the fields you want to retrieve to an index as either included or key columns. Depending on how many of these key lookups you have to do and how much extra information it has to get from the clustered index, it may make sense to make your index a covering index.

However, none of this is free. Every index you make has to be maintained. If you ignore filtered indexes, which I’ll get to soon, every time you insert or delete rows on a table you have to write to each index. So if you have 6 nonclustered indexes, it doesn’t matter if you have a clustered index or a heap (merely a lack of a clustered index), you have to do one write to the table itself and one write to each nonclustered index, so 7 writes for that one row.

Add on top of that maintaining those indexes for fragmentation, extra storage needed on disks and backups, and more data that can possibly be held in the cache, and this stuff really starts to get expensive.

How many indexes should I have?

There’s no magic number of indexes you should have on any single table. Too few makes it hard to find the data, and too many makes a maintenance nightmare with data modifications taking longer. What indexes and how many will vary wildly, but there are methods to figure out what’s appropriate for each table. Chances are, many of your tables will benefit from both dropping indexes and creating them.

Consolidate what you have

First, look at the indexes themselves. If you have indexes that have the same first key field as the clustered index then you can consider removing those indexes with little to no detrimental effects. This is regardless of whether these indexes are being used or not.

Now, look at what indexes are never used. This can be done by looking at sys.dm_db_index_usage_stats to get you up to a month of data (assuming a monthly reboot for patching), or, my preferred method, by querying the tables you have set up to monitor this DMV. If an index is never being used it means it’s never helping you, but it is extra weight that’s adding up and slowing your database down. Just make sure you have enough data to say that the index really isn’t needed before relying on this alone.
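
Here’s a rough sketch of that check straight against the DMV, keeping in mind the counters reset when the instance restarts and that you want to hang onto unique indexes even if they’re never read (more on that below):

SELECT OBJECT_SCHEMA_NAME(i.object_id) + '.' + OBJECT_NAME(i.object_id) AS table_name
    , i.name AS index_name
    , i.is_unique
    , us.user_updates
FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us 
        ON us.object_id = i.object_id 
            AND us.index_id = i.index_id 
            AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND i.index_id > 1 --nonclustered indexes only
    AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC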

Also, look for patterns in the indexes themselves by comparing the first couple key fields. If the first couple key fields are the same, these are candidates for having the indexes combined into a single index. Depending on how quickly those key fields make the data semi-unique, you may be talking one to three key fields matching before I’m itching to reduce the number of indexes. You can do this by taking the columns that are in the second index that aren’t in the first and adding them to the first index as included columns, then dropping the second index. Queries that used to hit either one of these indexes should run slower by an unnoticeable amount, yet you’re only storing data, doing updates, and maintaining a single index instead of two.
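
As a hypothetical example of that consolidation (the table and index names are made up): two indexes lead with the same key fields, so rebuild the first one to include the extra column from the second, then drop the second.

--Both indexes lead with (CustomerID, SaleDate); fold the second one's extra column into the first
CREATE NONCLUSTERED INDEX IX_Sales_CustomerID_SaleDate
    ON Sales (CustomerID, SaleDate)
    INCLUDE (SaleAmount, SalespersonID)
    WITH (DROP_EXISTING = ON)
GO

DROP INDEX IX_Sales_CustomerID_SaleDate_SaleAmount ON Sales
GO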

When you’re going through this process you just have to be careful that you understand the implications of dropping a unique index. It is rarely a good idea to get rid of a unique index unless that uniqueness is enforced by other means as well.

Take this example.  I have a heap with a unique index on Col1 that gets scanned, but no seeks.  Also, there’s a NC index on Col2 that has a lot of seeks and is obviously doing a lot of lookups to get more information from the heap.  In this case I happen to know that the “unique” part of the index on Col1 is there because that’s an identity column, so I’m not too concerned about maintaining the unique constraint here.  After seeing the usage stats over a month I’m able to restructure this table.  Both nonclustered indexes were dropped, and a new unique clustered index was created on Col4 and Col1.  User lookups were eliminated since it’s all just a clustered index now, we saved about 700 MB on disk, and the compressed equivalent on backups as well.

Heap_With_Unique_NC_Identity

Take another example where you can see that an index is duplicated.  No, none of these are 100% the same, but the second index is the unique (see the is_unique field), clustered index (the * for included columns is just my way of showing it’s the clustered index) with a single key column of Col1, which is the same as the first nonclustered index.  Seeing the usage like this, I’m going to drop the first index, saving 435 MB of space in my database.  This will have a negative impact in that it will move 626 index scans done throughout the month to an index that’s twice the size, and that’s a problem I’ll revisit later when I’m tuning queries that have high physical reads.

Notice you can also see two issues on this table that I’ll need to look into.  First, I have a nonclustered index that’s larger than my clustered index; something is wrong here.  Perhaps I should have put the fill factor on here to show that wasn’t the issue.  The real answer is that I’m working on a table that somehow got missed by the index maintenance, and I now know that this is possible and needs to be fixed.  Another issue is that there is a filtered index on here that has only about 0.1% fewer rows than the unfiltered indexes, which means that you’re getting essentially no benefit while alienating queries that may have used this index otherwise.

Unused NC2

And one last example.  The 2nd and 3rd indexes both have a single key field on Col2 and just vary on the included columns.  This is a simple task of combining the two indexes by making one of them include Col1, Col2, Col6, Col7, Col8, and Col9, then dropping the other.  That saves us about 200 MB on disk, and avoids us having duplicated data in the cache.  Ask a SAN or backup admin and they’ll tell you the disk savings is great, but ask an unrealistically informed user and you’ll realize that the cache savings is what makes a difference, dropping your physical reads, upping your PLE, and making your app run faster.

Then, there’s another simple task on this table of dropping the final index, which is nothing more than 112 MB of dead weight that was being updated an average of 30 times a day.  Not much to say here, they obviously didn’t use this as much as they thought they would.

Finally, you’re left with the info that User_Lookups is almost as high as the total number of seeks and scans on the nonclustered indexes.  This means that almost every query that hits the nonclustered indexes needs to do a key lookup on the clustered index.  Knowing that, if I was going to get to that level of tuning, I would be looking into the queries that hit this table and determining if I want to add more included columns to actually cover the query, or admit that I’ll never cover it efficiently and remove some included columns.  My answer here will vary partially depending on how many records are involved – key lookups are ok for 10 rows, but not so much for 1,000,000 rows.

Duplicated Index

None of these are hypothetical situations I forced AdventureWorks to do.  This is a real, production database, and I wrote two queries against it.  One to help me find these problems so I can put a change request in to fix them, and a second query that does the exact same thing while obfuscating the names to protect my employment.  So when you see the duplicated indexes or the sizes I’m working with, you can’t say that would never happen in production….it just did.

You reduced the load, now add to it

Now that you have fewer indexes in your database, it’s time to add to them. There are several ways to figure out what indexes you want to add, and I prefer to do everything manually. That means there’s no DTA (Database Tuning Advisor) here, it’s all scripts with proof that what I’m doing is making a difference.

Where are you finding the queries to tune

There are a couple approaches to finding the queries in most need of attention. Query Stats will give you the queries that take the most resources during your peak times, which will reduce the load on your server the quickest. Traces on duration will give you the queries that take the longest along with sample calls with all the parameters you need. Blocking monitors will give you the queries other queries are waiting on the most. Then, one of my favorites, job shadowing gives you what the pain points are for the users.

The first thing I look at is blocking, since that silently sits in the background making queries that could run extremely fast sit there and wait for a single slow query. If there is a lot of consistent blocking then take the head blockers and start tuning them. This won’t eliminate the fact that they’re grabbing locks, but it will cause those locks to be released quicker. Locks by themselves are not an issue and shouldn’t be eliminated, but the impact the locks have on other queries is an issue that is best handled this way. If you don’t have many locks that go on too long, you can skip this and revisit it later. I say “too long” instead of giving you a number because it varies a lot; 5 seconds is too long for a web app while a minute isn’t much for a data warehouse.

Once that is out of the way, I’ll start diving into the most expensive queries on the server. They may not be the ones with the longest average run-time, but they add up to the most on the server. If you can cut the reads from 100,000 to 1,000 on a proc called every second then you’re doing more for the server than you would by tuning an hourly query that does 10,000,000 reads. Also, focusing on physical reads will help raise your PLE if that’s a concern on your server, which it usually is. Don’t stick on this for too long, you’re just trying to get the easy tuning out of the way and hopefully change the look of your Top-10 most expensive queries list a bit.

Now I move on to looking at the traces to see if any huge queries are running that are easy tuning opportunities. Again, skim through it, pick up the low hanging fruit and move on. Right now you’re still in the phase of reducing the load on the server, but you may not have hit what the users see yet.

Another place to look is in the cache to find out what indexes are taking up the most space there. If you know your data well enough then you should know what to expect from your cache. What you see there will probably surprise you, and you’ll probably find an index_id of 0 or 1 (heap or clustered index) sitting there taking 10 times the space you would expect it to. These are cases where clustered index scans are taking place that you need to find. I was recently able to clear up 12 GB of cache on a server doing a full index scan on a 12 GB table where a 10 MB filtered index was able to do the job, which made PLE almost double.
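
If you want to see this for yourself, here’s a rough sketch of querying the buffer pool. The first query shows how much cache each database is using; the second, run inside a single database, breaks it down by index. On servers with a very large buffer pool this can take a little while to run.

-- Cache used per database
SELECT DB_NAME(database_id) AS DatabaseName
    , COUNT(*) * 8 / 1024 AS CachedMB
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY CachedMB DESC;

-- Cache used per index in the current database
SELECT OBJECT_NAME(p.object_id) AS TableName
    , i.name AS IndexName
    , i.index_id
    , COUNT(*) * 8 / 1024 AS CachedMB
FROM sys.dm_os_buffer_descriptors bd
    INNER JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
    INNER JOIN sys.partitions p ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2 AND au.container_id = p.partition_id)
    INNER JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id, i.name, i.index_id
ORDER BY CachedMB DESC;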

So, now that you have the server running a little (or a lot) better in general, I get into the work that never ends. Get a hold of the people using the app, either by yourself or asking customer service to arrange it, and find out what they hate the most about the app’s performance. What you need here is for someone to write a list of when anything took a while, what screen they were on, and approximately how long it took. Don’t go requesting anyone to use stopwatches, you’re looking for “a couple seconds”, “about a minute”, or “I click OK then go get a cup of coffee.” Your goal here is to be able to correlate their pain points to something captured in your traces, giving you the procs they’re using along with real-life example calls to dive into.

Something you may find odd is that I’m diving into the part that affects the end users the most last, but I do it on purpose. The other pieces are things I skim through and do what I can in a week or so, but they’ll make the bigger impact on the server in general while resolving some of the customers’ complaints along the way. In the process, you’re learning more about the server and how everything fits together, so that when you start talking to your customers you can make a more immediate impact on their complaints.

After getting through the list once I let my ADD take over, and man do I have ADD. The customers have an idea of what a DBA can do for them now, to the point that they may even be contacting you directly when something annoys them, developers realize a SQL Server specialist can be a good person to lean on, and your server’s biggest pain point hurts a lot less. Once you get to this point, jump around on this list on your servers and take up a query or two when you feel like it.

New Job Checklist


I started a new job a month ago, but didn’t want to just figure things out as they came to the surface. Instead, I walked in with a list of questions I wanted to know about their environment. For both my side and theirs, this went over very well.

For those used to my blog, you’d expect all the performance monitoring to be a part of it, but that’s not all there was to it. In fact, the most important questions had to do with backups, recoverability, and nonprod environments. Let’s go through it all in the order that I feel it’s important. This is already a long enough list just saying what to watch for without getting into the details, and I’ll do deep dives on each section either over time or upon request.

Staying Organized

Do we know what SQL Servers are out there and which ones are prod? Also, who do we contact if one either went down on its own or if we’re going to intentionally take it down for maintenance? For when you do take it down intentionally, what’s the typical maintenance window for each server? Also, what builds are you running, including build number and architecture for both Windows and SQL Server? There are a lot of things we should know about the servers we’re working on, right up to basic hardware stats with current and max RAM and the number of cores for licensing. Know what you’re working with and make sure you have all the access you need on all the servers you’re supporting.

Backups and Recoverability

When was the last time each Tier-1 system was recovered from offsite backups? I’m looking for two things here. First, that they are being backed up offsite. Second, that we know those backups are working by testing them regularly. I don’t care if you use tape or disk, and I don’t care if the “test” was an emergency restore to get data back that someone accidentally deleted a week ago. Do they exist and do they work? In my case, I found a bit of a surprise here and ended up getting in the middle of a business continuity plan, as seen in last week’s post, Disaster Recovery and Business Continuity.

Continuing on with Business Continuity, I wanted to get the typical misconceptions out of the way. So I continued with the questions of how long it would take to recover from different types of disasters and what the business thinks those expectations are. Also, how much potential data loss does the business think is possible compared to the current recoverability? I know from my last blog post that some people feel that I’m overstepping the boundaries of a DBA with some of this, and I’m ok with that. My immediate boss is the VP of Infrastructure, and I think that people would do much better trying to view things from their boss’s point of view instead of their current position’s point of view.

Security

Now that I have basic recoverability covered, I moved on to security. First, most companies have too many people with access to prod along with sensitive data in all environments, and this is easy to overlook. First things first, protect prod at all costs. Who has what permissions, and specifically, who has sysadmin? It used to default to BUILTIN\Administrators, and your old boxes will still have that default, often unchanged. Don’t trust them to tell you who; query it yourself.
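
Here’s a quick sketch of the kind of query I mean; it lists the members of the server roles so you can see exactly who has sysadmin instead of taking anyone’s word for it.

SELECT ServerRole = r.name
    , MemberName = m.name
    , MemberType = m.type_desc
    , m.is_disabled
FROM sys.server_role_members rm
    INNER JOIN sys.server_principals r ON r.principal_id = rm.role_principal_id
    INNER JOIN sys.server_principals m ON m.principal_id = rm.member_principal_id
WHERE r.name = 'sysadmin'  -- drop this filter to see every server role
ORDER BY r.name, m.name;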

The next concern is the sensitive information that needs to be in prod and gets carried down to nonprod. Information such as SSNs doesn’t need to be accurate in nonprod, and there doesn’t need to be a way to recreate it from the nonprod data. However, you have to be careful when making this data useless, since many apps join on this field, which means you need a consistent algorithm to obfuscate the data. Hopefully you have another field such as EmployeeID or ClientID that you can substitute consistently and the developers would never know the difference.

Server Builds

Is there any consistency to the servers? Is the Min/Max server memory set up properly? Are they using Instant File Initialization where possible? How big and how many files are there for TempDB? What is MaxDOP set to? What accounts does SQL Server run under, and is it a different account for each server? You need to have consistency if you’re going to consistently manage these servers. Yes, the values will vary, but it’s good to know that they’re set up right before you start getting into the issues. Jonathan Kehayias and Brent Ozar both have good server build docs out there, and I think most DBAs could benefit from reviewing what they have.
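
As a starting point, something like the sketch below pulls a few of those settings from sys.configurations along with the TempDB files; run it against each server (a registered servers group or CMS makes that painless) and the inconsistencies jump out quickly.

-- A few of the settings worth comparing across servers
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('min server memory (MB)'
    , 'max server memory (MB)'
    , 'max degree of parallelism'
    , 'cost threshold for parallelism'
    , 'optimize for ad hoc workloads');

-- Number and size of TempDB files
SELECT name
    , type_desc
    , SizeMB = size * 8 / 1024
    , growth
    , is_percent_growth
FROM tempdb.sys.database_files;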

Depending on how they do it, the DBA will have some role in setting up failure notifications for SQL Agent as well. I’ve seen this done with Database Mail, where the DBA does almost everything, and with an external program that reads the error logs, where the DBA just makes sure the jobs are set to write to the app log on failure. Either way, you have to query the system tables on all the servers to make sure the settings are what you expect them to be.
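
A sketch like this one against msdb shows which jobs are set to notify on failure, whether that’s an email to an operator or a write to the event log; anything with all zeros is failing silently.

SELECT j.name AS JobName
    , j.enabled
    , j.notify_level_email     -- 0 = never, 2 = on failure
    , j.notify_level_eventlog  -- 0 = never, 2 = on failure
    , o.name AS OperatorName
FROM msdb.dbo.sysjobs j
    LEFT JOIN msdb.dbo.sysoperators o ON o.id = j.notify_email_operator_id
ORDER BY j.name;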

You also can’t forget about the maintenance jobs, for which I believe everyone can benefit from Ola Hallengren’s Maintenance Solution. You may not use all of it, but the parts you do use you won’t write this thoroughly yourself. Ola did something amazing here, put it out there for free, and you’d be crazy not to at least look at his generous offerings.

Staying Up-To-Date

It’s best to find out up-front what the plans are to move forward with getting everything to SQL 2012 or 2014. Even below that level, it’s good to know the process of getting all servers to the latest service pack and what the plan is with hotfixes and cumulative updates. The quicker you know this information the quicker you’ll know what’s on your plate for the next couple months.

How are we doing?

How do other departments view the DBA group? How do our customers (end users of any type) view our overall performance? How will you ever know if you don’t ask? The answers to questions like these are going to make a big impact on where your focus is. I would love to have questionnaires go out to Sr. Management, Mid-Level Management, and to End Users to find out their thoughts on a lot of different things, with a mix of open-ended questions such as their 3 favorite and 3 least favorite things on different areas and simple ratings questions with a rating from 1 to 4 (there is no average, there’s horrible, bad, good, and great). I want to know what they hope the app can do differently a year from now, how they view performance, and what their biggest headaches are.

In my case I made a mistake here when starting my new job because I came in knowing they had severe performance issues. When I looked at the different servers I saw a 2 TB OLTP with PLE dipping below 1,000 at times and a 100 GB OLTP with PLE staying above 10,000. I KNEW which one they were complaining about and started making some huge improvements. I was wrong, and that 100 GB database needed a lot more help than it was letting on. That’s not to say that the improvements I made weren’t needed or appreciated; it’s that I was fixing stuff that internal users were complaining about while waiting for another manager to ask me why I wasn’t focusing on making life better for the people who were paying our bills. It flew under my initial radar because it was smaller waits and coding issues, while it was a constant struggle for the users that could pick one of our competitors over us.

I made a couple large changes that the SAN admin should notice on the 2 TB database and I got a couple “good job” comments out of it. Then I made a smaller change on the 100 GB database and got genuine gratitude and appreciation from multiple people. It wasn’t that I was working harder, it’s that my effort was focused in the right place after knowing where we stood and where our pain points actually were.

Finally, Monitoring

Once you know what you have, if it’s being backed up properly, it has security, is built consistently, there are plans to stay up-to-date, and you know where the databases stand from others’ point-of-view, you are finally ready to start looking into monitoring your servers. Look at my Monitoring and Baselines Presentation to give you an idea of what I’m looking to watch on each server. To be fair, we have Idera at my new job and I think that works out rather well, so there are things that I don’t have custom code watching. What I do is make sure that there is something accessible that watches everything I need and most of the things I want without duplicating too much work.

I do have complaints about Idera, and perhaps I’ll get into the pros and cons of the product in a later post. As far as this type of product goes, it is by far the one I have the most pros to talk about, so if any of the folks at Idera are reading this then you shouldn’t be sweating too much.

Wrap it up already!

This is a quick overview of the things that I’d prefer to talk about in my first two weeks on a job. Nothing was hit in-depth, and I could definitely write a whole post on each subject. However, don’t count on that happening. If you want a dive into a subject, ask.

I’d love to hear your comments on this, with the warning that I will stick to my guns completely if asked about wandering outside of the DBA role. I used to stick closer to my job description and requested duties, but didn’t feel like I fully bloomed until I tried to see everything from a higher view, most often from my boss’s view. Every organization can benefit from more leadership, and leadership has nothing to do with position in the company. I truly hope that this post encourages at least one or two people into taking the lead and getting the full picture under control, no matter whose responsibility it was.

Disaster Recovery and Business Continuity


Fires, SAN failures, tornados, and car accidents all came up today, yet it was a very good day at work.  I just started my job two weeks ago and made a list of questions to go through with them. After asking the one about when they last pulled an offsite backup and restored each Tier-1 database, I ended up in the middle of a project to restructure the disaster recovery and business continuity plan.

As a DBA I thought my job was to ensure that a database was recoverable from any disaster and that if you focused completely on the worst disasters possible then you would be safe from anything.  In the short-sighted version of things, perhaps I’m still right.  However, in the realistic version I had much to learn, and still do.

What is my job?

First, my job isn’t to make sure the databases are recoverable, it’s to make sure that database recoverability is part of a full business continuity plan.  That goes well beyond restoring a simple database to some server that I have no idea where it’s at or how it got there connecting to an unknown network without an app server.  There is so much more to a DBA’s responsibility in this aspect than I once believed.

If the app isn’t restored with the database then there’s no purpose for the database.  If there’s nowhere for the employees to sit and use the app then there’s no purpose for the app.  Finally, if you restore too far from your current location then no one except the most senior and dedicated employees will travel to work.  All of this has to fit together for the mere responsibilities of a DBA to matter at all.

Second, the worst of disasters isn’t the most common, and the multiple day turnaround time is unreasonable for something as simple as a SAN failing or a building burning down.  Save the week long turnaround times for the disasters that remove your ability to work in the city you’re in.  Small disasters shouldn’t cost more than a couple days of data and productivity loss.

You need to look at disasters in different levels, and you’ll realize that as you go up the levels they become increasingly less common.  Therefore you can’t have a single, simple business continuity plan and expect it to cover it all.  Also, if you don’t have anything in place yet, don’t wait until you have it all figured out before you implement the first levels.

Missing Data

At the most common, you have a processing mistake.  The server and infrastructure are perfectly fine; someone just did something they shouldn’t have done.  Forget about the details, it doesn’t matter to you at this point if it was a developer mixing up < and > (haha, brings back memories of lost weekends), a user deleting the wrong records, or a DBA making a simple mistake.  No matter what, your challenge at this point is the same: you’re missing data and the business wants life back to normal as quickly as possible.  For this the business needs to define what level of recoverability you need to have and help ensure you have the budget to do it.  If they never want to lose more than 2 minutes of data, you need log backups.  If an hour is ok then you can get by with Volume Shadow Copy Service (VSS) snapshots on VMs or possibly differentials on physical boxes.  If a whole day of data loss is acceptable then you’re home free with daily fulls or differentials.  No matter what you’re talking about here, this should be easily accessible by the DBA without outside help anytime day or night.

Small Hardware Failures

Now let’s move on to local hardware failures.  A single drive shouldn’t be a concern because there’s no excuse not to have a minimum of RAID-5 on absolutely every file on a server.  That’s not saying you don’t have to double-check that the server “Prod-012” isn’t a laptop sitting under a stack of papers on an engineer’s desk, I’m just saying there’s no excuse.  Anyways, motherboards don’t last forever, RAM fails, and if your RAID-5 array is too large then it’s no safer than a stand-alone drive.  If you lose something, how long will it take to have something else take over?  If you lose a couple drives do you have enough spare space to get you by until you can replace them?  As for servers, Active/Passive clusters or VMs are great; one loss and you’re up and running in 5 minutes without too many complaints.  Active/Active clusters are also great if they’re set up right, as long as you’re not double-dipping on the RAM with two instances each set up to use 50 GB of RAM on a box with 64 GB of RAM, where the OS will lose the fight quickly.  Standalones, however, are out of luck and need to have a plan in place.  How long is the business able to deal without that server?  Did the business give you that answer in writing, and is the person in the business responsible for that server currently aware of that potential downtime as of your last annual review of disaster recovery and business continuity?  Wait, do you do an annual review of your disaster recovery and business continuity with the business???  Many companies don’t, and it’s your responsibility to start that conversation.

Shared Hardware Failures

Shared hardware and services also fail, no matter how fault-tolerant you think they may be.  Power goes out; do you have a generator?  Networks go down; do you have two connections that can handle the traffic at an acceptable speed coming in through separate hardware?  SANs spontaneously combust; were your data files and backup files on that same rack?  Walk through your server room one day with a system admin and point to each device, discussing what happens when it stops working.  Telephone poles and transformers take time to replace; were your fault-tolerant connections coming in on the same pole?  Now we’re at the point where I’m not expecting you to have a spare on hand, but you still need a plan.  SANs are expensive, so you at least need to know where the money is going to come from if you need a new one overnight.  Again, know your risks, know your recovery time, and make sure the business has agreed to that in writing.

Building Loss

Now for what people tend to have in mind: fires, tornados, floods, blah, blah, blah, they all happen.  I’m in Pittsburgh, so hurricanes are covered under floods and there is no specific earthquake plan.  Is your server room in an area that would get flooded more than once a century?  If so, can it be moved?  If the building your server room is in is no longer there, what do you have to recover with?  Offsite backups of databases aren’t enough.  You aren’t just restoring databases at this point.  If that’s what you were thinking, you’re in a heap of trouble here.

First, you’re panicked; everyone’s panicked. Where is your plan, it wasn’t in that building, was it?  Who knows where to find the plan since you’re not in a disaster where you can count on everyone being available anymore.  Can everything be restored to a functioning state with acceptable data loss from data stored offsite, and was it tested to prove that it could be?  Remember, if you’ve only restored the database then you’ve restored a lot of valuable, yet useless, information.  Are the app servers in that same plan and up-to-date so the users can use the data?  Where are those users going to sit?  I know, this is on a DBA’s blog, so mostly DBA’s are reading this, we don’t tend to think about a user’s cube, or if their temporary cube is further than an hourly employee would even consider traveling to work.  However, right now our job is business continuity, not strictly databases.

So, say you do have an offsite, easily accessible plan, written so that panicked and stressed employees, mixed in with temps covering for the employees who had to tend to their families instead of showing up for work, could understand and implement it flawlessly; what does the business expect?  I keep coming back to this because the answer is usually that they think we have a million-dollar plan while giving us a small budget.  Realistically it may take us a week to have it all together, and we’d never know the full picture without testing it.  However, the business may be there saying that if it’s down for more than four days then we’re out of business, which would really suck after putting all that work into this awesome, yet lower-budget plan.  Make sure everyone knows the expectation, and have that offsite plan not only signed off by the business, but the proof that they signed off on it stored offsite as well.  The IT department isn’t the only department stressed out when the data center burns to the ground, and stressed-out people like to blame others for everything no matter who is at fault.

Oh no, I never thought that would happen

Finally, the largest of disasters.  What if the entire city where your data center is located is inaccessible?  I know, how often does that really happen?  Hurricanes took out New Orleans for months and parts of New York City for weeks, earthquakes threaten to do the same to parts of the country, terrorist attacks are possible, plagues can happen (although I don’t foresee the zombie apocalypse), and I now know where the nearest nuclear power plant is.  Do you have recent backups stored outside of the region that can recover from these types of disasters, and has the business signed off on this?  For some companies it is acceptable for the answer to be that such backups don’t exist and it’s acceptable to go out of business in these disasters, and that’s ok.  Again, get it in writing, and store that piece of information in your personal offsite storage (yay, Google Drive).  If this is the case, you’re protecting your rehirability when a company goes out of business saying their IT department failed them in a disaster.

I provide advice, not business continuity plans

Now it’s time for my disclaimer.  This is by no means all inclusive, and nothing you get from ANYONE’s blog should be taken as such.  Some companies have someone who has the job title VP of Business Continuity, and some don’t.  No matter what the case, it’s your responsibility as a DBA to make sure the DBA team has a part in this process and that the process is implementable for all levels of disaster with restorability that the business is aware of.  Unless you’re the only person in the IT department you shouldn’t have to do this alone, but it is your responsibility to start it if no one else has taken that responsibility.

Monitoring and Baselines Presentation


Here’s my full presentation for SQL Saturday #250 in Pittsburgh this past Saturday along with some notes on what I got out of it.

My Lessons

I learned a lot doing this, and I hope people learned a lot from it. For the people on the other side of the podium from me, the main lessons were in what I put in the abstract; they learned mostly about SQL Server. On my side of the podium though, the lessons were far from the same: public speaking, creating presentations, prepping for unknown audiences, and seeing the gratitude of the SQL Server community in person. Anything that was on the abstract I learned slightly better, but that didn’t compare to what wasn’t on it.

Public speaking was always a fear of mine, and I tend to be someone who avoids my fears. However, throwing myself out there at the public as a whole was never my thing either, and I’ve been doing half decent, at least in my own mind. That being said, I decided to go all out and push myself a little further. After all, what are the chances of them picking me to present with all those professional teachers, consultants, and MVPs out there throwing up their abstracts? Best case I could say I tried; worst case I was going to throw myself out of my comfort zone and hope for the best.

They picked me, which I didn’t let myself expect. Everyone who knew I put in my abstract was also told by me that I only had a 50/50 chance at best, which was more optimistic than I really was, but I don’t like people knowing when I’m being dismissive of myself. It turns out I was wrong a lot, and in every case I was wrong I was glad I was wrong.

Being Forced to Get Better

I’m not one to learn things for the sake of learning them, which is why I sucked in school. To really learn things I need a real-life use case, somewhere I’m going to apply it. I never practiced speaking before because I didn’t speak in front of large groups. I didn’t know how to put together a presentation properly because that’s not my thing…well, “wasn’t” is probably a better word now. Just like a couple months ago I wasn’t quite sure how to blog, but I’ve made it past that point.

Like everything in my life, blogging went from unknown to addiction quickly. I’m not sure that I can say the same thing about speaking, but I can see the possibility. SQL Saturday only comes to Pittsburgh once a year and the local user group, which I’ll be joining soon, only meets once a month and has a single speaker. However, I can’t say that I didn’t look up when the SQL Saturdays in Cleveland and Washington DC were. Also, I took my own thoughts on my presentation and the feedback I received and immediately started thinking about how I could make that presentation better.

What’s below is the original, not touched up at all in the last 4 days. In part to show where I went wrong and how I’m going to fix it, and in part because Monday was my first day at a new job. By the way, starting a new job when between jobs is the best and worst thing you could ever do to yourself. You don’t have time to be too nervous about anything because you’re too overworked, but you’re also too overworked. Eh, you can’t win them all.

Well, on to the presentation. What you have here is my script that I talked through and taught myself before the presentation. However, if you were there, you’ll notice this isn’t exactly what I said. I didn’t read it, I presented it. There were no demos (the biggest complaint from the crowd) to keep me from stumping myself, and the PowerPoint was just a whimsical picture for the start of each paragraph to keep me on track, because I knew I’d be nervous and lose my way without a paragraph-by-paragraph guide helping me along.

I must have done at least decent because my reviews came back with two 3s and the rest 4s and 5s out of 5. You’re just asking me to get off topic and start talking about why I hate ratings that are odd numbers, but no, I’m not getting off topic, I’m getting on to my presentation!!!

The Presentation

Download slide deck here.

I’ve been working on databases for over a decade, and most of that without having good standards or monitoring in place. My job was rough and I was much less effective. Now I know what I’m doing and I want to share that knowledge. In this presentation I am going to tell you what I watch and why I watch it. A presentation is a poor format to go through how to do this; having it in writing is much more beneficial to you. That is why I’m pretty much skipping that portion here and diverting you to my blog at SimpleSQLServer.com. Not so much to promote a blog that I lose money on, but instead to give you the resources in the best format.

As a DBA you will hit problems with the performance of your databases. It doesn’t matter if it’s one process or across the board, or if it’s just today or it has always been that way. No matter what it is, the more you know about your servers the easier it’s going to be to fix them.

It’s not easy: there’s no single spot to watch, no single solution to all of your problems, and there are no “normal” values for these counters that you can apply across every server you manage (Note: Thank you, Mike John, you stressed this point a lot). I watch several things on all the important servers, and most of these on every server. The important part is that you watch them continuously, even when you aren’t expecting to use the data. Some parts are cumulative, and you can’t tell what happened during an incident versus what came from overnight maintenance. Other parts are snapshots, and there’s no looking back.

(Note: Now I’m on to the stuff Brian Castle taught me. He’s the best you could hope for in a manager, and last I checked he was still hiring at EDMC in Robinson Township near Pittsburgh, PA)

To me, monitoring and baselining is the same thing. I know others will do a specific baselining process on a specific day and keep that. I feel you lose too much doing that and I watch enough to say that I have a continuous baseline going back for 13 months in most instances. I’m not all that worked up about lucky numbers or anything, it’s just nice to be able to say “that annual process we ran last week ran like this last year”. If you’re superstitious or have OCD, 400 days makes me just as happy.

Traces – Snapshot
Wait Stats – Cumulative – Resets on restart
Blocking – Snapshot
Query Stats – Cumulative – Resets on recompile
Index Stats – Cumulative – Resets on restart
OS Perf Counters – Varies
Database Sizes – Snapshot
Table Sizes – Snapshot

Traces – What long running processes have completed on the server?

You have to be careful, this one can kill your server. I have, on more than one occasion (sorry, Brian), caused a large server to reboot in the middle of the day because I filled the drives writing 40 GB of traces in 5 minutes. That being said, there is a safer way to trace, just don’t do an unfiltered trace catching everything.

I personally feel that absolutely every server should have a trace capturing RPC:Completed and SQL:BatchCompleted events that ran over X seconds in duration. What is X? Well, that depends on the server. I’ve seen the best number for X be as low as 100 ms and as high as 10 seconds. Start high and work your way down. You can add the Statement Completed events to this trace, or do a separate trace for them, if you want more detail.

You can get all of this from Extended Events, except for SQL:BatchCompleted with the text data, and that is a very good alternative. I haven’t made that jump for several reasons, although I would recommend it. First, this level of tracing hasn’t hurt me. Second, I still support several important servers that are still on SQL 2005, and I want to be as consistent as possible so I can provide consistent support.
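
If you do make that jump, and you’re on a reasonably recent version of SQL Server, a minimal Extended Events session along these lines does roughly the same job as the trace. The 5-second duration filter and file sizes are placeholders you’d tune to your environment, and note that duration is measured in microseconds here.

CREATE EVENT SESSION LongRunningCompleted ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
    WHERE duration >= 5000000),   -- 5 seconds, in microseconds
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
    WHERE duration >= 5000000)
ADD TARGET package0.event_file (SET filename = N'LongRunningCompleted'
    , max_file_size = (100)       -- MB per rollover file
    , max_rollover_files = (10))
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);
GO

ALTER EVENT SESSION LongRunningCompleted ON SERVER STATE = START;
GO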

What you’re trying to do here is answer what ran long, get some hints as to why it ran long, and be able to say how often it has been running that way in recent history. A long duration doing very little work means it was waiting on something else, and other monitoring will help solve that problem. A lot of CPU, reads, or writes shows that you may need to look into tuning or statistics. Just keep in mind that reads can be reads of work tables, and writes can be writes to tempdb.

If you do it right you should have 4 or more days’ worth of history, and by doing it right I mean having 5 to 10 rollover files of a reasonable size that you can send off if needed (100 MB at most, they zip well) and capturing over the best duration threshold for your environment.

Wait Stats – What’s slowing you down?

The traces tell you what work was done, and the wait statistics tell you what went on when the query couldn’t actively do its work. This is cumulative over the server, so you can’t get details down to an individual process; however, you can see how much time is wasted and where it’s wasted.

There are almost 500 distinct wait types in 2008 R2, but you don’t need to worry about that. The important thing is to know what your biggest waits are, and you can look them up if you don’t recognize them. In the last slide I have links to a free eBook written by Jonathan Kehayias and Ted Krueger for the Accidental DBA that does an amazing job documenting what you can ignore and the meaning of the big ones you’re most likely to see.

If you’re looking for overall server health then you want to look at waits over a long period of time, specifically the times of day you’d like to see better performance. For incidents, you want to look at what your wait stats are now and compare them to your running baseline. For me, the typical baseline I use is the same timeframe yesterday and 7 days ago. These are actual relevant days that have everything but the changes you put in over the last couple of days, and it’s a true baseline for this specific server. If this server never waits on PAGEIOLATCH, but it’s in your top 3 today, then you instantly have a direction to head in your troubleshooting.
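
The collection itself can be as simple as the sketch below; the table name is just a placeholder. Dump the DMV on a schedule, then diff any two snapshots to see what the server waited on in that window.

-- Hypothetical history table
IF OBJECT_ID('dbo.WaitStatsHistory') IS NULL
    CREATE TABLE dbo.WaitStatsHistory (
        DateAdded smalldatetime NOT NULL
        , wait_type nvarchar(60) NOT NULL
        , waiting_tasks_count bigint NOT NULL
        , wait_time_ms bigint NOT NULL
        , signal_wait_time_ms bigint NOT NULL
        , CONSTRAINT PK_WaitStatsHistory PRIMARY KEY CLUSTERED (DateAdded, wait_type)
    );

INSERT INTO dbo.WaitStatsHistory
SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;  -- skip the wait types that never happened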

Blocking – Who’s in your way?

Locking occurs when one query is using data, and blocking occurs when another query needs to do something that is incompatible with that lock. If you don’t keep an eye on it then you’re looking back at a trace and seeing that a query took 1 second of CPU, did 1,000 reads, and no writes, yet it figured out a way to take 5 minutes in duration. The users complain the app is slow or unusable, and you’re giving them the answer that you’ll try to figure it out next time it happens and THEN try to find out the root cause.

That’s a bad idea. There are three types of users – those who don’t know you because things are running smoothly, those who like you because you can say “I see exactly what happened and I can start working on avoiding reoccurrences”, and those users you keep telling that you’re not sure what happened and you’ll try to get a better idea the next time it happens. I wish I could tell you how to have more users not know your name, but watching blocking is a very easy way to get the users that like you to outnumber the users that hate you.

So, what can you do to tell them you know what just happened? My answer is to capture everything that’s blocking on the server every minute. Sure, there’s going to be a lot you capture that was actually only blocking for 10 ms, and there are going to be things that were blocking for 59 seconds that were timed perfectly so you never saw them happen. The trick is to know that there is no perfect solution, and capture what you can. The DMVs are there to provide all the information, and being able to capture that on the fly, along with having it in a proactive monitoring setup, will make you look amazing to the users. If something blocked for 5 minutes, you have no excuse not to say “This query blocked this query, and this was the head blocker”. Sure, you may have an excuse to say “It’s a vendor database and I passed it on to their support”, but at least you can tell the users something, which is always better than nothing.
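
The capture itself is just a query against the DMVs, run from a SQL Agent job every minute and written to a table; here’s a rough sketch of the guts of it, with the destination table left out.

SELECT CaptureTime = GETDATE()
    , BlockedSession = r.session_id
    , BlockingSession = r.blocking_session_id
    , r.wait_type
    , WaitTimeSec = r.wait_time / 1000
    , DatabaseName = DB_NAME(r.database_id)
    , BlockedSQL = t.text
FROM sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;
-- Head blockers that are sitting idle won't show up in dm_exec_requests,
-- so a fuller version would also capture sys.dm_exec_sessions for the blocking_session_id values.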

Query Stats

The DMV dm_exec_query_stats is used by many DBAs, even if all we realized was that we were running one of the “Top 10 Most Expensive Query” scripts off the internet. The problem is that doing just that is using it wrong. This DMV holds everything from when the query first went into cache and loses that information when the query goes out of the cache, even just for being recompiled. So the big process you ran overnight could still be in cache, leading you down a path that’s actually low priority for you. Meanwhile, that relatively large process that runs every 5 minutes, which just recompiled a couple minutes ago due to stats auto-updating, isn’t even in there. A single snapshot is too volatile to say that you’re getting good numbers out of it.

The answer isn’t to find somewhere else to get this information, it’s to capture this information more often and more intelligently. Capture it once and you have a snapshot of what it looked like, capture it again an hour later and you know what work has been done in that hour. Repeat the process over and over again, constantly adding another hour’s worth of data and saving off the latest snapshot in the process and you have some real information to go off of. This isn’t so much of what happened to be in cache at the time, it’s a pretty good summary of what ran and when it ran. Now you can query in detail, specifically saying “I want to know what queries are doing the most physical reads between 9:00 AM and 5:00 PM on weekdays so I can have the biggest impact when I tune this database.”
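
The capture is nothing fancy, just the DMV tagged with a timestamp as sketched below; in practice you’d insert this into a history table on a schedule and subtract one snapshot’s numbers from the next to get the work done in that hour.

SELECT GETDATE() AS DateAdded
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.creation_time
    , qs.execution_count
    , qs.total_worker_time
    , qs.total_physical_reads
    , qs.total_logical_reads
    , qs.total_logical_writes
    , qs.total_elapsed_time
FROM sys.dm_exec_query_stats qs;
-- Store each capture, then diff consecutive captures (watching for creation_time changes,
-- which mean the plan was recompiled and its counters reset).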

Index Stats

The DMV dm_db_index_usage_stats isn’t too different from dm_exec_query_stats except that it’s cumulative since the time SQL services were started. Looking at a snapshot still leaves you half blind, unable to see what happened before the last time you rebooted the server and unable to tell when an index was used. Personally, seeing when it was used is less important to me than how much it was used long-term. Tracking this can do amazing things for you if you’re really fine-tuning a database.

This DMV is really under-used in my experience. As time goes on with a database the data changes and you add more indexes to make it run faster, but you don’t know if or when the index it used to use is still being used by anything else, so it stays there. Over time that means you have more and more unused indexes, and there’s no solid proof that they aren’t being used without you doing the work to collect that proof. This, in addition to duplicated indexes, adds to the workload for inserts and updates, adds to the workload of index maintenance, adds to the database size, and puts additional strain on your cache and thus your PLE.
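
Collecting it looks something like the sketch below, run per database and stored with a date so you can diff snapshots across restarts; indexes that never show up in the DMV come back with NULL usage, which is exactly what you’re hunting for.

SELECT DatabaseName = DB_NAME()
    , TableName = OBJECT_NAME(i.object_id)
    , IndexName = i.name
    , i.index_id
    , us.user_seeks
    , us.user_scans
    , us.user_lookups
    , us.user_updates
FROM sys.indexes i
    LEFT JOIN sys.dm_db_index_usage_stats us ON us.object_id = i.object_id
        AND us.index_id = i.index_id
        AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0);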

OS Performance Counters

This is one of the most misunderstood DMVs while also being one of the most useful. It gives you critical counters that show how SQL Server is interacting with the hardware, but it does so with several different types of counters that have to be measured differently. There are a couple great posts out on the internet explaining how each one needs to be measured. As for right now, we’re focusing on why you want to watch it and what it will do for you. And wouldn’t you know that I saved the hardest one to describe for last, because each value is different and there will be controversy on which ones you should be watching and which ones are just a waste of your time monitoring.

PLE is one where there’s no controversy about whether you should watch it. This is the average age of the pages in cache, and a good measurement of when you’re doing too many physical reads. Peeking in on this from time to time is good; you can make sure it’s high enough. However, watching this will let you know when it’s dropping and help you dive into why it’s dropping. If you’re also running a trace you can see what did a lot of reads in that timeframe, and if you’re capturing Query Stats then you can find out which one of your large read queries is doing all the physical reads that destroy your PLE.

Target and Total Server Memory are also great to watch, even though they rarely change after the server is back up to speed following a restart of SQL services. Target memory is how much SQL Server would like to have, and Total memory is how much it actually does have. If target memory drops then there’s OS pressure you need to worry about. Also, while Total memory is still increasing, the server is still filling the cache and PLE can pretty much be ignored. Of course it’s low; everything in there is new.

Page Lookups, Page Reads, Page Writes and Lazy Writes give you a better idea of how SQL Server is interacting with your cache and disks. If you want a real picture of what SQL Server is doing behind the scenes, this is much more useful than glancing at cache hit ratio.

Deadlocks and Memory Grants Pending are two things you’d like to always see at zero. That may be out of the question with deadlocks, but if you’re getting above zero on memory grants then you need to find out when that happened and everything that was running at that time.

SQL Compilations and Recompilations are hidden CPU hogs. They’re never going to be at zero because very few servers have everything stay in cache forever, ad-hoc code runs, and several other reasons. However, when one of these counters jumps up, you may have a hidden cause. For instance, if a piece of code that runs every minute or more was written in a way that it can’t be stored in cache, these numbers will be noticeably higher. When this happens, your CPU will be noticeably more stressed with no other indicators as to why, and the query won’t even show up in your Query Stats, since that only reflects what’s currently in cache.
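
Here’s a rough sketch of pulling a handful of these counters from sys.dm_os_performance_counters. Keep in mind that the “per second” counters in this DMV are cumulative numbers, so you have to store snapshots and diff them to get a real rate.

SELECT RTRIM(object_name) AS object_name
    , RTRIM(counter_name) AS counter_name
    , RTRIM(instance_name) AS instance_name
    , cntr_value
FROM sys.dm_os_performance_counters
WHERE (object_name LIKE '%Buffer Manager%'
        AND counter_name IN ('Page life expectancy', 'Page lookups/sec', 'Page reads/sec', 'Page writes/sec', 'Lazy writes/sec'))
    OR (object_name LIKE '%Memory Manager%'
        AND counter_name IN ('Target Server Memory (KB)', 'Total Server Memory (KB)', 'Memory Grants Pending'))
    OR (object_name LIKE '%SQL Statistics%'
        AND counter_name IN ('SQL Compilations/sec', 'SQL Re-Compilations/sec'))
    OR (object_name LIKE '%Locks%'
        AND counter_name = 'Number of Deadlocks/sec');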

Database Sizes

This seems simple, and it is. Most people know how to find the sizes of their databases, but do you track it? If not, you should. Some questions you should be able to answer readily are how fast your databases are growing, when will you run out of space, and is the steady decline in PLE justified by the growth of the database. Getting into more details, you can touch on when does the database grow such as an accounting app jumping in size every April, or answering if the data is growing steadily or exponentially. These are important things to know when it comes time to budget for new servers and disk space.

On the other side of things, you also have information to push back on application teams on how much space they’re using and if that’s really necessary. A lot of times you’ll find that they’re surprised by their growth to the point that you have to show them the numbers for them to believe it. It’s not uncommon for the response from this to be a cleanup project that helps keep the databases smaller and running faster, which is a big goal being accomplished.

Table Sizes

This is just building off of watching your database sizes in more detail. When a database starts filling up, and it’s filling up quick, it’s good to tell an application team what table is doing it. In my current environment it makes sense to watch every table that is over .5% of the database size AND over 100 MB. This means that some databases we don’t watch anything and other databases we watch about the 20 biggest tables. The biggest point is that we’re not trying to watch everything while watching anything big enough to make a difference to us.

If a database is filling a lot faster than normal then a lot of the time there’s a process that isn’t running like it should. The biggest table in the database may be a rather static value, but the third biggest table in the database wasn’t even on your radar two weeks ago. The app teams love it when you can tell them that not only is the database growing out of control, but we also see the growth in table X which has been growing at 200 MB per day starting on the first Saturday of last month. By the way, wasn’t last Saturday the day you changed a couple procs around or implemented an upgrade?

Now you just graduated from “Hey, there’s a problem” to also include “and here’s a huge lead to finding the root cause.” It’s rare to use this information, but it’s lightweight to capture for something that gives you a heavyweight appearance when you reference this knowledge to other teams.

My Critiques

I’d love to hear what you think about this presentation. Before we get to that, here are the complaints from the harshest person in the room when I was presenting…

I didn’t do enough to say what each thing was. Although I marked the presentation as intermediate, you don’t know who’s going to be there. In this case, I knew a couple people in the crowd, and they ranged from data analyst to database manager. It should have been presented in more of “If you’re not ready for intermediate then you have everything you need, but it will take some effort to keep up”.

My slides were lacking. I stick with the idea that there should be few words on the screen, but I took it too far. I admitted above that the slides were there more to keep me on track than to help the audience, which I probably needed for my first public speaking venue. However, next time I’ll be less nervous and more prepared to gear it toward the audience. Instead of random pictures, I should have more graphs and data that I can talk through to give the audience visuals.

Nothing was given to the audience, and there should have been something. To be fair, none of the presentations that I know of handed anything to the audience. However, I set my own bar, and I feel that people would like a sheet of paper with the outline on it with links to online sources. My blog, of course, because it shows how to grab everything. Also, an outside link for each item. This gives a physical reminder to turn this knowledge into action.

There was no demo, and that was the audience’s biggest complaint. While I’m still not sure that I would do a live demo on my second go-around, I’m going to hold myself to the remark above about my slides having graphs and data. That would also drag me away from the podium to talk through the slides, making it a more dynamic and interesting presentation; that would really help the audience take things in.

Your Turn

Now I need you to do two things. Putting this information to work in your environment is what will help you the most. What will help me the most is you commenting on this post, both before and after you implement any of this, so you can help me become a better presenter.

Thank you for taking the time to read this!

Monitoring Database and Table Sizes


Trending database and table sizes helps give you an idea of what to expect, and, sometimes, points out problems and their root causes. I even go so far as to monitor the disk space, file growth rates, and have a report going out telling me that “according to your current growth trends, you’ll run out of space on this day”. That last part will have to be another post since getting disk space can be kinda tricky depending on how your disks are set up; you need to go outside of SQL Server to grab the size of mountpoints if you use them. Don’t worry about mountpoints right now, if you don’t know the term that usually means that you can swap out the word “drive” and have everything I say be accurate for your environment.

First, this is lightweight stuff, there’s really no reason to capture it more than once a day, and it will only add up to several MB for a year’s worth of data. You’re not going to kill yourself doing this, you’re not going to stare at the data every day taking up your time, but you’re going to look like a heavyweight when you put this into action and reference this data to others.

Second, I do have other similar posts doing snapshots of this same info for Table and File sizes. It’s basically the same thing just different in how you’re using it and if you’ll have a historical view of it.

Database sizes

Let’s start with the databases because, well, they’re bigger. I capture it all: the database name, logical file name, file type (rows, logs, etc.), filegroup, allocated space, used space, max size, growth rate, whether it’s percent growth, and the drive/mountpoint it’s on. Other than dividing the appropriate numbers by 1024 twice to store all my sizes in MB, everything gets stored in native format. The reason for that is you have raw data in a format easy for you to verify and easy for any outsider (consultant, new employee, etc.) to understand. I may make some of you cringe because I don’t normalize this; it’s small data and not worth the effort, especially with page compression (see Brad McGehee’s post on compression) turned on.

What I like doing is having automated reports with logic behind it saying “if drive size is X, file growth is Y, then you can grow to a potential size of Z”, follow that up with “the used space in this database has grown X in the last 90 days and Y in the last 30 days, so according to those trends I’ll hit the potential size in Z days”. One of my favorite reports takes that last Z as a parameter and tells me anything that will run out of space before a specific number of days. I get that report daily for anything that will run out of space in the next two weeks and a separate instance of that report monthly telling me what will run out of space in the next 100 days. For me to run out of space without having a couple automated emails sitting in my inbox yelling at me saying “See, I told you that would happen!!!”, a database has to do something crazy and out of character. The only part of that last sentence that doesn’t really happen is the automated email yelling, those voices don’t exist outside of my head (do they?).

Even without monitoring disk space, you can predict autogrowths. This is useful if you’re trying to prevent autogrowth altogether, but with instant file initialization this has never been a spot I’ve chosen to spend my time on. However, I do care about fragmentation, and I have a report that will tell me that if we continue growing according to our 30 or 90 day trends then the data files will have to grow X times. If X is higher than 4 and filegrowth is below 2 GB chunks, it shows up on my report and I change the filegrowth to make it more reasonable. If you allow percent filegrowths in your environment (just say no), or even if one sneaks through, things get complex and you have to stroll through financial websites to find the calculations for compounding interest to figure out how many times the files will grow.

-- Number of percent-based autogrowths needed to absorb the projected monthly growth
Cast(CEILING(
        LOG(
            (Cast((ProjectedGrowthInAMonthMB - FileGroupFreeMB) as Dec(20,4)) + Cast(FileSizeMB as Dec(20,4)))
            / Cast(FileSizeMB as Dec(20,4))
        )
        / LOG(1 + (Cast(Growth as Dec(20,4)) / 100))
    ) AS Int)

Something inside that database is growing, but what is it?

Now if a database is growing, the first thing someone will ask you is “why is it growing”. That’s an answer I can’t give you, but “where is it growing” I can do. This is why I’m also tracking the table sizes in a database. Be fair to yourself here and admit it up front that you don’t care if a table changes size radically if the largest size is still insignificant to that database. So make up rules on “I want to watch tables that are bigger than…” and only store the stats for those tables to keep your little collections from being the source of your size issues. For me, that means that a table has to be both over 100MB in total size including indexes and over 0.5% of the size of the database. I’ve used these sizes for a while now and have never had an issue on either side; no one needed info I didn’t have for space trending issues and the collection time and space I used never caused any complaints. This isn’t saying that someone won’t ask “did that table have 10 or 11 records this time last month”, but I’m not willing to collect enough to answer every little question.

If a database is growing at an alarming rate it’s easy to assume the largest table is causing it. That’s not always the case, and it’s not too uncommon for a database change to either create a new table that intentionally grows quickly or accidentally cause an old table to stop cleaning itself up. No matter what the reason is, if you want to see how a database is growing, look at your table stats documenting the growth. Developers always listen to you better when you can tell them “this is what happened and here’s how I know,” while the words “I think” are viewed as instant discrediting.

How am I supposed to do this stuff?

You’ll need somewhere to store all the info. I like to store two weeks of info on the local server and pull everything to a central server nightly, where data is held for a full 13 months. However, to keep me under control, let’s just focus on getting everything to the local server in the following tables.

CREATE TABLE [Drives](
    [DateAdded] [smalldatetime] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
    [CapacityMB] [int] NULL,
    [FreeSpaceMB] [int] NULL,
 CONSTRAINT [DBInven_Drives_DriveLetter_MountPoint_DateAdded] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DriveLetter] ASC
    --, [MountPoint] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Files](
    [DateAdded] [smalldatetime] NOT NULL,
    [DatabaseName] [sysname] NOT NULL,
    [FileName] [nvarchar](128) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [FileGroup] [sysname] NULL,
    [SizeMB] [int] NULL,
    [UsedMB] [int] NULL,
    [FreeMB] [int] NULL,
    [MaxSizeMB] [int] NULL,
    [Growth] [int] NULL,
    [IsPercentGrowth] [bit] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
 CONSTRAINT [DBInven_Files_DatabaseID_FileID_DateAdded_U_C_IX] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [FileName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Tables](
    [DateAdded] [datetime] NOT NULL,
    [DatabaseName] [nvarchar](128) NOT NULL,
    [SchemaName] [nvarchar](128) NOT NULL,
    [TableName] [nvarchar](128) NOT NULL,
    [RowCounts] [bigint] NULL,
    [AllocatedMB] [int] NULL,
    [DataSizeMB] [int] NULL,
    [IndexSizeMB] [int] NULL,
    [PercentOfDB] [decimal](5, 2) NULL,
 CONSTRAINT [PK_Tables] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [SchemaName] ASC,
    [TableName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If you use mountpoints, and if you call them “mountpoint” followed by an incrementing digit, and you never have more than can fit into a tinyint, uncommenting those pieces of the tables will do wonders for you. In case you didn’t guess, they’re commented out because the people who do use them probably don’t use them that way.

DECLARE @RunTime SmallDateTime
DECLARE @SQL NVarChar(max)

SET @RunTime = (SELECT Max(DateAdded) FROM Perf.DBInven.Drives)

SET @SQL = N'use [?]; 
    INSERT INTO Perf..Files 
    select ''' + cast(@RunTime as nvarchar(50)) + ''' 
        , DatabaseName = db_name()
        , FileName = f.Name
        , TypeDesc = f.Type_Desc
        , FileGroup = fg.Name
        , SizeMB = ((f.size*8/1024)) 
        , UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
        , FreeMB = ((f.size-FileProperty(f.name, ''SpaceUsed''))*8/1024) 
        , MaxSizeMB = Case f.Max_Size
            WHEN -1 THEN -1
            ELSE f.Max_Size / 128
            END
        , f.Growth
        , f.Is_Percent_Growth
        , DriveLetter = left(f.physical_name,1) 
        /*, MountPoint = CASE substring(f.physical_name,4,10)
            WHEN ''MountPoint'' THEN 
                Case isnumeric(substring(f.physical_name,14,4)) --MountPoint >= 1000
                WHEN 1 THEN substring(f.physical_name,14,4)
                ELSE Case isnumeric(substring(f.physical_name,14,3)) --MountPoint >= 100
                    WHEN 1 THEN substring(f.physical_name,14,3)
                    ELSE Case isnumeric(substring(f.physical_name,14,2)) --MountPoint >= 10
                        WHEN 1 THEN substring(f.physical_name,14,2)
                        ELSE substring(f.physical_name,14,1) -- MountPoint single digit
                        END
                    END
                END
            ELSE Cast(0 as varchar(10)) --Root drive is MountPoint 0 
            END*/
    from sys.database_files f
        LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id
    '

exec sp_MSforeachdb @SQL

And now to grab the tables. Keep in mind that I specifically only want to grab stuff over 0.5% of the database size AND over 100 MB. You’ll see that towards the end of the code.

DECLARE @SQL NVarChar(max)
DECLARE @DateAdded SmallDateTime

SELECT @DateAdded = MAX(DateAdded) FROM Perf..Drives 

SET @SQL = ' Use [?];
    DECLARE @DatabaseSize Float

    IF OBJECT_ID(''tempdb..#TableSizes'') IS NOT NULL BEGIN
          DROP TABLE #TableSizes
    END

    SELECT DatabaseName = DB_NAME()
    , TableName = a2.name
    , SchemaName = a3.name
    , RowCounts = a1.rows
    , TableSizeMB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , DataSizeMB = a1.data / 128
    , IndexSizeMB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
INTO #TableSizes
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> ''S'' and a2.type <> ''IT''

    SELECT @DatabaseSize = Sum(TableSizeMB) FROM #TableSizes

    INSERT INTO Perf..[Tables]
    SELECT ''' + Cast(@DateAdded as VarChar(50)) + '''
        , DatabaseName
        , SchemaName
        , TableName
        , RowCounts
        , TableSizeMB
        , DataSizeMB
        , IndexSizeMB
        , (TableSizeMB/@DatabaseSize) * 100 
    FROM #TableSizes
    WHERE TableSizeMB > .005 * @DatabaseSize 
        AND TableSizeMB > 100

    DROP TABLE #TableSizes'

exec sp_MSforeachdb @SQL

So this leaves us with the size of the drives. This isn’t an easy subject. You can get everything you need through SQL Server if you don’t use mountpoints, but with mountpoints you’re left with either PowerShell or VBScript querying WMI to capture it. Another post, another day. I’d actually prefer to rewrite my VBScript as PowerShell, both to learn PowerShell and to put a better solution out here for you.
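
That said, if you’re on a build that has sys.dm_os_volume_stats (SQL Server 2008 R2 SP1 and later), here’s a minimal sketch of grabbing capacity and free space without leaving T-SQL. It only sees volumes that host database files, and it rolls everything up by drive letter to line up with the Drives table above, so treat it as a starting point rather than the full mountpoint-aware solution.

DECLARE @RunTime SmallDateTime
SET @RunTime = GETDATE()

--One row per drive letter that holds at least one database file
INSERT INTO Perf..Drives (DateAdded, DriveLetter, CapacityMB, FreeSpaceMB)
SELECT DateAdded    = @RunTime
    , DriveLetter   = LEFT(vs.volume_mount_point, 1)
    , CapacityMB    = MAX(vs.total_bytes / 1048576)
    , FreeSpaceMB   = MAX(vs.available_bytes / 1048576)
FROM sys.master_files mf
    CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) vs
GROUP BY LEFT(vs.volume_mount_point, 1)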

This was the last post I needed to prep for my Monitoring and Baselines presentation for SQL Saturday, where I talk about what you want to watch and why. Since the presentation doesn’t get into how to watch everything, I made sure that everything was on my blog ahead of time.

Why Blog?


The main reason I started blogging was a community thing. I leeched from the community to get to where I am in my career, to the point that I felt the community was owed something in return. Everyone can give back in their own way: some answer forum questions, some blog, and some people find their own special way. Trolling for forum questions I could answer that weren’t already answered adequately wasn’t for me; nothing against those folks, it’s just my preference. So I decided that I was going to put things in my own words.

My words were on the tip of my tongue, but there was a road block. Please, don’t laugh at me too much, but I didn’t know how to get my words out there. All these people had cool blogs where they programmed ways for people to leave comments, they programmed this odd little word cloud thing for links to keywords, WOW…they programmed ways to have keywords. I’m out of my league, I’m a SQL nerd, I don’t do this stuff, I can’t do this stuff. When I’m not too busy being bitter, it’s actually quite humorous how I took the long way around to discover WordPress. Those “geniuses” out there knew something I didn’t, and the thing they knew that I didn’t wasn’t how to be a master programmer of cool toys. Man, I feel like a fool!

I’m not selling WordPress here; it’s what I use, and other platforms are comparable. I know that now. The important thing to know is that if you can write an email, you can write a blog. The only trick is copy/pasting code from your editor to WordPress, and I found a cool tool to help with that. There, my secrets are out; I just told you how to bypass all the big bumps in the road that looked like brick walls to me.

WAIT!!! It’s not that simple. I started and it wasn’t what I expected; same as everything in life, I guess. Blogging wasn’t a little thing you do on the side and then carry on with your life. It becomes a part of your life. 10 people viewed this stuff, and this is stuff that would have revolutionized my career 10 years ago if I had read it back then. How can I get more viewers? I’m obsessed with view counts, I want this message to go EVERYWHERE!!! How can I write it better? I want the message to mean something and be something people want to read. How can I get myself to stop writing in spurts with 4 posts this week, no posts that month?

I hunted down my favorite blogger and begged for help even though he didn’t even know my name. He was quick to give advice, telling me to search his site using the following Google searches: “site:brentozar.com blogging”, “site:brentozar.com writing”, and “site:brentozar.com presenting”. AMAZING and SCARY all at the same time. You’ll get information ranging from starting blogs to who your target audience really is (and shooting yourself) to the 51 steps to make an awesome presentation. Some people show you the rabbit hole and let you go as deep as you want. I felt like I hit the bottom of the bottomless rabbit hole with a solid thud. Brent, I’m not sure if you’re reading this or not, but you’re the only person out there without a college degree who deserves a doctorate in communications on top of the Microsoft MVP and MCM titles you’ve earned. This is a cross between a “Thank You” and a “Why Me???”. More of a thank you, though. You taught me a lot.

(Image: blogging advice from Brent Ozar on Twitter)

Don’t go thinking that you’re going to hear Brent’s point of view restated here, though. I enjoy what he has out there, but my point of view is my own. For instance, I just received an automated email from his group saying “But if you want to be a professional DBA, you should leave the developing to the developers.” in reference to creating tools to monitor the servers, which almost left me with the impression that I shouldn’t reinvent the wheel, while my blog is largely a blueprint for a wheel. He continues with “I’m not saying you shouldn’t learn the dynamic management views (DMVs), how to dig through them to find the evidence you need, and how to turn that data into actions. Hopefully, I’m going to teach you how to do a lot of that over the course of the next six months.”, so maybe I’m mistaken. You know what, I’m off subject now and you really just need to go to his free email-based Accidental DBA class and make up your own mind.

Back on topic now… The blog is now syndicated, so it’s showing up on SQLServerCentral.com and ToadWorld.com. Moreover, about 1/4 of my posts are highlighted by SQLServerCentral.com’s emails that go out to thousands of people and make my views go through the roof. At this point, on September 3rd, I have already had more views on my blog for September than I did for the entire month of July! I’m also writing ahead of time, scheduling posts to go out at 4:00 AM my time (ET) on the second business day of each work week (Wednesday for this post), giving me time to review my work while also providing some consistency and a scheduled commitment to keep myself in line. So many things have changed, all for the better, and I’ve only been at this for about 4 months by the time this is scheduled to go public.

It’s still not that simple, it’s not what I expected on so many other levels as well. I was just trying to give back to a community that gave to me. That’s part of it still, look at how that small part just blends into the full picture:

  • Community
    • I took a lot, giving a little is the least I can do
  • Expertise
    • I’m learning the little details to be able to explain it to others without misleading them.
    • If you can’t explain it simply, you don’t know it well enough. – Albert Einstein
      • This is my blog’s namesake, in case you were wondering.
  • Confidence
    • What I’m writing is being read by and used by the community. People look up to me, even if that’s only in my own mind.
    • Learning it well enough to teach makes you really learn it and be more confident.
  • Credibility
    • It is on my resume and employers LOVE it.
      • I’m excited about the technology I use at work enough to spend my free time on it.
      • There are examples of my work beyond the one-liner on my resume saying “I know how to watch for performance issues.”
      • You can feel confident giving me any employee and know that I can and will teach them.
      • Careful here, this means you have to write even more betterer to not make fool you out of. :-)
  • Warm Fuzzies
    • I know if I had this early in my career, I would have been down the rabbit hole of system tables and DMVs years ago. It would have changed things for me, and I can only imagine that I’ve had that effect on one or two people.
  • Vengeance
    • Once you know about the DMVs, there is no going back. You took the wrong pill and you’ll never wake up knowing that the world was what you once thought it was. Welcome to the Matrix!
  • Obsession
    • Ok, it couldn’t all be good. I want more read counts, I want the message out there more, I keep looking for read counts, I keep looking to better myself, I keep looking to better my posts, aaaahhhhhh, I’m trapped, help me!!!!

Also, without this, I wouldn’t have the confidence and background to say that I’m ready to be a presenter. However, here I am putting the last details into my part of SQL Saturday #250! See an outline of my presentation right here on my blog.

SQL Server Central highlighted my posts on weeks 20, 21, 25, 26, and 34. Steady growth otherwise.

So, here I am, 4 months and 10,000 reads (WordPress + SQLServerCentral Syndication) later, and I’ve learned a thing or two.  By no means am I perfect; this is a continual learning experience.  For all those who have something to say out there, stop saying you have something to say and start saying it.  Syndicate it if you want others to read it.  The readers will come with time, just keep writing and strive for something worth a syndicator highlighting in their newsletters.  Who knows, maybe I’ll even make it to the SQL Rockstar Rankings someday. Read about writing to improve yourself.  Then, finally, be ready for a better you.

If you do blog, what has it meant to you? If you don’t, why not? I’d love to hear more people’s take on this as everyone has a unique experience and thought process.

Query Stats


DBAs rarely use the full potential of sys.dm_exec_query_stats. It’s common to see queries that pull the most expensive queries according to any of the stats contained within the current cache, which is great to see. However, if you grab snapshots of this information the proper way then you can get more detail that will really make a difference in your job. Capturing this every 4 hours is probably enough, and here’s what it’s going to do for you.

First, let’s use it the way most DBAs do and look at the most expensive queries according to a specific metric. It works the same for them all, so we’ll just say we’re interested in physical reads in this case because our PLE isn’t what we’re hoping it could be. You’ve probably seen queries that hit sys.dm_exec_query_stats directly to give you this, similar to the example on MSDN for this DMV. That will give you everything that’s in the cache right now, which could contain queries that recompile every hour mixed in with queries that have remained in the cache since the server was rebooted two weeks ago. This means that when you’re querying to see what you can do to improve your PLE during business hours, you’re getting anything that happens to be there now, not what ran in the time window you’re looking to improve.

Now look at it from the point of view that you capture this information every 4 hours.  You can get the total number of physical reads grouped by sql_handle where the interval_end time is in the last month, on weekdays, between 9:00 AM and 5:00 PM.  This is both a more complete and more accurate picture of what you care about, and you’re more likely to be focusing on the queries you should be tuning from the start.
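
Here’s a sketch of that rolling, business-hours view. It assumes the QueryStats and QueryText tables created later in this post, and the weekday math assumes the default DATEFIRST setting.

SELECT qt.DatabaseName
    , qt.QueryText
    , Executions    = SUM(qs.execution_count)
    , PhysicalReads = SUM(qs.total_physical_reads)
FROM QueryStats qs
    INNER JOIN QueryText qt ON qs.sql_handle = qt.sql_handle
WHERE qs.interval_end > DATEADD(MONTH, -1, GETDATE())
    AND DATEPART(WEEKDAY, qs.interval_end) BETWEEN 2 AND 6 --Monday through Friday
    AND DATEPART(HOUR, qs.interval_end) BETWEEN 9 AND 16 --Intervals ending 9:00 AM through 4:59 PM
GROUP BY qt.DatabaseName
    , qt.QueryText
ORDER BY PhysicalReads DESC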

Once you do start tuning, you want to know what effect you really had, and not just rely on the couple sets of test parameters you used in your NonProd environments.  A couple days after your changes made it to production, it’s a good idea to make sure you had the desired effect.  Query the data you have saved off for the last three days and for the same days last week.

What I’ve gone over so far just takes the traditional use of this DMV to a new level, but there’s more functionality here now that the data is being saved off. It’s extremely useful in incident response. Before, you could see if anything was hogging resources as seen in my Recently Recompiled Resource Hogs post, but now you can take it a step further and verify whether what’s running long now has always run that way or if something is different. Also, with the plan_handles saved off you can even see if it’s getting a different execution plan, which will help you determine whether this was a change in data or a bad execution plan.
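
As a rough sketch of that plan check, since the plan_handle is saved per interval you can look for statements that have picked up more than one plan recently. Again, this assumes the QueryStats table defined below.

SELECT qs.sql_handle
    , qs.statement_start_offset
    , PlanCount = COUNT(DISTINCT qs.plan_handle)
FROM QueryStats qs
WHERE qs.interval_end > DATEADD(DAY, -3, GETDATE())
GROUP BY qs.sql_handle
    , qs.statement_start_offset
HAVING COUNT(DISTINCT qs.plan_handle) > 1
ORDER BY PlanCount DESC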

Not to shoot myself in the foot, but this isn’t perfect.  If you capture this data every 4 hours starting at 12:00 and it recompiles at 3:59 then you’ll only capture one minute for that interval.  Also, if you’re dealing with a query that can’t be cached then it won’t be found here at all; this is why you’re also running traces to find the big queries and watching sys.dm_os_performance_counters to know when there are a lot of compiles and recompiles.

To start capturing this, here are the tables you’ll need to capture everything.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

/*
IF OBJECT_ID('QueryStats') IS NOT NULL BEGIN
    DROP TABLE QueryStats
END

IF OBJECT_ID('QueryStatsLast') IS NOT NULL BEGIN
    DROP TABLE QueryStatsLast
END

IF OBJECT_ID('QueryText') IS NOT NULL BEGIN
    DROP TABLE QueryText
END
--*/

CREATE TABLE [dbo].[QueryStats](
    [interval_start] [smalldatetime] NOT NULL,
    [interval_end] [smalldatetime] NOT NULL,
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [execution_count] [int] NOT NULL,
    [total_elapsed_time_ms] [int] NOT NULL,
    [min_elapsed_time_ms] [int] NOT NULL,
    [max_elapsed_time_ms] [int] NOT NULL,
    [total_worker_time_ms] [int] NOT NULL,
    [min_worker_time_ms] [int] NOT NULL,
    [max_worker_time_ms] [int] NOT NULL,
    [total_logical_reads] [int] NOT NULL,
    [min_logical_reads] [int] NOT NULL,
    [max_logical_reads] [int] NOT NULL,
    [total_physical_reads] [int] NOT NULL,
    [min_physical_reads] [int] NOT NULL,
    [max_physical_reads] [int] NOT NULL,
    [total_logical_writes] [int] NOT NULL,
    [min_logical_writes] [int] NOT NULL,
    [max_logical_writes] [int] NOT NULL,
    [creation_time] [smalldatetime] NOT NULL,
    [last_execution_time] [smalldatetime] NOT NULL
) ON [PRIMARY]
GO

--Wide is bad if you have a lot of nonclustered indexes.  
--This is a high-writes, low-reads table, so a single clustered index is probably all I want.
--With Interval_End being the first key field, they're inserted in order and eliminate mass fragmentation
  -- as well as bad write performance.
CREATE UNIQUE CLUSTERED INDEX IX_QueryStats_intervalend_sqlhandle_statementstartoffset_planhandle_U_C ON QueryStats
(
    interval_end
    , sql_handle
    , statement_start_offset
    , plan_handle
)

CREATE TABLE [dbo].[QueryStatsLast](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryStatsLast_sqlhandle_planhandle_statementstartoffset_U_C] ON [dbo].[QueryStatsLast] 
(
    [sql_handle] ASC,
    [plan_handle] ASC,
    [statement_start_offset] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

CREATE TABLE [dbo].[QueryText](
    [sql_handle] [varbinary](64) NOT NULL,
    [QueryText] [nvarchar](max) NOT NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [objtype] [nvarchar](20) NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryText_sqlhandle_U_C] ON [dbo].[QueryText] 
(
    [sql_handle] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Now that the tables are created, here’s how I populate them. Set this up to run once every 4 hours to keep the data small during typical monitoring, and kick it up to once an hour or more when you need to pinpoint an issue.

DECLARE @interval_start smalldatetime
    , @interval_end smalldatetime

IF OBJECT_ID('tempdb..#QS') IS NOT NULL BEGIN
    DROP TABLE #QS 
END 

CREATE TABLE [dbo].[#QS](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) 
/*
--This sounded like a great idea, but it just slowed it down several seconds.
CREATE UNIQUE CLUSTERED INDEX TempQS_Cluster ON #QS  
(
    sql_handle
    , plan_handle
    , statement_start_offset
)
*/

--The Cached Plans Object Type is in here in case you want to treat ad-hoc or prepared statements differently
INSERT INTO #QS 
SELECT qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , cp.objtype
    , qs.execution_count 
    , total_elapsed_time_ms = qs.total_elapsed_time/1000
    , min_elapsed_time_ms = qs.min_elapsed_time/1000
    , max_elapsed_time_ms = qs.max_elapsed_time/1000 
    , total_worker_time_ms = qs.total_worker_time/1000 
    , min_worker_time_ms = qs.min_worker_time/1000
    , max_worker_time_ms = qs.max_worker_time/1000
    , qs.total_logical_reads
    , qs.min_logical_reads
    , qs.max_logical_reads
    , qs.total_physical_reads
    , qs.min_physical_reads 
    , qs.max_physical_reads
    , qs.total_logical_writes 
    , qs.min_logical_writes 
    , qs.max_logical_writes
    , qs.creation_time
    , qs.last_execution_time
    , DateAdded = getDate()
FROM sys.dm_exec_query_stats AS qs
    INNER JOIN sys.dm_exec_cached_plans cp on qs.plan_handle = cp.plan_handle
--WHERE cp.objtype NOT IN ('Adhoc')

INSERT INTO QueryText (sql_handle, QueryText, DatabaseName, objtype)
SELECT QS.sql_handle
    , QueryText = qt.text
    , DatabaseName = DB_NAME(max(qt.dbid))
    , max(QS.objtype)
FROM (SELECT #QS.sql_handle
        , #QS.objtype  
        FROM #QS
            LEFT JOIN QueryText QST ON #QS.sql_handle = QST.sql_handle
        WHERE QST.sql_handle IS NULL) QS
    CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) qt
GROUP BY QS.sql_handle
    , qt.text

--All the DateAdded values are the same within each table, so TOP 1 is safe here
SELECT TOP 1 @interval_start = dateadded FROM QueryStatsLast
SELECT TOP 1 @interval_end = dateadded FROM #QS

IF @interval_start IS NULL BEGIN
    SELECT @interval_start = create_date
    FROM sys.databases
    WHERE name = 'tempdb'
END

INSERT INTO QueryStats (interval_start
    , interval_end
    , sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time)
SELECT @interval_start
    , @interval_end
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.execution_count - ISNULL(qsl.execution_count, 0)
    , qs.total_elapsed_time_ms - ISNULL(qsl.total_elapsed_time_ms, 0)
    , qs.min_elapsed_time_ms 
    , qs.max_elapsed_time_ms 
    , qs.total_worker_time_ms - ISNULL(qsl.total_worker_time_ms, 0)
    , qs.min_worker_time_ms
    , qs.max_worker_time_ms 
    , qs.total_logical_reads - ISNULL(qsl.total_logical_reads, 0)
    , qs.min_logical_reads 
    , qs.max_logical_reads 
    , qs.total_physical_reads - ISNULL(qsl.total_physical_reads, 0)
    , qs.min_physical_reads 
    , qs.max_physical_reads 
    , qs.total_logical_writes - ISNULL(qsl.total_logical_writes, 0)
    , qs.min_logical_writes 
    , qs.max_logical_writes 
    , qs.creation_time
    , qs.last_execution_time
FROM #QS qs
    LEFT OUTER JOIN QueryStatsLast qsl ON qs.sql_handle = qsl.sql_handle
        AND qs.plan_handle = qsl.plan_handle
        AND qs.statement_start_offset = qsl.statement_start_offset
        AND qs.creation_time = qsl.creation_time 
WHERE qs.execution_count - ISNULL(qsl.execution_count, 0) > 0 --Only keep records for when it was executed in that interval

TRUNCATE TABLE QueryStatsLast

INSERT INTO QueryStatsLast
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , objtype
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time
    , DateAdded
FROM #QS

SQL Saturday 250 – Pittsburgh – Advice for My Presentation on Monitoring and Baselines


I’ll be presenting at SQL Saturday 250 in Pittsburgh on Baselining and Monitoring. This is my first public presentation, and I could use some advice to make sure my presentation is the best it can possibly be for both myself and the audience.

First, my presentation. Anyone who’s been following my blog won’t be surprised by anything they see. I’m going to hit all of the statistics I watch and why I watch them. The whole thing is a pep rally for what, why, and how it will change your job. How to do it will be completely skipped for the reasons below.

  • Brent Ozar told me to in point 13.
  • Trying to take notes on the how would be too intense for any audience member.
  • What and why are more exciting; life’s too short for boring!!!
  • I already did the how; it’s on my blog, and I’ll tell the audience in both my opening and closing statements that all of the hows are already listed here.

The points I’m hitting are:

  • Traces (with mention of Extended Events)
  • Wait Stats (sys.dm_os_wait_stats)
  • Blocking (sys.dm_os_waiting_tasks with LCK% type)
  • Query Stats (sys.dm_exec_query_stats)
  • Index Stats (sys.dm_db_index_usage_stats)
  • OS Perf Counters (sys.dm_os_performance_counters)

In all of these areas I’ll go through why I watch them and how monitoring them over time, instead of just looking at the values you can get at any single point in time, has changed my life as a DBA. After going through each section I’d like to ask the audience for questions right then, in the middle of the presentation, so the topic of the questions is fresh in everyone’s minds and more likely to be taken to heart. How you capture these will be completely documented on my blog, to the point that you can practically copy/paste the code to develop a customized monitoring setup on your servers from one source.

Baselining is built directly into my monitoring and how I use my results. Instead of having a static baseline that I took on a certain day, I’m a huge fan of comparing today’s incidents to yesterday and last week with a rolling baseline. I find it to be an impressive way to handle things.

The slide deck still needs a bit more work, but it’s literally just going to be a couple words (Steve Jobs style!!!) with the information coming from what I say instead of the audience’s attention being divided between me and the screen behind me. They’ll see the topic on the screen with a picture of what it means to me while hearing my views.

This presentation is almost complete on my end, but I’m flexible enough to say I’m wrong on anything and make changes. Some questions I have for you are:

  • Is this a presentation you would enjoy seeing? Why or why not?
  • Do you feel this is a complete list of items to monitor? What would you change?
  • Are questions in the middle of a presentation a good idea?
  • How do I make this public, both assisting the audience as much as possible while keeping in mind my rights and possible future use of this presentation? I would consider posting the full thing on the internet, even a recording of me doing it, but I don’t want to think about it a year from now and think that I just shot myself in the foot doing it. I don’t know what to expect, thus the call for help.
  • Anything else you have to add will be greatly appreciated. Adding that information before September 14th would raise the level of my appreciation! :-)

I’m extremely motivated to do this, and I want to absolutely kill this presentation! The only way to hit my expectations is to prep for it, and the best way to prep for it is to take in as much positive and negative criticism before it goes live as I possibly can.

Thank you for taking the time to read this!
Steve

Fixing Page Life Expectancy (PLE)


There are several reasons Page Life Expectancy (PLE) can drop, and understanding those root causes will help you improve SQL Server performance. This is a very long post to go with a very important counter showing how to make a real impact on your server’s performance. If you stick with me through this, you’ll be looking at your servers from new angles and that effort will be noticed by the Sys Admins, SAN Admin, the users, and your boss.

Before we begin, there are some ground rules we need to get out of the way: defining PLE and understanding that there are external memory factors. There’s no shame in skipping that and going straight to the focus of this post. Personally, I find the next two sections a little dry, but I’m also my harshest critic!

Define PLE

Before we get too deep into it, let’s make sure we’re on the same page on a couple things.

Page Life Expectancy is the number of seconds the average page of data has been in the data cache (aka buffer cache). The data cache is part of the buffer pool (the similar name is why I use data cache instead of buffer cache, although they’re the same thing), which is the memory you allocated to SQL Server. SQL Server uses this area to give quicker access to data instead of taking the long, slow trip to persisted storage (disk).

Keep in mind that SQL reads the data pages from the cache, always from cache. If the data you need isn’t there then SQL Server does a physical read to put it there. After that’s done it will do a logical read to use the page that’s now in memory. If you want to dive into the detail you can do it here, specifically with the reading pages and writing pages links on that page.

That physical read is going to the slowest part of your server (even expensive flash is the slowest part, it’s just not as slow) to read the page(s) from persisted media, and it also uses the CPU in the process. This makes PLE critical because, even if you ignore the extra load you just placed on the I/O system and CPU, you’re waiting for an additional, slower action to take place.

You can find your page life expectancy in sys.dm_os_performance_counters using my post on OS Perf Counters. That post will also help you realize how much load you take off of your disks by raising PLE, and then you’re really going to start to understand the burning drive picture you find there.

The problem is that the 300 value for Page Life Expectancy that many people still see documented is wrong, very wrong. If this is simplified to be how long the average piece of data will stay in memory and you have a server with 64 GB of memory with 56 GB allocated to SQL Server, that means you’re reading about 56 GB of data from disk every 300 seconds. If you look at Page 36 of Troubleshooting SQL Server – A Guide for the Accidental DBA by Jonathan Kehayias and Ted Krueger you’ll see an actual scalable version of this: PLE should be 300 for every 4 GB of RAM on your server. That means for 64 GB of memory you should be looking at closer to 4,800 as your critical point.
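
If you want that math done for you, here’s a minimal sketch based on the Target Server Memory counter. The object_name assumes a default instance, so adjust the prefix for named instances, and treat the result as a rough guideline rather than a hard threshold.

SELECT TargetMemoryGB = cntr_value / 1024 / 1024
    , MinHealthyPLE   = cntr_value / 1024 / 1024 / 4 * 300 --300 per 4 GB
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Memory Manager'
    AND counter_name = 'Target Server Memory (KB)'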

I’ll go so far as to say that you should be cautious of anything that doesn’t have a formula behind it, because our servers keep getting faster, our databases keep getting bigger, and static values…don’t. Jonathan Kehayias and Ted Krueger giving a formula is great, because the old value became outdated the day SQL Server went 64-bit, or even before that with the AWE setting.

Jonathan Kehayias also put this in his blog post Finding What Queries in the Plan Cache Use a Specific Index, and that had a great series of comments going back and forth between him and Brent Ozar. Brent is discussing not putting so much faith in PLE and watching wait stats instead, as different-speed I/O systems can greatly sway the impact of low PLE or even make it so a high PLE just isn’t high enough. They’re both right, and if you can understand the intent of each one then you’ll be in a much more comfortable place when tuning your servers.

If you force me to take a side on that post, Jonathan wins out over Brent here. This is because Jonathan is talking about having a higher PLE, which in turn reduces load on the disks and typically reduces the amount of work that needs to be done to execute a query. Brent is basically implying that money may have already been thrown at the issue, which masks the current symptoms of what could be a less scalable solution. However, you can’t discount Brent’s logic here, as the things that are limiting your server’s performance at this time lie in the wait stats, not this counter.

Quick Rant: We’re Not Alone

So, you want higher PLE to show that you’re using your resources better, and the only way PLE goes up is by leaving data in the data cache once it’s there. The problem is, there’s a lot going on that wants to work the other way. Even if you have 64 GB of memory dedicated to SQL Server to host a single 64 GB database, it won’t all fit in memory; that’s because other things want to play as well.

First, your memory isn’t completely dedicated to the buffer pool. Memory pressure from the OS or VMs changing size can drop the size of the buffer pool unless you’re using “lock pages in memory”, which is not recommended in most situations, especially on VMs. If you’re not on a completely dedicated box, which means a physical box with no apps, no GUI, no RDP sessions, no SQL Server, no network card driver, etc., then this can happen. Wait a minute…you can’t uninstall SQL Server to make sure SQL Server’s buffer pool remains untouched, yet SQL Server uses memory outside of this area for any purpose that isn’t specifically assigned to the buffer pool, and that memory usage can get out of control. If there’s memory pressure then some pages are forced out of the data cache to drop the size of the buffer pool, leaving you open to more contention and a lower PLE.

Second, your buffer pool isn’t dedicated to your data cache; it also houses your proc cache, which holds execution plans stored so they can be reused. These plans take a lot of CPU to create and reusing them saves you a lot of resources, so you can’t complain that you’re sharing space here. SQL Server determines how much of the buffer pool to assign to each, and it’s possible for either side to have unnecessary data in it. I’m going to focus on the data cache here, so I’ll defer the abuse-of-proc-cache conversation to Kimberly Tripp. There’s more to it than just her post, but it’s a great place to start understanding what’s there and not in use.

Focus, Focus, Focus

Ok, so we know what PLE is and that external factors can affect it, but that’s not really why you’re here, is it? Are you here because you have a data cache and you want to manage it to keep your PLE going up? Maybe a little. My guess is that you’re here because you heard that PLE is critical to your server’s performance and you want to increase performance by raising this counter the right way. You do that in two ways.

First, make sure you have as much room as possible. If your server can handle more memory, it’s often cheaper to buy that memory than it is to intervene on the database side. Tuning and purging scale well, will help a lot more than just memory management, and are the best thing you can do given the time budgeted to them, but they just aren’t as quick and easy as a memory upgrade in many cases. Even if you buy memory now and tune later, it will continue to help. 64 GB of memory, as Brent Ozar pointed out here, is only a couple hundred dollars. FYI, I absolutely hate throwing money at issues, and I still listed this first.

Second, keep as little in memory as possible to make efficient use of the space you have. Yeah, it’s that easy. Well, at least it’s that easy to say, but doing it takes a little more work. Here’s a list of ways you can trim down on the space you need.

  • Drop Unused Indexes
  • Merge Duplicate Indexes
  • Use Your Indexes – SARGability
  • Watch for Big Queries
  • Look in Your Proc Cache for Opportunities
  • Know What’s in Your Buffer Cache
  • Index Maintenance – Defrag
  • Index Maintenance – Statistics
  • Purge Your Data
  • Other

Through all of this, please remember that your initial thought may be to raise a counter, but is that your real goal? I’m betting you want your server to run faster, and PLE is merely one counter that helps put a value on performance. Servers with high PLE can run horribly; they just don’t need to keep running back to disk. Servers with low PLE can run great; they just need to go back to disk to get what they need.

A little bit ago I mentioned buying more memory; that will raise your PLE without making any queries really run more efficiently. Sure, they’re more likely to have their data in cache, which is great, but they’re doing just as many logical reads and using just as much CPU to do their joins. The rest of this post isn’t like that; the rest is about making queries more efficient in a way that just happens to make your PLE go up.

Drop Unused Indexes

This part’s tricky. SQL 2005 through 2008 R2 told you how much indexes were used since the last time SQL Server was started in the DMV dm_db_index_usage_stats. It’s still there in SQL 2012 and beyond, but these statistics now get reset when you rebuild an index. That means that, best case scenario on a server that’s patched monthly, you have a month’s worth of data to go off of, and you can’t even rely on that being the whole picture anymore. I get into more detail on how to keep track of this over time in my post Indexes – Unused and Duplicates.
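
For a quick look at the current counters, here’s a starting-point sketch; the full version, including how to track this over time, is in the post I just mentioned. It only covers the current database and skips heaps, clustered indexes, and unique indexes.

SELECT TableName = OBJECT_NAME(i.object_id)
    , IndexName  = i.name
    , UserReads  = ISNULL(ius.user_seeks, 0) + ISNULL(ius.user_scans, 0) + ISNULL(ius.user_lookups, 0)
    , UserWrites = ISNULL(ius.user_updates, 0)
FROM sys.indexes i
    LEFT JOIN sys.dm_db_index_usage_stats ius ON ius.object_id = i.object_id
        AND ius.index_id = i.index_id
        AND ius.database_id = DB_ID()
WHERE i.type_desc = 'NONCLUSTERED'
    AND i.is_unique = 0
    AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY UserReads, UserWrites DESC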

This has a minor implication on PLE because as the pages of your unused indexes are updated, SQL Server has to read part of the B+tree into memory to find the page, then read the page to be updated into memory, update it in memory and mark it as a dirty page, then eventually write it back to disk. The key here is that pieces are read into memory, AKA the buffer pool, the part of SQL Server you’re trying to clean up.

Merge Duplicate Indexes

This one’s not as tricky, but the performance issues are much worse. The problem is that you have two indexes, we’ll call them ix_A and ix_B, that are very similar. They have the same first two key fields and stray a little after that. The differences mean that ix_A will be better for one query and ix_B will be better for another, and that means SQL Server will be dragging them both into cache. Sure, if you take the columns that are in ix_B but not in ix_A, add them to ix_A, and then drop ix_B, the queries that used to use ix_B will probably need to do a little more work, but you’ll end up with a more efficient use of the cache. Your typical testing will show you the worst-case scenario here, because you’ll run it each way multiple times and see what it’s like when the indexes are completely in cache. However, the real world is more friendly than that (that’s a first), and the query that used to use ix_B may actually run faster because now it’s using an index that’s more likely to be in cache, cutting down on physical reads. Going to the same blog post, Indexes – Unused and Duplicates, you can see details on how to help relieve this problem.
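
Here’s a completely made-up before and after just to show the shape of the change; the Orders table, its columns, and the index names are hypothetical.

--Before: two indexes with the same leading key fields, both getting pulled into cache
--  CREATE INDEX ix_A ON dbo.Orders (CustomerID, OrderDate) INCLUDE (OrderTotal)
--  CREATE INDEX ix_B ON dbo.Orders (CustomerID, OrderDate, StatusID)

--After: one index that covers both workloads
CREATE INDEX ix_A ON dbo.Orders (CustomerID, OrderDate, StatusID) INCLUDE (OrderTotal)
    WITH (DROP_EXISTING = ON)

DROP INDEX ix_B ON dbo.Orders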

There’s a bit of an art to this at first, but you’ll get it down to a science with practice.

Use Your Indexes – SARGability

SARGability (Search ARGument capable) is its own topic, and it deserves at least one post of its own. You need to understand how indexes work, as stated in my post Indexing Fundamentals, and make sure that your queries can take advantage of indexes. If you look in a phone book for people with a last name like ‘%ood’ it’s no good, because you need the first letter of the last name to take advantage of how that index is sorted. If you look up people with the first name ‘Steve’ then that’s also useless, because the first key field was last name. Using functions, implicit conversions, and many other techniques can force you to scan an entire index.
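
Here’s a made-up example of the same search written both ways; the People table and its index on LastName are hypothetical.

--Non-SARGable: the function wrapped around the column forces a scan of the whole index
SELECT FirstName, LastName
FROM dbo.People
WHERE LEFT(LastName, 4) = 'Hood'

--SARGable: the column is left alone, so an index with LastName as the first key field can be seeked
SELECT FirstName, LastName
FROM dbo.People
WHERE LastName LIKE 'Hood%'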

In terms of PLE, that means you read the entire index into memory instead of just the pieces you need. That can add up to a lot of data being kicked out of cache to make room for a lot of physical reads from disk.

This isn’t an easy topic, and I’m not sure of any single resource that says everything that needs to be said about it. While I have plans to write a post on it and will update this one when I do, it’s still an unwritten post. Until that time comes, Google “SQL Server SARGable” and see what pops up.

If you know how to read execution plans, there are two sections on a seek, scan, or lookup that you need to know here. Predicate and Seek Predicate. Predicate is what it had to scan for, while a seek predicate is what it could find efficiently using the fact that the key fields are sorted.

Watch for Big Queries

It doesn’t matter if you use Extended Events or Profiler Traces, you need to know when large queries are being run on your servers. If you want to watch this at a statement level then you’ll want to make the jump to Extended Events, but I’ve never seen it hurt to watch SQL Batch Completed and RPC Completed filtered only by duration over 10 seconds.
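
If you want to go the Extended Events route, here’s a minimal sketch of that 10-second filter on SQL Server 2012 or later. The session name and file name are made up, and duration is in microseconds for these events.

CREATE EVENT SESSION [LongRunningQueries] ON SERVER
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.sql_text, sqlserver.database_id)
    WHERE duration > 10000000),
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.sql_text, sqlserver.database_id)
    WHERE duration > 10000000)
ADD TARGET package0.event_file
    (SET filename = N'LongRunningQueries.xel', max_file_size = 50, max_rollover_files = 4)
GO

ALTER EVENT SESSION [LongRunningQueries] ON SERVER STATE = START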

The point of this when it comes to PLE is that the queries that aren’t SARGable or don’t have an index to take advantage of will be doing a lot of scans. The bigger the scan, the longer the query will take to run, and if you care about PLE then you’ll want to know what queries those are. Sure, some are legitimately doing enough work that they’ll always take that long to run, but most of the queries that take a while can be tuned. Be it changes to the query itself (which should be your first instinct) or indexing changes, there’s often a way to have a query require less data in cache.

This is an extremely useful technique to use in correlation with watching for drops in PLE. If it drops then check to see what was running at the time. This shows you what queries were running, which would be most of your issues, but it could be outside of the query itself and actually be something like a large auto update statistics task being kicked off that you won’t see here.

Know What’s in Your Buffer Cache

Very few people do this, yet it’s so obvious once you start. You want to raise PLE, right? PLE is a measurement of how long stuff stays in your buffer cache, and drops because something else needed to be put in there. So, what’s in there that’s taking up all your space and how’d it get there? It took me years to come up with that question, and I feel like a bit of a fool for that.

It was actually one of my biggest tuning revelations since I read Grant Fritchey’s Execution Plans book. And you always know it’s a great revelation when you find yourself yelling at yourself for not realizing it earlier…years earlier.

So, how do you do this? Query your cache to find the indexes that take up the most space there, pick one that takes an excessive amount of space or doesn’t look like it belongs there, query the proc cache to find out where that index is used, and tune that query either through code changes or indexing changes. The scripts to do this and a more detailed description of the process can be found in my post Cleaning up the Buffer Pool to Increase PLE.
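
As a condensed sketch of that first step, this shows the cached MB per index for whatever database you run it in; the full scripts and the proc cache piece are in the post I just linked.

SELECT TableName = OBJECT_NAME(p.object_id)
    , IndexName  = i.name
    , CachedMB   = COUNT(*) / 128 --8 KB pages, so 128 pages per MB
FROM sys.dm_os_buffer_descriptors bd
    INNER JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
    INNER JOIN sys.partitions p ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2 AND au.container_id = p.partition_id)
    INNER JOIN sys.indexes i ON p.object_id = i.object_id
        AND p.index_id = i.index_id
WHERE bd.database_id = DB_ID()
    AND p.object_id > 100 --Skip most system objects
GROUP BY p.object_id
    , i.name
ORDER BY CachedMB DESC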

Look In Your Proc Cache for Opportunities

Let’s start by saying this isn’t perfect. These numbers get reset throughout the day and some queries never make it in here at all, so it isn’t an all-inclusive, one-stop shop. That is in no way saying that you can’t make amazing improvements on the queries you find here.

There are two ways you can use the cache. Most people know that you can get your most expensive queries, such as with the query on the MSDN sys.dm_exec_query_stats page, and you can use your imagination to sort this by any of the counters available here. That will give you an idea of what needs tuning, if you need somewhere to start. I love taking this a step further and monitoring that exact information so I have a historical view instead of just what happens to be in cache now. You can read all about that in my post Query Stats.

The other way is parsing through the XML in the plans to find specific items. I heard that grunt when I mentioned XML, and you’re not alone. Not many of us are good at XML; Jonathan Kehayias is an extreme exception here. He gives you his queries so you can run with them, do what he does, and alter them to go even further. This is exactly what I did in my post Cleaning up the Buffer Pool to Increase PLE that I just mentioned in the last section.
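
Here’s a blunt sketch that skips the proper XQuery entirely and just pattern-matches the plan text for an index name. It isn’t cheap to run, and Jonathan’s queries do this far more precisely, but it shows the idea; the index name is a placeholder.

SELECT TOP 50 qs.execution_count
    , qs.total_logical_reads
    , QueryText = st.text
    , qp.query_plan
FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE CAST(qp.query_plan AS NVarChar(max)) LIKE N'%IX_YourIndexName%'
ORDER BY qs.total_logical_reads DESC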

Index Maintenance – Defrag

Many people only think about defragmenting their indexes to help get contiguous reads on their disk, but that’s only half the story. In fact, the better you do everything in this post, the less you’ll have to worry about your disks.

However, you also have to keep in mind how full each page of your indexes is. If you have an index page that’s 100% full then you have a full 8kb of data there. Add another row and you now have 8.1kb of data that’s split between two 8kb pages, so you’re wasting almost 50%. 50% isn’t even your worst case scenario because pages are never automatically removed or merged if records are deleted unless you delete every row stored in that page. Paul Randal’s post Performance Issues From Wasted Buffer Pool Memory takes a deeper look at this problem, and Ola Hallengren’s scripts can help you with a solution.
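
Here’s a sketch for checking both sides of that story at once. SAMPLED mode is used because LIMITED doesn’t populate the page-fullness column, and the 10 MB floor is just to skip the tiny stuff.

SELECT TableName = OBJECT_NAME(ips.object_id)
    , IndexName  = i.name
    , ips.avg_fragmentation_in_percent
    , ips.avg_page_space_used_in_percent
    , ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
    INNER JOIN sys.indexes i ON ips.object_id = i.object_id
        AND ips.index_id = i.index_id
WHERE ips.page_count > 1280 --Over 10 MB
ORDER BY ips.avg_page_space_used_in_percent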

Index Maintenance – Statistics

It may seem obvious to do your index defrag job off hours. While I’ve seen servers without the job, I’ve never seen one scheduled for mid-day. That’s a good thing, because a lot of data needs to be pulled into cache to play around with indexes like that.

However, do you have auto update statistics turned on? Do NOT turn it off because of this, but understand that it updates the statistics by reading about 1% of the index into cache. If that’s a 100 GB index then you’re reading 1 GB of data, which is a decent portion of your cache. To help avoid this, you should be updating your stats off hours, and Ola Hallengren’s scripts can help with this, too.

You need to update your stats because as more and more updates are made to a table the less accurate your stats are, making your execution plans less accurate. Once you update about 20% of the rows of a table since the last time your statistics were updated then they’ll automatically be updated again if you leave the default option turned on. Again, don’t go turning it off because of this post alone. Instead, schedule your stats to be updated off-hours and only the tables that have 20% of their rows updated throughout the day will get their stats updated automatically, and that will stop most auto-updates on the larger tables that would cause issues.
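
If you want to see how stale your stats are before scheduling anything, here’s a quick sketch; sys.dm_db_stats_properties needs SQL Server 2008 R2 SP2 or 2012 SP1 and later.

SELECT TableName = OBJECT_NAME(s.object_id)
    , StatName   = s.name
    , sp.last_updated
    , sp.rows
    , sp.modification_counter
FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
    AND sp.modification_counter > 0
ORDER BY sp.modification_counter DESC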

Purge Your Data

We talked a lot about what to do with the data you have, but do you need all of your data? Step away from SQL Server for a minute and go talk to an accountant. Ask them where they keep the paperwork from last month and they’ll point to a filing cabinet. Now, knowing they’re legally obligated to keep a lot of stuff for 7 years, ask them where a 5-year-old document is. It’s not right in front of them, but they know where to look for it. It’s not taking up valuable space in their office (read: not on the main production server), often not in the same building they’re in (read: not online), and it will take a bit to get to it. They no longer have an active business use for it, and they can get to it in the rare case that it’s needed.

Ask them where something is that’s 8 years old, and they’ll find a friendly way to tell you that they bought an OCD pyromaniac a pair of scissors and a book of matches. We’re more eloquent with deleting things in batches, but their way sounds like more fun.

If they kept everything in one room would you call that room a cluttered, unmanageable mess? If they kept historical records forever, would you think they were being obsessive and wasteful with purchasing storage?

So, can you pull up a detailed sales report for February 29th, 2000 from Prod for me?

Other…Am I Giving Up?

There are too many things to list in one blog post. You could write a book on this subject. I listed what I feel will help you the most, but I also wanted to take time to let you know that my list isn’t some magical, all-inclusive, everything you can do to make your servers run better. It’s a start. Hopefully a good one you found to be productive, but a mere start no matter how you look at it.

Speaking of “you could write a book on the subject”, well, they did. A lot of them did. Some even made them free PDFs. Not bootleg bit-torrent copies, but actual, legal, free PDFs made possible by Red Gate in their book selection on SQL Server Central.

Some books you have to buy, if that’s your thing:

Also, I’m not the only person blogging about this topic:

Results

If you did all of this and there’s something running on your servers to trend performance, you’ll notice:

  • Page Life Expectancy: Raised significantly, but you saw that coming
  • Page Reads/sec: (Physical Reads) Dropped because we’re not cycling data in and out of cache as fast
  • Page Lookups/sec: (Logical Reads) Will drop with some of these
  • % Processor Time: Dropped due to lower I/O, more efficient queries, less pressure on procedure cache, etc.
  • Critical query execution time: Typically less
  • DBA Pay Rate: If this isn’t on the list, try using a chart of the above counters

Rewritten Post

Some readers may have realized that I rewrote this post, and you can find the original version here. I did this because I didn’t feel the old post lived up to my expectations, and I’d love to hear feedback on what you think of the differences.

Sys.dm_os_performance_counters Demystified


The DMV sys.dm_os_performance_counters is awesome, if you can understand it. This is how I make it easy to read and use. Here are the values I watch and why I watch them. My list isn’t going to be perfect and you’re not going to agree with it 100%, and I’m ok with that. First, there is no perfect. Second, if we agreed 100% then one of us would just be mindlessly following the other, which isn’t good.

  • Cache Hit Ratio – I ignore this value, but I still monitor it. I will, ideally, never be the only DBA on a team again, and everyone seems to think this value is cool.
  • Page Life Exp – My favorite! When you read a page from disk into memory, how many seconds will it stay there? Just don’t use the outdated “300” rule or your disks will catch on fire!!!

    (Image: a burning drive, captioned “PLE = 300”)

  • Page Lookups/Sec – How many pages are read from memory.
  • Page Reads/Sec – How many pages are read from disk.
  • Page Writes/Sec – How many pages are written to disk.
  • Lazy Writes/sec – How many pages are written to disk outside of a checkpoint due to memory pressure.
  • Batch Requests/sec – How busy is the server?
  • Trans/sec – How busy is the server?
  • Total Server Memory – How much memory SQL Server is currently using. Typically ramps up to Target value and PLE is low as it ramps up since new pages are in memory dropping the average.
  • Target Server Memory – How much memory SQL Server is allowed to use. Should be the same as the max memory setting, but memory pressure can cause this to decrease.
  • Memory Grants Pending – How many processes aren’t able to get enough memory to run. Should be 0, always 0, if not then find out why.
  • Deadlocks – How many deadlocks are we getting. Most apps handle deadlocks gracefully, but they still lose time doing it. If this number starts going up, start looking into it.
  • SQL Compilations/sec – This is a hidden performance killer! Some queries can’t be cached so they’re compiled every time they’re run. I’ve seen this with a query being run once a second and a big server was running slower than my laptop. It’s normal for things to compile throughout the day, it’s not normal for this number to be 10x higher than before that last upgrade.
  • SQL Re-Compilations/sec – Same goes here. The counters aren’t that much different.

If you know a little about this DMV then you know these values are cryptic. There are several ways this data is stored, and it has to be retrieved differently for each type to be useful. Then many of these are cumulative since the server was restarted, which isn’t going to help too much. Even worse, MSDN failed us on this one, and figuring out this DMV required help outside of that site. Now for the good news: the script below will take care of all of that for you and leave you with some easy reading, with values you can filter to the time periods you care about. If you want to add some of your own counters then just follow my lead on one that has the same cntr_type, or you can go to Rabin’s blog post that I learned from.

IF object_id('tempdb..#OSPC') IS NOT NULL BEGIN
    DROP TABLE #OSPC
END

DECLARE @FirstCollectionTime DateTime
    , @SecondCollectionTime DateTime
    , @NumberOfSeconds Int
    , @BatchRequests Float
    , @LazyWrites Float
    , @Deadlocks BigInt
    , @PageLookups Float
    , @PageReads Float
    , @PageWrites Float
    , @SQLCompilations Float
    , @SQLRecompilations Float
    , @Transactions Float

DECLARE @CounterPrefix NVARCHAR(30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
                            THEN 'SQLServer:'
                        ELSE 'MSSQL$' + @@SERVICENAME + ':'
                        END

--Grab the current values from dm_os_performance_counters
--Doesn't do anything by instance or database because this is good enough and works unaltered in all environments
SELECT counter_name, cntr_value--, cntr_type --I considered dynamically doing each counter type, but decided manual was better in this case
INTO #OSPC 
FROM sys.dm_os_performance_counters 
WHERE object_name like @CounterPrefix + '%'
    AND instance_name IN ('', '_Total')
    AND counter_name IN ( N'Batch Requests/sec'
                        , N'Buffer cache hit ratio'
                        , N'Buffer cache hit ratio base'
                        , N'Free Pages'
                        , N'Lazy Writes/sec'
                        , N'Memory Grants Pending'
                        , N'Number of Deadlocks/sec'
                        , N'Page life expectancy'
                        , N'Page Lookups/Sec'
                        , N'Page Reads/Sec'
                        , N'Page Writes/Sec'
                        , N'SQL Compilations/sec'
                        , N'SQL Re-Compilations/sec'
                        , N'Target Server Memory (KB)'
                        , N'Total Server Memory (KB)'
                        , N'Transactions/sec')

--Just collected the second batch in the query above
SELECT @SecondCollectionTime = GetDate()

--Grab the most recent values, if they are appropriate (no reboot since grabbing them last)
SELECT @FirstCollectionTime = DateAdded
    , @BatchRequests = BatchRequests
    , @LazyWrites = LazyWrites
    , @Deadlocks = Deadlocks
    , @PageLookups = PageLookups
    , @PageReads = PageReads
    , @PageWrites = PageWrites
    , @SQLCompilations = SQLCompilations
    , @SQLRecompilations = SQLRecompilations
    , @Transactions = Transactions
FROM OSPerfCountersLast 
WHERE DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'TempDB')

--If there was a reboot then all these values would have been 0 at the time the server came online (AKA: TempDB's create date)
SELECT @FirstCollectionTime = ISNULL(@FirstCollectionTime, (SELECT create_date FROM sys.databases WHERE name = 'TempDB'))
    , @BatchRequests = ISNULL(@BatchRequests, 0)
    , @LazyWrites = ISNULL(@LazyWrites, 0)
    , @Deadlocks = ISNULL(@Deadlocks, 0)
    , @PageLookups = ISNULL(@PageLookups, 0)
    , @PageReads = ISNULL(@PageReads, 0)
    , @PageWrites = ISNULL(@PageWrites, 0)
    , @SQLCompilations = ISNULL(@SQLCompilations, 0)
    , @SQLRecompilations = ISNULL(@SQLRecompilations, 0)
    , @Transactions = ISNULL(@Transactions, 0)

SELECT @NumberOfSeconds = DATEDIFF(ss, @FirstCollectionTime, @SecondCollectionTime)

--I put these in alphabetical order by counter_name, not column name.  It looks a bit odd, but makes sense to me
--Deadlocks are odd here.  I keep track of the number of deadlocks in the time period, not average number of deadlocks per second.
  --AKA, I keep track of things the way I would refer to them when I talk to someone.  "We had 2 deadlocks in the last 5 minutes", not "We averaged .00002 deadlocks per second there"
INSERT INTO OSPerfCounters (DateAdded, Batch_Requests_Sec, Cache_Hit_Ratio, Free_Pages, Lazy_Writes_Sec, Memory_Grants_Pending
    , Deadlocks, Page_Life_Exp, Page_Lookups_Sec, Page_Reads_Sec, Page_Writes_Sec, SQL_Compilations_Sec, SQL_Recompilations_Sec
    , ServerMemoryTarget_KB, ServerMemoryTotal_KB, Transactions_Sec)
SELECT @SecondCollectionTime
    , Batch_Request_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec') - @BatchRequests) / @NumberOfSeconds
    , Cache_Hit_Ratio        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio') * 1.0 / (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base') --Multiply by 1.0 so integer division doesn't truncate the ratio
    , Free_Pages            = (SELECT cntr_value FROM #OSPC WHERE counter_name =N'Free pages')
    , Lazy_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec') - @LazyWrites) / @NumberOfSeconds
    , Memory_Grants_Pending    = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Memory Grants Pending')
    , Deadlocks                = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec') - @Deadlocks) 
    , Page_Life_Exp         = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page life expectancy')
    , Page_Lookups_Sec      = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec') - @PageLookups) / @NumberOfSeconds
    , Page_Reads_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
    , Page_Writes_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec') - @PageWrites) / @NumberOfSeconds
    , SQL_Compilations_Sec  = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec') - @SQLCompilations) / @NumberOfSeconds
    , SQL_Recompilations_Sec= ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec') - @SQLRecompilations) / @NumberOfSeconds
    , ServerMemoryTarget_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Target Server Memory (KB)')
    , ServerMemoryTotal_KB  = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Total Server Memory (KB)')
    , Transactions_Sec        = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec') - @Transactions) / @NumberOfSeconds

TRUNCATE TABLE OSPerfCountersLast

--Note, only saving the last value for ones that are done per second.
INSERT INTO OSPerfCountersLast(DateAdded, BatchRequests, LazyWrites, Deadlocks, PageLookups, PageReads
    , PageWrites, SQLCompilations, SQLRecompilations, Transactions)
SELECT DateAdded            = @SecondCollectionTime
    , BatchRequests            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec')
    , LazyWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy Writes/sec')
    , Deadlocks             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec')
    , PageLookups            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec')
    , PageReads                = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec')
    , PageWrites            = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec')
    , SQLCompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec')
    , SQLRecompilations        = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec')
    , Transactions             = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec')

DROP TABLE #OSPC

Throw the code above into a proc, schedule it to run every so often (I like 5 minutes), and it’ll… fail. It kinda relies on a couple of tables you should create first. Here ya go.

CREATE TABLE OSPerfCounters(
    DateAdded datetime NOT NULL
    , Batch_Requests_Sec int NOT NULL
    , Cache_Hit_Ratio float NOT NULL
    , Free_Pages int NOT NULL
    , Lazy_Writes_Sec int NOT NULL
    , Memory_Grants_Pending int NOT NULL
    , Deadlocks int NOT NULL
    , Page_Life_Exp int NOT NULL
    , Page_Lookups_Sec int NOT NULL
    , Page_Reads_Sec int NOT NULL
    , Page_Writes_Sec int NOT NULL
    , SQL_Compilations_Sec int NOT NULL
    , SQL_Recompilations_Sec int NOT NULL
    , ServerMemoryTarget_KB int NOT NULL
    , ServerMemoryTotal_KB int NOT NULL
    , Transactions_Sec int NOT NULL
)

--You'll typically only query this by one value, which is added sequentially.  No page splits!!!
CREATE UNIQUE CLUSTERED INDEX IX_OSPerfCounters_DateAdded_U_C ON OSPerfCounters
(
    DateAdded
) WITH (FillFactor = 100)

--Only holds one value at a time, indexes are a waste
CREATE TABLE OSPerfCountersLast(
    DateAdded datetime NOT NULL
    , BatchRequests bigint NOT NULL
    , LazyWrites bigint NOT NULL
    , Deadlocks bigint NOT NULL
    , PageLookups bigint NOT NULL
    , PageReads bigint NOT NULL
    , PageWrites bigint NOT NULL
    , SQLCompilations bigint NOT NULL
    , SQLRecompilations bigint NOT NULL
    , Transactions bigint NOT NULL
)
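
If it helps to see the scheduling piece spelled out, here’s a rough sketch of wrapping the collection script in a proc and having SQL Agent run it every 5 minutes. The proc name, job name, and the Perf database are placeholders I made up, not anything you have to use.

USE Perf --Assumed name; use whatever database holds your monitoring tables
GO

CREATE PROC proc_CollectOSPerfCounters
AS
SET NOCOUNT ON
--Paste the collection script from above here, unchanged
GO

USE msdb
GO

EXEC dbo.sp_add_job @job_name = N'Collect OS Perf Counters', @enabled = 1

EXEC dbo.sp_add_jobstep @job_name = N'Collect OS Perf Counters'
    , @step_name = N'Run collection proc'
    , @subsystem = N'TSQL'
    , @database_name = N'Perf'
    , @command = N'EXEC dbo.proc_CollectOSPerfCounters'

EXEC dbo.sp_add_jobschedule @job_name = N'Collect OS Perf Counters'
    , @name = N'Every 5 minutes'
    , @freq_type = 4                --Daily
    , @freq_interval = 1
    , @freq_subday_type = 4         --Units of minutes
    , @freq_subday_interval = 5
    , @active_start_time = 0

EXEC dbo.sp_add_jobserver @job_name = N'Collect OS Perf Counters', @server_name = N'(local)'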

The important part of all this is how you use it. It’s tempting to just look at the last 7 records and say that you know what’s going on, and that makes me want to slap you. Every server is different, every server has different loads and baselines, and you’re either underworked or you don’t know what those baselines are for every server you manage. I do simple baselines every time I look at an incident, comparing the last hour to the same time yesterday and the same time a week ago. That gives you a chance to see what’s normal for this server and what’s different right now. This query is so simple you’ll wonder why I even posted it, but it’s effective, which is why it’s here. As for the 7 records per day, that’s because 21 records show up on my screen without me scrolling; it is NOT a magic number!

SELECT 'Today', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
ORDER BY dateadded DESC
) X

UNION
SELECT 'Yesterday', * FROM (SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-1
ORDER BY dateadded DESC
) Y

UNION
SELECT 'Last Week', * FROM (
SELECT TOP 7 *
FROM OSPerfCounters
WHERE dateadded <= GETDATE()-7
ORDER BY dateadded DESC) Z

ORDER BY dateadded DESC

And, well, this is something I’ve been skipping in my posts and telling people to handle on their own: cleanup. Here’s step 2 of the jobs that populate my monitoring tables, which keeps this data from becoming the ever-growing data you’re struggling with in every other app. I delete in batches according to the clustered index. It’s overkill for something deleting one row at a time, or, even if you put this in a separate daily job, 288 rows if the process is scheduled every 5 minutes. So, why the batches? Because I copy/paste my own code everywhere, batches are reusable, and this is how I chop off the tail end of EVERYTHING!

SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM OSPerfCounters
    where dateadded < (GetDate() - 400)
END 

In the beginning I mentioned that if you agreed with me 100% then one of us is a mindless monkey. Look, I put this out there first, so I’m obviously not the mindless monkey here, am I? There’s a box below that gives you a chance to show that you’re not a mindless monkey either! Tell me I’m wrong, how I can do better, and how everyone else reading this can benefit from it even more! I’ll promote you from mindless monkey to talking monkey!

Deadlock – A quick, easy view


Deadlocks are rough to work with. Here are the scripts I use to capture deadlocks, find which ones are recurring, and view them, along with a couple of free eBooks to help you resolve them.

You could have thousands of deadlocks, and it would take you forever to find out which ones are recurring and which ones are some freak accident with an annual process. I used to read through deadlock graphs one-by-one to see what was recurring, and I used to be less satisfied with my job as well. If you’re going to bang your head on your desk making sure a deadlock never happens again, you’ll want to make sure it wasn’t a one-time event that wouldn’t have happened again anyway.

I’m not going to get into how to figure out deadlocks; that’s a chapter in a book more than it’s a blog post. Chapter 7, to be exact, of the FREE eBook or $25 physical book for Accidental DBAs by Jonathan Kehayias and Ted Krueger. There’s also a whole book on blocking by Kalen Delaney, again as a FREE eBook or $22 physical book. I have to admit that Ted Krueger isn’t a name I recognize. However, Jonathan and Kalen are easily on my list of top 5 best MVPs out there.

Anyways, back to me. Here’s what I use to look at deadlock traces. The base code was found on the internet years ago, and you can find it in several places now. Because of that I have no idea who the original author is, but I’d be glad to throw up credit here for anyone who can show me a site dated older than when I started using it.

There are two versions here, almost identical code in each. The first one summarizes the deadlocks and gives you a count of the occurrences. The second one shows each SPID involved in the deadlock separately along with the XML for the deadlock.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%')
--SET @Path = 'D:\Perflogs\PerfTraces\DeadlockTrace.trc'

;with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-06-17 00:00' AND '2013-06-30 00:00'
)
SELECT [Procedure], Inputbuffer, dMonth = DatePart(Month, DeadLockTime), dDay = DatePart(Day, DeadLockTime), DeadlockCount = Count(*), dMax = max(DeadlockTime), dMin = min(DeadlockTime)
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = Deadlock.Process.value('@clientapp', 'varchar(100)'),
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [HostPID] = Deadlock.Process.value('@hostpid', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
) X
--WHERE Victim = 0 
--WHERE TimeoutID IN (SELECT RowID FROM CTE WHERE [PROCEDURE] = 'master.dbo.FakeProcName')
--WHERE inputbuffer like '%select top 10%'
GROUP BY [procedure], Inputbuffer, DatePart(Month, DeadLockTime), DatePart(Day, DeadLockTime)
--ORDER BY TimeoutID 

Then there’s this to actually let me look at the XML. Although you can easily see in the script below which deadlocks recur back-to-back, it’s not so easy to see the recurrence of deadlocks that happen once every morning between 2 and 3 AM. That’s where the first script shines, and you should use and abuse it.

DECLARE @Path VarChar(500);
SET @Path = (SELECT TOP 1 [path] FROM sys.traces WHERE [path] LIKE '%deadlock%');

with CTE as
(
select 
 [TraceID] = 3, 
 [RowID] = row_number() OVER (ORDER BY StartTime), -- assign a row number to each deadlock
 [DeadlockGraph]=case when TextData like '<deadlock-list%' then convert(xml, TextData) else null end, 
 * 
 from ::fn_trace_gettable(@Path, default)
where TextData like '<deadlock-list%'
    --AND StartTime BETWEEN '2013-03-08 02:00' AND '2013-03-08 04:00'
    AND StartTime > DATEADD(Hour, -2, GetDate())
)
SELECT *
FROM
(
select 
 [TimeoutID] = CTE.RowID,
 [DeadlockTime] = [StartTime],
 [DeadlockGraph],
 [PagelockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/pagelock[1]/@objectname', 'varchar(200)'),
 [DeadlockObject] = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/resource-list[1]/objectlock[1]/@objectname', 'varchar(200)'),
 [Victim] = case when Deadlock.Process.value('@id', 'varchar(50)') = CTE.[DeadlockGraph].value('/deadlock-list[1]/deadlock[1]/@victim', 'varchar(50)') then 1 else 0 end,
 [Procedure] = Deadlock.Process.value('executionStack[1]/frame[1]/@procname[1]', 'varchar(200)'),
 [LockMode] = Deadlock.Process.value('@lockMode', 'char(1)'),
 [Code] = Deadlock.Process.value('executionStack[1]/frame[1]', 'varchar(1000)'),
 [ClientApp] = CASE LEFT(Deadlock.Process.value('@clientapp', 'varchar(100)'), 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(Deadlock.Process.value('@clientapp', 'varchar(100)'),32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(Deadlock.Process.value('@clientapp', 'varchar(100)'), 67, len(Deadlock.Process.value('@clientapp', 'varchar(100)'))-67)
                    ELSE Deadlock.Process.value('@clientapp', 'varchar(100)')
                    END ,
 [HostName] = Deadlock.Process.value('@hostname', 'varchar(20)'),
 [LoginName] = Deadlock.Process.value('@loginname', 'varchar(20)'),
 [TransactionTime] = Deadlock.Process.value('@lasttranstarted', 'datetime'),
 [InputBuffer] = Deadlock.Process.value('inputbuf[1]', 'varchar(1000)')
 from CTE
 CROSS APPLY CTE.[DeadlockGraph].nodes('/deadlock-list/deadlock/process-list/process') as Deadlock(Process)
-- ORDER BY [DeadlockObject], [ClientApp], [Procedure], [Code]
) X
--WHERE Victim = 0 
--WHERE [Procedure] = 'master.dbo.FakeProcedureName'
ORDER BY TimeoutID DESC

Finally, to make all of this happen you’ll need a deadlock trace. Don’t go thinking that I open Profiler, make a deadlock trace, script it out, and run it every time I need one. Reusable code is awesome, and this is reused every time I see an alert come through that we are having too many deadlocks. I have this proc in our Perf database, which is on every SQL Server we manage. So you don’t have to dig through the numbers: it grabs every event for the deadlock graph and nothing else.

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_ServerSideTrace_Deadlock]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ServerSideTrace_Deadlock]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE procedure [dbo].[usp_ServerSideTrace_Deadlock]
    @FilePath nvarchar(1000) = N'D:\Perflogs\PerfTraces\DeadlockTrace',                                            
    @maxfilesize bigint = 25,
    @maxfiles int = 10,
    @TraceDBID bit=0,
    @DBID int=0
AS

declare @rc int
declare @TraceID int

--Create Trace
exec @rc = sp_trace_create @TraceID output, 2, @FilePath, @maxfilesize, NULL, @maxfiles
if (@rc != 0) 
Begin
    select ErrorCode=@rc    
    return
End

-- Set the RPC Completed and SQL Batch Completed events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 148, 11, @on
exec sp_trace_setevent @TraceID, 148, 12, @on
exec sp_trace_setevent @TraceID, 148, 14, @on
exec sp_trace_setevent @TraceID, 148, 1, @on
exec sp_trace_setevent @TraceID, 25, 15, @on
exec sp_trace_setevent @TraceID, 25, 8, @on
exec sp_trace_setevent @TraceID, 25, 32, @on
exec sp_trace_setevent @TraceID, 25, 56, @on
exec sp_trace_setevent @TraceID, 25, 1, @on
exec sp_trace_setevent @TraceID, 25, 9, @on
exec sp_trace_setevent @TraceID, 25, 57, @on
exec sp_trace_setevent @TraceID, 25, 2, @on
exec sp_trace_setevent @TraceID, 25, 10, @on
exec sp_trace_setevent @TraceID, 25, 3, @on
exec sp_trace_setevent @TraceID, 25, 11, @on
exec sp_trace_setevent @TraceID, 25, 12, @on
exec sp_trace_setevent @TraceID, 25, 13, @on
exec sp_trace_setevent @TraceID, 25, 6, @on
exec sp_trace_setevent @TraceID, 25, 14, @on
exec sp_trace_setevent @TraceID, 25, 22, @on
exec sp_trace_setevent @TraceID, 59, 32, @on
exec sp_trace_setevent @TraceID, 59, 56, @on
exec sp_trace_setevent @TraceID, 59, 1, @on
exec sp_trace_setevent @TraceID, 59, 57, @on
exec sp_trace_setevent @TraceID, 59, 2, @on
exec sp_trace_setevent @TraceID, 59, 14, @on
exec sp_trace_setevent @TraceID, 59, 22, @on
exec sp_trace_setevent @TraceID, 59, 3, @on
exec sp_trace_setevent @TraceID, 59, 12, @on

-- Set the Filters
--Default to exclude the trace from Tracing Itself
exec sp_trace_setfilter @TraceID, 10, 0, 7, N'SQL Server Profiler'

--Add other filters as specified by parameters
If @TraceDBID=1
    exec sp_trace_setfilter @TraceID, 3, 0, 0, @DBID

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
return @TraceID

GO

The basic idea of a deadlock is that process 1 gets a lock on data A while process 2 gets a lock on data B. Then process 1 says it needs a lock on data B to continue while process 2 needs a lock on data A to continue. It’s a simple concept, but it gets complex… too complex for me to cover here and say I did it justice. Sometimes a process deadlocks on itself, doing a parallelism deadlock as demonstrated by Jason Strate. Solutions also vary from order of operations (let’s make processes 1 and 2 both get data A first, then they can’t deadlock), tuning (if the locks are released quicker then there’s less time for deadlocks), reducing the number of transactions (the lock on data A was grabbed earlier in this transaction, and there’s no reason these need to be in an explicit transaction), scheduling (these are both processes that have to run between 10 PM and 6 AM, so why do they both start at midnight?), and many other approaches. There isn’t one right answer, and anyone promising a single right answer in one short blog post is selling you short. That’s either a very long blog post or it’s an incomplete answer. I go with option 3: here’s my post on the best way to view them, along with a couple of book chapters to get you started on fixing what you found.
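
If you want to see that basic pattern happen for yourself, here’s a minimal sketch you can run in two query windows; the table names and values are made up purely for the demo, so do this somewhere safe like a sandbox database.

--Setup, run once
CREATE TABLE DataA (ID Int NOT NULL PRIMARY KEY, Value Int NOT NULL)
CREATE TABLE DataB (ID Int NOT NULL PRIMARY KEY, Value Int NOT NULL)
INSERT INTO DataA (ID, Value) VALUES (1, 0)
INSERT INTO DataB (ID, Value) VALUES (1, 0)
GO

--Window 1, step 1: grab a lock on data A and hold it
BEGIN TRAN
UPDATE DataA SET Value = 1 WHERE ID = 1

--Window 2, step 2: grab a lock on data B and hold it
BEGIN TRAN
UPDATE DataB SET Value = 2 WHERE ID = 1

--Window 1, step 3: now it needs data B, so it sits and waits on window 2
UPDATE DataB SET Value = 1 WHERE ID = 1

--Window 2, step 4: now it needs data A, neither side can move, and within seconds
--one of them is picked as the victim and rolled back with error 1205
UPDATE DataA SET Value = 2 WHERE ID = 1

--Cleanup: COMMIT or ROLLBACK in the surviving window, then drop both tables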

Related Content:

SQL Server Concurrency: Locking, Blocking and Row Versioning By Kalen Delaney
FREE eBook or $22 physical book

Troubleshooting SQL Server: A Guide for the Accidental DBA By Jonathan Kehayias and Ted Krueger
FREE eBook or $25 physical book

SQL Server Running Slow


A user calls to say the app or server is slow today.  Here’s a quick summary of how I get started.  It depends on monitoring that lets you compare baselines to recent activity, and I include links to all of the code I use.

The initial question is too vague.  You can’t blame this individual user because this is how 95% of issues make it to you.  The initial follow-up will be the following questions I ask the user, along with the corresponding question your boss is asking you:

  • Can anyone else duplicate your issue?
    • Is this the user’s machine or a server issue?  Not a 100% conclusive answer, but it’s a start.
  • Is everything running slow or just one process or screen?
    • Is the whole server running slow or just one stored procedure?
  • Were any changes made to the system recently?  Depending on the user, this may be a question for someone else.
    • Did we cause it by making a change or did this just come up?

However, this is enough for me to do a very quick overall health check while I wait for those answers.  It seems like a lot at first glance, but it’s practically automated with the scripts I have on hand.  This is only opening 5 pre-written scripts, running them, and peeking at the results.

Quick Health:

  • Look at historical wait stats for the last two hours and the same time window both 1 and 7 days ago.  Documented in the Wait Stats post.
  • Look at memory and CPU contention for the same times as wait stats.  You’re specifically looking for high CPU usage, PLE drops, and changes to target memory. Documented in the OS Performance Counters post.
  • Look for blocking in the last hour, even though it would typically show up in wait stats.  You can compare this to previous days if needed.  Documented in the Blocking post.
  • Look at the large queries that completed in the last two hours along with the queries currently running to see if anything pops out.  This is just a quick glance looking for glaring issues, and you can look for details after getting answers to the follow-up questions.  Unfortunately, this information isn’t as cut-and-dry as the rest, and it’s going to rely on you being a little familiar with what normal is on this server.  Completed queries are documented in the Tracing Creation and Reading posts, and currently running queries are documented in the Running Processes post.
  • Look at the queries recompiled recently to see if a new execution plan is taking excessive resources.  Documented in the Recently Recompiled Resource Hogs post.

So, where do you go from here?  Yeah, you get the typical “it depends” answer from me.  The information you just looked at will normally come up with something that doesn’t look right and needs to be looked into further.  However, there are so many directions this can go from here that a single document can’t cover it.  The purpose of this post is simply to answer the question “Where do I start looking when a user throws the most common question at me?”, nothing more.  Also, I wanted a way to tie all of the random posts together so you could see them working together as the full solution.

Recently Recompiled Resource Hogs


It’s not too uncommon for a query to get a new execution plan that performs a lot worse than it could, and sometimes it’s bad enough to drag the whole server to a halt.  When it’s something obvious, such as a query going from a 2 second duration to 30 seconds, you’ll pick it up on a standard monitoring trace.  That’s not always the case, though, and the query that hits your server twice a second going from 100 ms to 900 ms can fly under the radar, except for users saying the server isn’t as responsive as it was yesterday.

Before you go all out and throw up an intrusive trace that can slow down your server capturing an unfiltered sample of RPC:Completed and SQL:BatchCompleted to run aggregates on, see what the DMVs can tell you.  If you look in sys.dm_exec_query_stats you can get the total amount of resources used since the last recompilation, which is going to give you pretty much everything you need as long as you’re looking for a query that can stay in the cache.  The only trick is to look at everything in the same context.  Something that did 10,000,000 reads since it was compiled yesterday is nothing compared to something that did 1,000,000 reads since it was compiled 10 minutes ago.  Here’s my answer to this:

SELECT top 10 --WorkerTime_Sec = cast(round(total_worker_time / 1000 / 1000.0, 0) as Int)
    WorkerTimeSec_PerSec =  cast(cast(round(total_worker_time / 1000 / 1000.0, 0) as Int) / (Cast(DateDiff(second, Creation_Time, GetDate()) as DEC(20,2)) + .00000001) as DEC(20,2))
    --, LogicalReads_K = total_logical_reads / 1000
    , LogicalReads_K_PerSec = cast(total_logical_reads / 1000 / (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    --, ElapsedTime_Sec = cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)
    , ElapsedTimeSec_PerSec = cast(cast(round(total_elapsed_time / 1000 / 1000.0, 0) as Int)/ (Cast(DateDiff(second, Creation_Time, GetDate())as DEC(20,2)) + .00000001) as DEC(20,2))
    , Seconds_Since_Recompile = DateDiff(second, Creation_Time, GetDate())
    , Creation_Time
    , qs.execution_count
    , avg_reads = total_logical_reads / qs.execution_count
    , avg_elapsed_ms = total_elapsed_time / 1000 / qs.execution_count
    , avg_worker_ms = total_worker_time / 1000 / qs.execution_count
    , Offset = statement_start_offset
    , ObjectName = OBJECT_NAME(qt.objectid, qt.dbid)
    , qt.text
    , StatementText = SUBSTRING (qt.text, qs.statement_start_offset/2 + 1,
        abs(CASE WHEN qs.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2 
            ELSE qs.statement_end_offset END - qs.statement_start_offset)/2 + 1)
    --, qt.*
    --, qs.*
FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE Creation_Time > DateAdd(Hour, -24, GetDate())
    AND (creation_time < DATEADD(SECOND, -60, GETDATE()) or execution_count > 5) --Has to be a minute old or executed 6 times before I consider the data stable
    --AND (total_worker_time > 20000 or execution_count > 15000)
ORDER BY 1 DESC 

I comment out what I don’t typically use, same as always with me.  Those pieces don’t usually change.  The only thing that does typically change is the date filter on the creation time; if someone says there weren’t any problems this time yesterday then it stays at the default value, and if they say it worked fine before lunch then it drops down to 4 hours.

There are four things to keep in mind with this DMV.  The times are in microseconds, which I don’t use, so I “fixed” that in my query.  It will only hold data while the query is in cache, which means a query is never tracked here if it’s never in cache.  Each row is for a single plan for a single statement, although you could easily aggregate it if you wanted to.  Also, it’s possible for the worker time or duration per second to be over 1 due to multiple simultaneous executions or, for worker time, parallelism.  I typically try not to restate what you should be reading on MSDN or BOL, but these aren’t things everyone expects.
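
If you do want that aggregation, here’s a rough sketch that rolls the statement-level rows up to one row per batch or proc by sql_handle; it’s nothing fancy, just a starting point if you want the totals in one place.

IF OBJECT_ID('tempdb..#ByHandle') IS NOT NULL BEGIN
    DROP TABLE #ByHandle
END

SELECT qs.sql_handle
    , executions = SUM(qs.execution_count)
    , total_reads_K = SUM(qs.total_logical_reads) / 1000
    , total_worker_sec = SUM(qs.total_worker_time) / 1000 / 1000
    , earliest_compile = MIN(qs.creation_time)
INTO #ByHandle
FROM sys.dm_exec_query_stats qs
GROUP BY qs.sql_handle

SELECT TOP 10 ObjectName = OBJECT_NAME(qt.objectid, qt.dbid)
    , qt.text
    , h.executions
    , h.total_reads_K
    , h.total_worker_sec
    , h.earliest_compile
FROM #ByHandle h
    CROSS APPLY sys.dm_exec_sql_text(h.sql_handle) qt
ORDER BY h.total_reads_K DESC

DROP TABLE #ByHandle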

For me to run the query above, I typically have to already have a complaint that a server is running slower than expected. Check out my Server Running Slow post to see everything else that runs along with this.

Reading Traces


Capturing information is useless unless you know how to use it. Here’s what I use to get the most out of my basic trace of anything taking over so many seconds. This is also my template for more invasive traces, and it gets tweaked on a case-by-case basis.

It’s my typical style, with half of it commented out so that less frequently run portions can be highlighted and run, and so portions of my WHERE clause don’t have to be rewritten to be added back in. The first line gets highlighted as soon as I open this script every time so I can run it, copy the path of the trace I’m looking for, and paste it into my main script. As for the WHERE clause, call me lazy or whatever it is you call people who can’t remember the right word, but I don’t always remember the exact wording of each column name that I want to filter by. I simply uncomment them as I need them.

The only thing I can say is really impressive here is finding the job names, which is a piece of this script I will never forget figuring out. Other than that, it’s just little things to help readability, both when running this script and when playing with the WHERE clause. I can’t tell you why duration is in microseconds and CPU is in milliseconds, but I can say the SELECT and WHERE clauses are set up to use the same units I would use when I talk to someone.

--SELECT * FROM sys.traces

/*
--sp_trace_setstatus id, status
exec sp_trace_setstatus 3,0 --stop
exec sp_trace_setstatus 3,2 --close
--0 stop
--1 start
--2 close
*/

--query running trace
SELECT TOP 10000 Database_Name = DB_Name(DatabaseID)
    , DatabaseName
    , starttime
    , endtime
    , textdata = cast(textdata as varchar(4000)) 
    , Duration_Sec = cast(duration/1000/1000.0 as Dec(10,1)) 
    , CPU_Sec = cast(cpu/1000.0 as Dec(10,1)) 
    , Reads_K = cast(reads/1000.0 as Dec(10,0)) 
    , Writes_K = Cast(writes/1000.0 as Dec(10,1)) 
    , hostname  
    , LoginName 
    , NTUserName 
    , ApplicationName = CASE LEFT(ApplicationName, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(ApplicationName,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(ApplicationName, 67, len(ApplicationName)-67)
                    ELSE ApplicationName
                    END 
    , SPID
    , TE.*
FROM fn_trace_gettable('D:\Traces\DurationOver10sec_110.trc',default) T
    LEFT JOIN sys.trace_events TE ON  T.EventClass = TE.trace_event_id
WHERE endtime > DATEADD(Hour, -3, GetDate()) 
    --AND Duration > 1000000 * 15 --in seconds
    --AND CPU > 1000 * 10 --in seconds
    --AND Writes > 10000
    --AND Reads > 1000000
    --AND te.category_id = 2 --Auto Grow/Shrink found in default trace
    --AND t.databaseid = DB_ID('master')
    --AND t.DatabaseName = 'tempdb'
    --AND t.LoginName <> 'shood'
    --AND t.textdata like '%%'
ORDER BY endTime DESC, starttime DESC

So, for those of you not too familiar with tracing, what does all of this mean and what am I looking for? It depends on what’s wrong. The day after an app’s upgrade I’m looking for procs that snuck through testing and are killing the server. When a user complains about performance you might see a new proc showing up because it’s getting a bad execution plan. If tempdb or the logs fill up I can tell you what had 100,000 writes and just happened to end at that exact moment. Sometimes I’ll glance to see if anything is hitting or approaching the app’s 30 second timeout.

A general trace, such as the RPC:Completed and SQL:BatchCompleted events that take over X seconds, won’t answer all of your questions, but it will be a good start. The most important things are to always have something running (it could be extended events instead of a trace) to tell you about the big things that are running, and to keep your monitoring limited enough that you aren’t the one slowing down the server. Make sure it’s customized to each server, since a 1 second duration in a web app is a big deal, while on other servers you’d get flooded with results unless you bump it up to 10 seconds. You can run more invasive traces when you need them.

There are two reasons I’m still using traces for this instead of extended events. First, I still have too many SQL 2005 servers around and like to do things as consistently as reasonably possible across all servers to keep it simple. Second, I’m still not too happy about extended events skipping SQL:BatchCompleted in SQL 2008, as that misses out on finding large loops of small statements. Extended events are much more efficient and can do a lot that traces can’t, but they aren’t a 100% replacement for traces.
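
For what it’s worth, if you’re on SQL 2012 or later where the sql_batch_completed event exists, a minimal extended events version of the same idea looks roughly like this; the session name and file path are mine, and note that duration here is in microseconds just like the trace column.

CREATE EVENT SESSION DurationOver10Sec ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_hostname, sqlserver.username)
    WHERE duration >= 10000000),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_hostname, sqlserver.username)
    WHERE duration >= 10000000)
ADD TARGET package0.event_file (SET filename = N'D:\Traces\DurationOver10Sec.xel', max_file_size = 25, max_rollover_files = 5)
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS)
GO

ALTER EVENT SESSION DurationOver10Sec ON SERVER STATE = START
GO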

Comment on my posts, let me know what you think. Every post I have at Simple SQL Server is written to myself 5 years ago, and is exactly what I wish I had back then in my own mind. The key words being “in my own mind”, and I’d like to hear what’s on your mind. Be it constructive criticism on this post or a completely separate SQL Server subject you wish you knew more about either now or 5 years ago that you’d like to see thrown up for everyone to use, I’ll never know until you tell me.

Tracing Introduction


SQL Server tracing is essential for troubleshooting performance issues, yet it can put enough load on your server to cause noticeable slowness, capture so much information that it fills the drive, and I’ve even seen it cause a cluster failover more than once.  With a tool this powerful, you just need to know how to use it before you use it.  It’s almost like the idea where companies want their entry-level people to have a year of experience.

There is one trace I’d like to see on every server, preferably as a startup proc or a job that starts automatically when SQL Agent starts.  It doesn’t matter how you do it, if you want me to be happy you’ll tell me this was running for at least 24 hours before you had a performance incident.  All I’m looking for is RPC:Completed and SQL:BatchCompleted with a reasonable filter of duration over 10 seconds.

If you’ve never used Profiler before, here’s a quick overview of how to make the server-side trace I’m using.  Leave the first screen at the defaults, with the possible exception of changing the template to blank.  Select RPC:Completed and SQL:BatchCompleted and set the Duration filter to greater than or equal to 10,000 ms, which you can get to using the “Column Filters…” button.  You can collect all columns if you want to; it won’t hurt with these events, although there are some columns that won’t help.

Create trace with duration filter

Run the trace, then you can export it to a script.  I’m not happy that you have to run it first, but in the end this is either something that doesn’t affect the server at all or something I run on a dev server that’s at the same patch level.

Export trace to file

There, you have the basic script to start the server-side trace.  You still need to tweak it a bit before you can use it, but the hard part is done.  Here’s what I changed from the export.  The default max file size was changed from 5 MB to 25 MB because I view 25 to 50 MB as the range that’s not annoyingly small yet not too big to move around or send to someone easily.  The second parameter of sp_trace_create is set to 2 to enable rollover files, the third parameter is set to my output file sans an extension, and the last parameter is added to say how many output files there are.

/****************************************************/
/* Created by: SQL Server 2008 R2 Profiler          */
/* Date: 06/16/2013  06:10:02 AM         */
/****************************************************/

-- Create a Queue
declare @rc int
declare @TraceID int
declare @maxfilesize bigint
set @maxfilesize = 25

-- Please replace the text InsertFileNameHere, with an appropriate
-- filename prefixed by a path, e.g., c:\MyFolder\MyTrace. The .trc extension
-- will be appended to the filename automatically. If you are writing from
-- remote server to local drive, please use UNC path and make sure server has
-- write access to your network share

exec @rc = sp_trace_create @TraceID output, 2, N'D:\Traces\DurationOver10Sec', @maxfilesize, NULL, 5
if (@rc != 0) goto error

-- Client side File and Table cannot be scripted

-- Set the events
declare @on bit
set @on = 1
exec sp_trace_setevent @TraceID, 10, 15, @on
exec sp_trace_setevent @TraceID, 10, 31, @on
exec sp_trace_setevent @TraceID, 10, 8, @on
exec sp_trace_setevent @TraceID, 10, 16, @on
exec sp_trace_setevent @TraceID, 10, 48, @on
exec sp_trace_setevent @TraceID, 10, 1, @on
exec sp_trace_setevent @TraceID, 10, 9, @on
exec sp_trace_setevent @TraceID, 10, 17, @on
exec sp_trace_setevent @TraceID, 10, 10, @on
exec sp_trace_setevent @TraceID, 10, 18, @on
exec sp_trace_setevent @TraceID, 10, 34, @on
exec sp_trace_setevent @TraceID, 10, 3, @on
exec sp_trace_setevent @TraceID, 10, 11, @on
exec sp_trace_setevent @TraceID, 10, 51, @on
exec sp_trace_setevent @TraceID, 10, 12, @on
exec sp_trace_setevent @TraceID, 10, 13, @on
exec sp_trace_setevent @TraceID, 10, 14, @on
exec sp_trace_setevent @TraceID, 12, 15, @on
exec sp_trace_setevent @TraceID, 12, 31, @on
exec sp_trace_setevent @TraceID, 12, 8, @on
exec sp_trace_setevent @TraceID, 12, 16, @on
exec sp_trace_setevent @TraceID, 12, 48, @on
exec sp_trace_setevent @TraceID, 12, 1, @on
exec sp_trace_setevent @TraceID, 12, 9, @on
exec sp_trace_setevent @TraceID, 12, 17, @on
exec sp_trace_setevent @TraceID, 12, 10, @on
exec sp_trace_setevent @TraceID, 12, 14, @on
exec sp_trace_setevent @TraceID, 12, 18, @on
exec sp_trace_setevent @TraceID, 12, 3, @on
exec sp_trace_setevent @TraceID, 12, 11, @on
exec sp_trace_setevent @TraceID, 12, 51, @on
exec sp_trace_setevent @TraceID, 12, 12, @on
exec sp_trace_setevent @TraceID, 12, 13, @on

-- Set the Filters
declare @intfilter int
declare @bigintfilter bigint

set @bigintfilter = 10000000
exec sp_trace_setfilter @TraceID, 13, 0, 4, @bigintfilter

-- Set the trace status to start
exec sp_trace_setstatus @TraceID, 1

-- display trace id for future references
select TraceID=@TraceID
goto finish

error: 
select ErrorCode=@rc

finish: 
go

This is done as a server-side trace for a couple of reasons. Profiler needs to stay open to keep collecting data, SQL Server has to send the data to Profiler (usually across the network), and the trace data is held in memory. None of these are a problem for a properly filtered short-term trace. This is properly filtered, but I’m really hoping everyone will run this long-term.

I owe you another blog post on querying traces, then we can move on to putting it all together with where to start when a user makes their favorite statement to you… “The server is running slow”.

White papers, DMVs and Monitoring Concepts


According to Jason Strate (SQL Server MVP) there are 5 white papers every SQL Server DBA should read. Considering my blogging is focused on how to monitor and tune previously installed servers, we’ll focus on the first two of these white papers: Troubleshooting Performance Problems in SQL Server 2008 and SQL Server 2005 Waits and Queues. Ignore the age of these papers; after all, you’re running mostly the same code to do mostly the same tasks in SQL Server 6.5 and SQL Server 2012… well, except for the lack of DMVs in 6.5.

I don’t agree with them 100%, but not to the point that I don’t agree 100% that everyone should read them. The sources of the information, the uses of the DMVs, all of it is what you’ll need to know to progress as a DBA. Ask any DBA III or higher and they’ll be able to tell you most of what’s said in these papers, even if they went the hard route and made it to where they are without reading them. Where they fall short is in giving static values that aren’t worded as just starting points, and in the fact that the trending they do is the kind of trending you can do as a Microsoft Support Engineer, where nothing is saved or long-term.

My differences start with minor complaints, such as the blanket statement that Page Life Expectancy under 300 seconds is when you should start to worry. First, it’s a blanket statement, and I’m a firm believer that the word “always” means you’re almost always wrong. Second, servers have more memory now, and my SAN team would kill me if I told them I would have to read 500 GB of data every 300 seconds on some of my bigger servers. I do agree that there are times you should worry about PLE, but you have to keep two things in mind: how much memory you’re cycling through in that time, and what’s normal for this specific server.  I just hope that regular drops down to 3,000 are closer to the point that grabs your attention.
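
To put a number on “how much memory are you cycling through,” here’s a rough sketch that divides the data sitting in the buffer pool by PLE to get an approximate churn rate; the 500 GB every 300 seconds I mentioned works out to roughly 1.7 GB of reads per second, which is the math that would make my SAN team cry.

SELECT BufferPool_GB = db_pages.cntr_value * 8 / 1024.0 / 1024.0 --Database pages are 8 KB each
    , PLE_Sec = ple.cntr_value
    , Churn_GB_Per_Sec = (db_pages.cntr_value * 8 / 1024.0 / 1024.0) / NULLIF(ple.cntr_value, 0)
FROM sys.dm_os_performance_counters db_pages
    CROSS JOIN sys.dm_os_performance_counters ple
WHERE db_pages.counter_name = N'Database pages'
    AND db_pages.object_name LIKE N'%Buffer Manager%'
    AND ple.counter_name = N'Page life expectancy'
    AND ple.object_name LIKE N'%Buffer Manager%'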

Let’s build off of that second point… what’s normal for this specific server? The queries they’re giving you are snapshots of what info the server has for you right now. Some of that is what’s happening now, some is what has happened since the last restart of SQL services, and other pieces fall somewhere in the middle. You can’t tell me that I’ve done X amount of work since the server was rebooted two weeks ago and expect me to tell you if that’s an issue. You also can’t tell me that “This proc has used more CPU than any other by far since that reboot two weeks ago” and expect me to tell you if that’s a critical issue or an intense, yet normal and expected, off-hours maintenance task. You can, however, tell me that on a typical Monday between 2:00 PM and 5:00 PM we do X, but this Monday we’re doing 100 times that while the business has no reason to say this Monday should be any different. Now we’re on to something… we know there’s an issue, and we just found our starting point to solving it. Chances are if you’re reading this then either you or your immediate manager would be able to create a database to keep this trending information if it doesn’t exist already, and you’re making it harder on yourself if you don’t.

These two white papers are almost 200 pages long together, so I’m not going to pretend to cover them in one post. Actually, I’m hoping my next 50 posts might cover most of them. The points I’m going for are that these white papers are telling you the perfect stats to watch, and these stats are going to be more meaningful when you monitor them. I’ve started this in some of my posts such as looking at Wait Stats and Blocking, but there’s a long way to go. More than these two DMVs need to be watched like this, but they are good examples of wait stats capturing cumulative data and blocking capturing point-in-time data from DMVs.

If you see a DMV in these papers, you’ll see it in my posts. If it’s not there yet, it will be. Some, like dm_exec_sessions, will never have a dedicated post unless you count Running Processes, which links it to several DMVs. Others, like dm_os_performance_counters, might not fit comfortably in a single post. The point here is that all the DMVs, especially the ones mentioned here, are worth looking into more.

Read the papers. Always go beyond the question of “what should I do”, get to the “why should I do it”, and strive for “how can I make this mean more and be more useful”. Never take my scripts or anyone else’s as-is. Tinker with them, really learn the DMVs, and it will all start to come together.

Blocking – Capturing and Monitoring


If a query is taking longer to run than normal, there’s a good chance it’s being blocked by something else. This is especially true when you’re doing something ridiculously simple and SQL Server just sits there thinking. Symptoms of blocking problems include a trace of SQL:BatchCompleted and RPC:Completed events with durations over 10 seconds coming back with results that used less than a second of CPU time and very few reads and writes. Also, if you’re watching your wait stats, you’ll start to see more waits that start with LCK_ than normal.  Not to mention the overly obvious blocking found in your Running Processes.

This query will show you the blocking currently occurring on your server.

SELECT tl.resource_type
    , database_name = DB_NAME(tl.resource_database_id)
    , assoc_entity_id = tl.resource_associated_entity_id
    , lock_req = tl.request_mode
    , waiter_sid = tl.request_session_id
    , wait_duration = wt.wait_duration_ms
    , wt.wait_type
    , waiter_batch = wait_st.text
    , waiter_stmt = substring(wait_st.text,er.statement_start_offset/2 + 1,
                abs(case when er.statement_end_offset = -1
                then len(convert(nvarchar(max), wait_st.text)) * 2
                else er.statement_end_offset end - er.statement_start_offset)/2 + 1)
    , waiter_host = es.host_name
    , waiter_user = es.login_name
    , blocker_sid = wt.blocking_session_id
    , blocker_stmt = block_st.text 
    , blocker_host = block_es.host_name
    , blocker_user = block_es.login_name
FROM sys.dm_tran_locks tl (nolock)
    INNER JOIN sys.dm_os_waiting_tasks wt (nolock) ON tl.lock_owner_address = wt.resource_address
    INNER JOIN sys.dm_os_tasks ot (nolock) ON tl.request_session_id = ot.session_id AND tl.request_request_id = ot.request_id AND tl.request_exec_context_id = ot.exec_context_id
    INNER JOIN sys.dm_exec_requests er (nolock) ON tl.request_session_id = er.session_id AND tl.request_request_id = er.request_id
    INNER JOIN sys.dm_exec_sessions es (nolock) ON tl.request_session_id = es.session_id
    LEFT JOIN sys.dm_exec_requests block_er (nolock) ON wt.blocking_session_id = block_er.session_id
    LEFT JOIN sys.dm_exec_sessions block_es (nolock) ON wt.blocking_session_id = block_es.session_id 
    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) wait_st
    OUTER APPLY sys.dm_exec_sql_text(block_er.sql_handle) block_st

However, if you have time to watch your servers all day and keep running that query, then you’re overstaffed in a world that leans towards being understaffed. For way too many reasons to list here, throw the results of this into a table every minute. If someone says a query ran long and you see the server was waiting on locks, look here. If you want to be proactive and look for ways you can improve server performance, look here.

CREATE TABLE Blocking (
    BlockingID BigInt Identity(1,1) NOT NULL
    , resource_type NVarChar(60)
    , database_name SysName
    , assoc_entity_id BigInt
    , lock_req NVarChar(60)
    , wait_spid Int
    , wait_duration_ms Int
    , wait_type NVarChar(60)
    , wait_batch NVarChar(max)
    , wait_stmt NVarChar(max)
    , wait_host SysName
    , wait_user SysName
    , block_spid Int
    , block_stmt NVarChar(max)
    , block_host SysName
    , block_user SysName
    , DateAdded datetime NOT NULL DEFAULT (GetDate())
)
GO

CREATE UNIQUE CLUSTERED INDEX IX_Blocking_DateAdded_BlockingID_U_C ON Blocking
(
    DateAdded
    , BlockingID
) WITH (Fillfactor = 95)
GO
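
The insert itself is just the query from above pointed at that table; here’s a minimal sketch of what goes in the job that runs every minute. The ISNULLs are only there because sysname columns default to NOT NULL and the blocker’s session can occasionally be gone by the time you look.

INSERT INTO Blocking (resource_type, database_name, assoc_entity_id, lock_req, wait_spid, wait_duration_ms
    , wait_type, wait_batch, wait_stmt, wait_host, wait_user, block_spid, block_stmt, block_host, block_user)
SELECT tl.resource_type
    , DB_NAME(tl.resource_database_id)
    , tl.resource_associated_entity_id
    , tl.request_mode
    , tl.request_session_id
    , wt.wait_duration_ms
    , wt.wait_type
    , wait_st.text
    , substring(wait_st.text,er.statement_start_offset/2 + 1,
                abs(case when er.statement_end_offset = -1
                then len(convert(nvarchar(max), wait_st.text)) * 2
                else er.statement_end_offset end - er.statement_start_offset)/2 + 1)
    , es.host_name
    , es.login_name
    , wt.blocking_session_id
    , block_st.text
    , ISNULL(block_es.host_name, '')
    , ISNULL(block_es.login_name, '')
FROM sys.dm_tran_locks tl (nolock)
    INNER JOIN sys.dm_os_waiting_tasks wt (nolock) ON tl.lock_owner_address = wt.resource_address
    INNER JOIN sys.dm_os_tasks ot (nolock) ON tl.request_session_id = ot.session_id AND tl.request_request_id = ot.request_id AND tl.request_exec_context_id = ot.exec_context_id
    INNER JOIN sys.dm_exec_requests er (nolock) ON tl.request_session_id = er.session_id AND tl.request_request_id = er.request_id
    INNER JOIN sys.dm_exec_sessions es (nolock) ON tl.request_session_id = es.session_id
    LEFT JOIN sys.dm_exec_requests block_er (nolock) ON wt.blocking_session_id = block_er.session_id
    LEFT JOIN sys.dm_exec_sessions block_es (nolock) ON wt.blocking_session_id = block_es.session_id
    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) wait_st
    OUTER APPLY sys.dm_exec_sql_text(block_er.sql_handle) block_st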

So, now you know what your blocking problems are. How do you fix them? That’s a deeper dive than I’ll be doing now, but here are the key words to look into:

  • NOLOCK hint
  • Query tuning
  • Process timing
  • Eliminate cursors

Every situation is going to be unique, and I can’t pretend like I can give you all the answers. Finding that there is an issue can be difficult enough, and those are the answers I can help with.

Wait Stats – Monitoring and Using


sys.dm_os_wait_stats is one of the most important DMVs out there, and one that you should know the historical values of for every server you care about. This DMV will tell you the cumulative amount of time you waited for each wait type since SQL Server services started, which is nice to know, but limited in its usefulness as-is.

If, however, you knew what the values were 30 minutes ago and what they are now, then you could know why recent queries were waiting. Take it a step further and know what you were waiting for in the same 30 minute period yesterday and you have a bit of a baseline to go off of. Now you can compare what you’re waiting for now to your baseline, so you can say if what you’re seeing is just normal for the server or if you’re seeing a true incident. Remember, the only way to know whether the numbers on a server are a problem or not is to know what’s normal for that specific server.

A half-hour increment is a bit much for me, and I would prefer to know it every 5 minutes to give a little more detail. Also, just looking at yesterday’s stats for comparison isn’t good enough for me, especially when looking into an incident on a Monday where Sunday isn’t a very good baseline. Because of that, I want to look at the baseline data from the same time period yesterday and exactly a week ago.

To capture the data you need to capture a snapshot of sys.dm_os_wait_stats along with the time you grabbed it. Then grab it again, do a diff between the two values, and save that off with the starting and ending times. Keep this running every so many minutes in a SQL job and you have a running baseline to compare what you’re seeing now to what the server is typically doing.

After that, keep the data trimmed down to what you’d actually use. Personally, I’m a huge fan of 13 months. This running every 5 minutes for 13 months will accumulate about 750 MB of data, and you can adjust these numbers as you see fit.

Here are the tables; I comment out stuff if I would cry if I accidentally ran it at the wrong time.

/*
IF OBJECT_ID('WaitStats') IS NOT NULL BEGIN
    DROP TABLE WaitStats
END

IF OBJECT_ID('WaitStatsLast') IS NOT NULL BEGIN
    DROP TABLE WaitStatsLast
END
*/
GO

CREATE TABLE WaitStats (
    DateStart datetime
    , DateEnd datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStats_DateStart_waittype_U_C ON WaitStats
(
    DateStart
    , wait_type
) WITH (Fillfactor = 95)
GO

CREATE TABLE WaitStatsLast (
    DateAdded datetime
    , wait_type nvarchar(60)
    , waiting_tasks_count bigint
    , wait_time_ms bigint
    , max_wait_time_ms bigint
    , signal_wait_time_ms bigint
)
GO

CREATE CLUSTERED INDEX IX_WaitStatsLast_waittype_U_C ON WaitStatsLast
(
    wait_type
) WITH (Fillfactor = 95)
GO

Notice the lack of creativity. It is very intentional. The code is more stable as I’m copying the field sizes directly from the DMV, and the data is more legible to an outsider (new employee, Microsoft engineer, consultant, vendor) because the fields are exactly the same as the DMV they’re used to seeing. I’m not a fan of non-creative solutions, so there’s a reason to go with it when I decide to be boring.

Now to populate the tables. This script will need to be run on a regular schedule. Preferably, throw it in a proc and run that proc every 5 minutes, but that’s just my opinion in a discussion about how to watch your servers.

IF OBJECT_ID('tempdb..#WaitStatsNew') IS NOT NULL BEGIN
    DROP TABLE #WaitStatsNew 
END

SELECT DateAdded = GETDATE()
    , wait_type
    , waiting_tasks_count 
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms 
INTO #WaitStatsNew
FROM sys.dm_os_wait_stats

INSERT INTO WaitStats (DateStart, DateEnd, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateStart = ISNULL(l.DateAdded, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , DateEnd = n.DateAdded
    , wait_type = n.wait_type
    , waiting_tasks_count = n.waiting_tasks_count - ISNULL(l.waiting_tasks_count, 0)
    , wait_time_ms = n.wait_time_ms - ISNULL(l.wait_time_ms, 0)
    , max_wait_time_ms = n.max_wait_time_ms --It's a max, not cumulative
    , signal_wait_time_ms = n.signal_wait_time_ms - ISNULL(l.signal_wait_time_ms, 0)
FROM #WaitStatsNew n
    LEFT OUTER JOIN WaitStatsLast l ON n.wait_type = l.wait_type AND l.DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

TRUNCATE TABLE WaitStatsLast 

INSERT INTO WaitStatsLast (DateAdded, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms)
SELECT DateAdded
    , wait_type
    , waiting_tasks_count
    , wait_time_ms
    , max_wait_time_ms
    , signal_wait_time_ms
FROM #WaitStatsNew 

DROP TABLE #WaitStatsNew
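
To handle the 13-month retention I mentioned earlier, the same batch-delete pattern from my performance counter cleanup works here; a quick sketch:

SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM WaitStats
    WHERE DateStart < DATEADD(Month, -13, GETDATE())
END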

Finally, querying it. Give it a decent time frame; I typically start at 2 hours to get an overall picture without having a single event skew the numbers too much. Then look at your most relevant baseline data – yesterday and 7 days ago during the same timeframe.

DECLARE @dStart datetime
    , @dEnd datetime

SELECT @dStart = DATEADD(Hour, -2, GETDATE())
    , @dEnd = GETDATE()

IF OBJECT_ID('tempdb..#IgnoredWaits') IS NOT NULL BEGIN
    DROP TABLE #IgnoredWaits
END

CREATE TABLE #IgnoredWaits (Wait_Type NVarChar(60) PRIMARY KEY) 
--/*
--Values taken from p25 of "Troubleshooting SQL Server - A Guide for the Accidental DBA" by Jonathan Kehayias and Ted Krueger  
INSERT INTO #IgnoredWaits 
SELECT 'BAD_PAGE_PROCESS'
UNION SELECT 'BROKER_EVENTHANDLER'
UNION SELECT 'BROKER_RECEIVE_WAITFOR'
UNION SELECT 'BROKER_TASK_STOP'
UNION SELECT 'BROKER_TO_FLUSH'
UNION SELECT 'BROKER_TRANSMITTER'
UNION SELECT 'CHECKPOINT_QUEUE'
UNION SELECT 'CLR_AUTO_EVENT'
UNION SELECT 'CLR_MANUAL_EVENT'
UNION SELECT 'DBMIRROR_EVENTS_QUEUE'
UNION SELECT 'DISPATCHER_QUEUE_SEMAPHORE'
UNION SELECT 'FT_IFTS_SCHEDULER_IDLE_WAIT'
UNION SELECT 'FT_IFTSHC_MUTEX'
UNION SELECT 'KSOURCE_WAKEUP'
UNION SELECT 'LAZYWRITER_SLEEP'
UNION SELECT 'LOGMGR_QUEUE'
UNION SELECT 'ONDEMAND_TASK_QUEUE'
UNION SELECT 'PREEMPTIVE_OS_AUTHENTICATIONOPS'
UNION SELECT 'PREEMPTIVE_OS_GETPROCADDRESS'
UNION SELECT 'REQUEST_FOR_DEADLOCK_SEARCH'
UNION SELECT 'RESOURCE_QUEUE'
UNION SELECT 'SLEEP_BPOOL_FLUSH'
UNION SELECT 'SLEEP_SYSTEMTASK'
UNION SELECT 'SLEEP_TASK' 
UNION SELECT 'SQLTRACE_BUFFER_FLUSH'
UNION SELECT 'WAITFOR'
UNION SELECT 'XE_DISPATCHER_JOIN'
UNION SELECT 'XE_DISPATCHER_WAIT'
UNION SELECT 'XE_TIMER_EVENT'

SELECT TOP 7 TimeFrame = 'Right Now'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart 
    AND DateEnd <= @dEnd 
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Yesterday'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 1
    AND DateEnd <= @dEnd - 1
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

SELECT TOP 7 TimeFrame = 'Last Week'
    , DateStart = MIN(DateStart)
    , DateEnd = MAX(DateEnd)
    , wait_type
    , wait_minutes = SUM(wait_time_ms)/1000/60
FROM WaitStats
WHERE DateStart >= @dStart - 7
    AND DateEnd <= @dEnd - 7
    AND Wait_Type NOT IN (SELECT Wait_Type FROM #IgnoredWaits)
GROUP BY wait_type
ORDER BY 5 DESC

The only thing left to do is wait and interpret the data. All the wait types mean something, with some easily pointing to an issue and others just being normal operations. Doing a dive into what wait types are out there, what they mean to you, and what they mean when they’re not what they were yesterday isn’t something I’m going to get into with this post.  However, there is an Accidental DBA book out there that explains this amazingly well.  Here are links to the FREE PDF Version and the Physical Book.

Indexes – Understanding basic types and their components


The biggest problem developers and newer DBAs have with understanding indexes is that they don’t realize they’re using the exact same thing away from their computer.  Pick up any reference-style book and you have one clustered index and one nonclustered index. The clustered index is also split into the b-tree (the table of contents) and the leaf levels (the actual book, which is also why you can only have one clustered index).  The nonclustered index in the back of the book is, however, very basic.

Here’s how the clustered index to table of contents comparison works.  The table of contents will tell you exactly what page to start on for a specific subject in the book.  The b-tree of a clustered index is the table of contents, and will tell SQL Server exactly what page to start at to look for a specific value or range of values.  Then, when you turn to that page in the book, you can flip to the next page to continue reading until you’ve found everything you’re looking for.  SQL Server does the same thing: one page tells you where the next logical page is, which is typically the next physical page on the disk.  If you get too many instances where the next physical and logical pages aren’t the same, then it’s time to rebuild the index to fix your fragmentation.

The key columns in the clustered index are best viewed in the context of a phone book.  The clustered index there has two key fields, last name and first name, in that order.  The “in that order” is more important than people newer to indexing would guess, and here’s why.  If I asked you to find every person with the last name “Hood” in a phone book, you’d flip through until you found the H’s and find what you were looking for rather quickly.  On the other hand, if I asked you to find every person with the first name “Steve” then you’d get mad at me and I’d have an impression of a phone book on my head.  SQL Server uses more technical terms, where it will seek (clustered index seek) for the last name, and scan (clustered index scan) the entire table for the first name.  Luckily, SQL Server rarely gets mad and just does the clustered index scan reading the entire table.
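
Here’s that phone book in T-SQL form. The table and index names are made up for the example, but with data in the table this is the seek-versus-scan behavior you’d see in the execution plans.

CREATE TABLE PhoneBook
(
    LastName VarChar(50) NOT NULL
    , FirstName VarChar(50) NOT NULL
    , PhoneNumber Char(10) NOT NULL
)

CREATE CLUSTERED INDEX IX_PhoneBook_LastName_FirstName ON PhoneBook
(
    LastName
    , FirstName
)

--Clustered index seek: the leading key column is filtered, so SQL Server jumps straight to the H's
SELECT * FROM PhoneBook WHERE LastName = 'Hood'

--Clustered index scan: the leading key column isn't filtered, so SQL Server reads the whole table
SELECT * FROM PhoneBook WHERE FirstName = 'Steve'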

A nonclustered index follows the same rules for the key columns, and if you’re not searching by the first key column then you’re reading the whole index.  A nonclustered index, however, is a bit different from a book in that it has more information.  In the book example you have a single key field, and it automatically includes the clustered index key (the page number) as a key field so you can look up the rest of the information.  In SQL Server, your nonclustered index can have more than one key column and can include more information than just the clustered index key field(s).  The clustered index key will always be implicitly added to every nonclustered index, and SQL Server will use that information to both look up any fields not in the nonclustered index and ensure each row is actually unique, even in a nonunique index.  Since you’re duplicating all the key fields of a clustered index like this, it really makes you think twice before making a wide clustered index that has multiple key fields. Kendra Little did an amazing job of taking a deep dive into this in her post “How to Find Secret Columns in Nonclustered Indexes”.

Why would you want more than one key field? If you have a query that's looking at the employee table for the entire federal government, and you constantly run queries looking for people according to what part of the government they work for, their last name, and gender, then a single key field on a nonclustered index will leave you searching through hundreds of thousands of records. On the other hand, with three key fields in the order I listed above, you'd quickly get down to the mere hundreds of records that work for the Army, have a last name of Hood, and are male.
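
As a sketch, that three-key index would look something like this. The Employee table and its columns are made up for the example, and the key column order matches the way the query filters.

--Made-up table for the federal employee example
CREATE TABLE dbo.Employee
(
    EmployeeID BigInt NOT NULL Identity(1,1) PRIMARY KEY
    , Department VarChar(100) NOT NULL
    , LastName VarChar(50) NOT NULL
    , FirstName VarChar(50) NOT NULL
    , Gender Char(1) NOT NULL
    , Salary Money NOT NULL
)

--Key column order matches how the query filters: department, then last name, then gender
CREATE NONCLUSTERED INDEX IX_Employee_Department_LastName_Gender
    ON dbo.Employee (Department, LastName, Gender)

--Seeks straight to the Army / Hood / male rows; FirstName still needs a key lookup
SELECT EmployeeID, Department, LastName, FirstName, Gender
FROM dbo.Employee
WHERE Department = 'Army'
    AND LastName = 'Hood'
    AND Gender = 'M'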

In addition, you can include fields in a nonclustered index. There are two reasons for this. First, you may be able to filter down your results even further. Second, you may not have to flip through your clustered index (a key lookup for those looking at your execution plans) to get the rest of the information you require. Say in the example above you only had the key fields for the department and last name in your index, but you included the gender column. The nonclustered index would still store the gender, but it wouldn't store it in order. Therefore SQL would have to read through all the records where the department and last name were the same, but at least you would be able to filter on gender without going back to the clustered index.

If the only thing returned by this query outside of those fields was the salary of the employee, you could also include that field in the index.  This is called a covering index, because the index covers all of your needs.  That means that your entire query would show up in an execution plan as an index seek, no key lookups.  This is awesome, and here’s why.  In my book context above, this is the difference between looking at the index in the back of the book and seeing that the information you need is on pages 2, 5, 8, 11, 45, 88, 128, 224, and 455, or looking in the back of the book and seeing all of the information is right there.
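
Sticking with the made-up Employee table from the sketch above, the included columns and the covering index would look something like this.

--Department and LastName are key columns (sorted); Gender and Salary are included (stored, but not sorted)
CREATE NONCLUSTERED INDEX IX_Employee_Department_LastName_INCL_Gender_Salary
    ON dbo.Employee (Department, LastName)
    INCLUDE (Gender, Salary)

--Everything this query needs lives in the index, so the plan is an index seek with no key lookups
SELECT Department, LastName, Gender, Salary
FROM dbo.Employee
WHERE Department = 'Army'
    AND LastName = 'Hood'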

At this point you may be thinking this is the best thing ever: include everything, make everything a key field. Picture what that would do to the size of a book if absolutely everything was included in the index; it'd be as big as you're about to make your SQL tables. Not only that, but while a book may have static values after it's published, a SQL table keeps on changing. Every time you change a field that's in an index (key field or included field) you have to change that index as well. If you have 5 indexes that have the salary column and you want to give someone a raise, you're really doing 6 updates: one clustered index, and 5 nonclustered indexes.

I think all of us have heaps out there…did I ignore them in this post? No. Every table has a clustered index as far as I'm concerned, it's just handled implicitly (heap) or explicitly (clustered index). A heap has a RID, which is a unique number for each row in a table, and it's implicitly included in every nonclustered index on a table. If anyone can point out an advantage to a RID over a clustered index on a bigint identity column, I'd like to hear it. The only difference I can point out is that you can use the bigint column to reference that row from another table, where you can't do that with a RID.

Also, another pet peeve of mine: every clustered index is unique. Even if you create one without specifying UNIQUE, it's unique behind the scenes. SQL does this by including a uniquifier (it's a real word, even if spell check disagrees) that has a 0 for the first value and is incrementally higher for each duplicate value. This, like a heap, puts an overhead on SQL Server that neither you nor your developers can take advantage of. My answer: make sure every table you care about has a unique clustered index, and make sure you care about every table that has over 50 rows.
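
To put that in code, here's a minimal sketch of a heap, a non-unique clustered index, and the unique clustered index I'd rather see. The table names are made up, and the bigint identity is just my default choice, not the only valid one.

--A heap: no clustered index, so rows are located by RID behind the scenes
CREATE TABLE dbo.HeapExample
(
    SomeValue VarChar(100) NOT NULL
)

--Converting it to a non-unique clustered index gets rid of the RID, but SQL Server
--still adds a hidden uniquifier to any duplicate key values
CREATE CLUSTERED INDEX IX_HeapExample_SomeValue ON dbo.HeapExample (SomeValue)

--What I'd rather see on any table I care about: a unique clustered index,
--here done as a clustered primary key on a bigint identity column
CREATE TABLE dbo.ClusteredExample
(
    ClusteredExampleID BigInt NOT NULL Identity(1,1)
    , SomeValue VarChar(100) NOT NULL
    , CONSTRAINT PK_ClusteredExample PRIMARY KEY CLUSTERED (ClusteredExampleID)
)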

Indexes – Unused and Duplicates


Indexes aren't free, and many databases end up with unused indexes. Every time you make any update to a table you will be updating the clustered index (I assume no heaps, because I hate heaps), and every index that uses one of the columns that were updated. Inserts and deletes affect every column and will affect every index, with exceptions for non-typical indexes such as filtered indexes. These updates cost a lot, to the point that I have several processes that disable certain nonclustered indexes, do all of the work, then rebuild the indexes afterwards.
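
The disable-then-rebuild pattern I'm talking about is just a couple of ALTER INDEX statements. This sketch reuses the made-up Employee index from the section above, and it only makes sense for nonclustered indexes, since disabling a clustered index takes the whole table offline.

--Disable the nonclustered index so the bulk work doesn't have to maintain it
ALTER INDEX IX_Employee_Department_LastName_Gender ON dbo.Employee DISABLE

--...do the large insert/update/delete work here...

--Rebuilding enables the index again with freshly built, unfragmented pages
ALTER INDEX IX_Employee_Department_LastName_Gender ON dbo.Employee REBUILD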

The problem is that you can't tell very easily whether an index is being used or not. You can look at sys.dm_db_index_usage_stats, but that information is very limited in that it gets wiped out on index rebuilds, SQL service restarts (server reboot included), and anytime you ask it to be. That's not mentioning that it also ignores obscure usage of the index. However, the obscure usage is more of something to keep in mind than it is something to base all of your decisions on.
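
For a quick, right-now peek at that DMV before building anything long-term, something like this is enough. Just remember the counters only reflect what has happened since they were last wiped.

SELECT TableName = OBJECT_NAME(us.object_id)
    , IndexName = i.name
    , us.user_seeks
    , us.user_scans
    , us.user_lookups
    , us.user_updates
    , us.last_user_seek
    , us.last_user_scan
FROM sys.dm_db_index_usage_stats us
    INNER JOIN sys.indexes i ON us.object_id = i.object_id AND us.index_id = i.index_id
WHERE us.database_id = DB_ID()
ORDER BY us.user_updates DESC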

From my point of view, there are two answers to this problem.  There isn’t a right and wrong answer, it’s more of a step 1 and step 2.

The long and hard one is to capture the values in sys.dm_db_index_usage_stats on a regular basis (daily, before index maintenance) and give yourself long-term statistics you can use to determine whether an index's usage justifies its cost. Some indexes are updated 1,000,000 times for two uses. You obviously don't need to keep that index around all the time, but you're still paranoid about what those uses were. Were they ad-hoc queries run by a developer, or were they year-end reports run by the CEO? You can get some idea if you look at your history and check what day it was used. I would want to make a rollback script, drop the index, and keep a searchable document of indexes I dropped in case anyone complained about year-end reports not working.

That was the long answer, which is right and is not replaced by the short answer. The script below is the short answer. If all of the key fields in an index match, or at least the first three in a larger index, then there’s a good chance that one of the indexes isn’t required or they could be consolidated.

For instance, if you have two indexes that have the same two key fields in the same order but the second index also includes two columns, then you’ll probably see usage on both indexes (see the long-hard solution above). If one index isn’t in use in this situation, your job is easy. If both are used, then look into dropping the index that doesn’t include any columns. Why? Because a slightly larger index rarely increases the overhead too much, and the queries that used the first index aren’t usually impacted too much by using a slightly larger index that has the same foundation.

This area gets shaky on how you should handle it, and is really one of the few areas in SQL Server I consider more of an art. For instance, the script below will show you indexes that have 5 key columns where the first three match. You look and find that both indexes are being used, but users are complaining that updates are a bit slow, too. I'd lean towards condensing by taking the last two key columns of one index and including them on the other, but which index gets dropped and how do you test it?

Do my uses of “look into”, “rarely”, “usually”, and “too much” seem uncertain to you? Welcome to indexing.
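
To make the condensing idea concrete, here's a sketch using a made-up Sales table and index names. The last two key columns of one index become included columns on the other, and then only one index is kept.

--Before (made up for the example):
--  IX_Sales_A keys: StoreID, ProductID, SaleDate, CustomerID, RegionID
--  IX_Sales_B keys: StoreID, ProductID, SaleDate, SalesPersonID, Quantity
--Take the last two key columns of IX_Sales_A and include them on IX_Sales_B...
CREATE NONCLUSTERED INDEX IX_Sales_B
    ON dbo.Sales (StoreID, ProductID, SaleDate, SalesPersonID, Quantity)
    INCLUDE (CustomerID, RegionID)
    WITH (DROP_EXISTING = ON)

--...then IX_Sales_A is the one that gets dropped, after testing proves it out
DROP INDEX IX_Sales_A ON dbo.Sales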

Here are the scripts I use for both the short and long answers mentioned.  The first one is for the short answer, giving you indexes that are for the most part duplicated.

DECLARE @KeyFieldMatches INT

SET @KeyFieldMatches = 3 --Number of key fields to match in order

IF object_id('tempdb..#IndexList') IS NOT NULL BEGIN
    DROP TABLE #IndexList
END

IF object_id('tempdb..#IndexListShort') IS NOT NULL BEGIN
    DROP TABLE #IndexListShort
END

CREATE TABLE #IndexList (
      object_id    bigint not null
    , table_name varchar(150) not null
    , index_id int not null
    , index_name varchar(150) not null
    , index_column int not null
    , included bit not null
    , column_name varchar(150) not null
    , index_type int not null
)

CREATE CLUSTERED INDEX IndexList_Clu ON #IndexList (object_id, index_id, index_column, included)

CREATE TABLE #IndexListShort (
      object_id bigint not null
    , table_name varchar(150) not null
    , index_id int not null
    , index_name varchar(150) not null
    , column_names_first_n varchar(450) not null
    , column_names_key varchar(4000) not null
    , column_names_included varchar(4000) not null
)

DECLARE @object_id bigint
DECLARE @index_id int
DECLARE @List varchar(4000)
DECLARE @ListIncl varchar(4000)
DECLARE @ListShort varchar(450)

INSERT INTO #IndexList
SELECT o.object_id
    , table_name = o.name
    , i.index_id
    , index_name = i.name
    , index_column = ic.index_column_id
    , included = ic.is_included_column
    , column_name = c.name
    , index_type = i.type
FROM sys.objects o
    INNER JOIN sys.indexes i ON o.object_id = i.object_id
    INNER JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
    INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id

SET @object_id = (SELECT TOP 1 object_id FROM #IndexList)
SET @index_id = (SELECT TOP 1 index_id FROM #IndexList WHERE object_id = @object_id)

WHILE @object_id IS NOT NULL BEGIN
    SET @List = ''
    SET @ListIncl = ''
    SET @ListShort = ''

    SELECT @List = @List + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 0 ORDER BY i.index_column
    SELECT @List = substring(@List, 0, Len(@List))

    SELECT @ListIncl = @ListIncl + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 1 ORDER BY i.index_column
    SELECT @ListIncl = substring(@ListIncl, 0, Len(@ListIncl))

    SELECT @ListShort = @ListShort + COALESCE(i.column_name + ', ', '') FROM #IndexList i WHERE i.object_id = @object_id and i.index_id = @index_id and i.included = 0 and index_column <= @KeyFieldMatches ORDER BY i.index_column
    SELECT @ListShort = substring(@ListShort, 0, Len(@ListShort))

    INSERT INTO #IndexListShort 
    SELECT TOP 1 i.object_id
        , i.table_name
        , i.index_id
        , i.index_name
        , @ListShort
        , @List
        , CASE i.index_type --Clustered indexes include everything
            WHEN 1 THEN '*'
            ELSE @ListIncl
            END
    FROM #IndexList i
    WHERE i.object_id = @object_id and i.index_id = @index_id

    DELETE #IndexList WHERE object_id = @object_id and index_id = @index_id

    SET @object_id = (SELECT TOP 1 object_id FROM #IndexList)
    SET @index_id = (SELECT TOP 1 index_id FROM #IndexList WHERE object_id = @object_id)
END

SELECT table_name, index_name, column_names_key, column_names_included
FROM #IndexListShort i
WHERE EXISTS (SELECT * FROM #IndexListShort i2 WHERE i.object_id = i2.object_id AND i.column_names_first_n = i2.column_names_first_n GROUP BY object_id, column_names_first_n HAVING Count(*) > 1)
ORDER BY table_name, column_names_key

DROP TABLE #IndexList
DROP TABLE #IndexListShort

The long answer takes a bit more work to figure out. You need to store the information long-term, keep a snapshot of what it looked like most recently, and do all the calculations to make sure you have the right numbers. It’s really no different than what I do for other DMVs, so you’ll see this process duplicated a lot if you follow my work. The tables for storing this information:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[DMVIndex_Usage_Stats](
    [StartTime] [smalldatetime] NOT NULL,
    [EndTime] [smalldatetime] NOT NULL,
    [database_id] [smallint] NOT NULL,
    [object_id] [int] NOT NULL,
    [index_id] [int] NOT NULL,
    [user_seeks] [bigint] NOT NULL,
    [user_scans] [bigint] NOT NULL,
    [user_lookups] [bigint] NOT NULL,
    [user_updates] [bigint] NOT NULL,
    [system_seeks] [bigint] NOT NULL,
    [system_scans] [bigint] NOT NULL,
    [system_lookups] [bigint] NOT NULL,
    [system_updates] [bigint] NOT NULL,
 CONSTRAINT [PK_DMVIndex_Usage_Stats] PRIMARY KEY CLUSTERED 
(
    [StartTime] ASC,
    [database_id] ASC,
    [object_id] ASC,
    [index_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DMVIndex_Usage_Stats_Temp](
    [TS] [smalldatetime] NOT NULL,
    [database_id] [smallint] NOT NULL,
    [object_id] [int] NOT NULL,
    [index_id] [int] NOT NULL,
    [user_seeks] [bigint] NOT NULL,
    [user_scans] [bigint] NOT NULL,
    [user_lookups] [bigint] NOT NULL,
    [user_updates] [bigint] NOT NULL,
    [system_seeks] [bigint] NOT NULL,
    [system_scans] [bigint] NOT NULL,
    [system_lookups] [bigint] NOT NULL,
    [system_updates] [bigint] NOT NULL,
 CONSTRAINT [PK_DMVIndex_Usage_Stats_Temp] PRIMARY KEY CLUSTERED 
(
    [TS] ASC,
    [database_id] ASC,
    [object_id] ASC,
    [index_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

GO

And now the script that captures the information and does all of your calculations. You may have noticed that the tables themselves do NOT store the index or table names, which was intentional. Name changes happen, tables and indexes get dropped and created, and all of that is extra work for little benefit. Is the benefit worth it? I thought it wasn't. I'm not expecting everyone to agree with me here because there are arguments going the other way.

My advice is STEAL MY WORK!!! Don't copy it, you need to steal it. Make it your own, learn it well enough that you can improve it and build upon it. At that point you'll truly understand every bit of it and hopefully start up your own blog. I hear EvenSimplerSQLServer.com is available, but that's subject to change. If you don't want to go that far, then write your ideas in the comment box below. I change these posts from time to time, and this script was actually added months after the original post was made. Just don't ask me why it wasn't here in the first place, my answer is just….DOH!!!! Even saying I was only 6 days into blogging at the time I wrote the original doesn't make that acceptable.

DECLARE @TS as SmallDateTime
DECLARE @Old_TS as SmallDateTime
DECLARE @Reboot_TS as SmallDateTime

SET @TS = GetDate()
SET @Reboot_TS = (SELECT create_date FROM sys.databases WHERE [name] = 'tempdb')

--If update wasn't already done this minute
  -- Delete all but the newest two updates - Only the newest will be used, so one spare is kept
  -- Add the current stats
  -- Calculate the usage between the last two and save in the permanent table
    -- If any of the values go down then the DMV reset for that index
      --It's a horrible join clause, but it makes everything null and uses the isnull(x,0)
      --AND N.user_seeks >= O.user_seeks will be hit if database is refreshed, detached, etc.
IF 0 = (SELECT Count(*) FROM Perf..DMVIndex_Usage_Stats_Temp WHERE TS = @TS) BEGIN
    DELETE Perf..DMVIndex_Usage_Stats_Temp 
    WHERE TS NOT IN 
        (SELECT DISTINCT TOP 2 TS 
         FROM Perf..DMVIndex_Usage_Stats_Temp 
         ORDER BY TS DESC)

    SET @Old_TS = --Newest TS in the temp table
        (SELECT TOP 1 TS
        FROM Perf..DMVIndex_Usage_Stats_Temp
        ORDER BY TS DESC)
        
    INSERT INTO Perf.dbo.DMVIndex_Usage_Stats_Temp
    SELECT @TS
        , database_id
        , object_id
        , index_id
        , user_seeks
        , user_scans
        , user_lookups
        , user_updates
        , system_seeks
        , system_scans
        , system_lookups
        , system_updates
    FROM sys.dm_db_index_usage_stats
    
    INSERT INTO Perf..DMVIndex_Usage_Stats
    SELECT isnull(@Old_TS, @Reboot_TS)
        , @TS
        , N.database_id
        , N.object_id
        , N.index_id
        , N.user_seeks - isnull(O.user_seeks, 0)
        , N.user_scans - isnull(O.user_scans, 0)
        , N.user_lookups - isnull(O.user_lookups, 0)
        , N.user_updates - isnull(O.user_updates, 0)
        , N.system_seeks - isnull(O.system_seeks, 0)
        , N.system_scans - isnull(O.system_scans, 0)
        , N.system_lookups - isnull(O.system_lookups, 0)
        , N.system_updates - isnull(O.system_updates, 0)
    FROM Perf..DMVIndex_Usage_Stats_Temp N --N = New, O = Old
        LEFT JOIN Perf..DMVIndex_Usage_Stats_Temp O ON O.TS = @Old_TS
            AND O.database_id = N.database_id
            AND O.object_id = N.object_id
            AND O.index_id = N.index_id
            AND N.user_seeks >= O.user_seeks
            AND N.user_scans >= O.user_scans
            AND N.user_lookups >= O.user_lookups
            AND N.user_updates >= O.user_updates
            AND N.system_seeks >= O.system_seeks
            AND N.system_scans >= O.system_scans
            AND N.system_lookups >= O.system_lookups
            AND N.system_updates >= O.system_updates
            AND @Reboot_TS < O.TS --They're reset on reboot, so don't compare if it was rebooted since last reading
    WHERE N.TS = @TS
END

Job Schedules – SQL Agent


This info should be easier to get than it is. Keep in mind that for something to run, the job and its schedule both have to be enabled.

To be fair, the majority of this code was grabbed from the internet, and this is not something I modified to the extent that I could call it my own. Michael Abair is the original author, and the original code can be found at the link below. My contribution is limited to very minor changes and a lot of hitting the tab key. I know it works just as well, maybe even 2ms faster, without the whitespace, but I had to do it.

http://www.sqlservercentral.com/scripts/Jobs/69088/

SELECT /*S.job_id,*/ S.job_name, S.is_job_enabled, S.is_schedule_enabled, S.schedule_name, S.Description
    ,avg_duration_in_seconds = avg(datediff(s, '1/1/2000', ('1/1/2000 ' + cast(stuff(stuff(right('000000' + cast(h.run_duration as varchar(6)), 6),5,0,':'),3,0,':') as datetime))))
    ,number_of_runs = count(1)
FROM (SELECT SJ.job_id 
        , SJ.name as job_name
        , SJ.enabled as is_job_enabled
        , SS.enabled as is_schedule_enabled
        , SS.name as schedule_name
        , CASE freq_type
            WHEN 1 THEN 'Occurs on ' + STUFF(RIGHT(active_start_date, 4), 3,0, '/') + '/' + LEFT(active_start_date, 4) + ' at '
                + REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime) /* hh:mm:ss 24H */, 9), 14), ':000', ' ') /* HH:mm:ss:000AM/PM then replace the :000 with space.*/
            WHEN 4 THEN 'Occurs every ' + CAST(freq_interval as varchar(10)) + ' day(s) '
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 8 THEN 'Occurs every ' + CAST(freq_recurrence_factor as varchar(10))
                + ' week(s) on '
                +
                REPLACE( CASE WHEN freq_interval&1 = 1 THEN 'Sunday, ' ELSE '' END
                + CASE WHEN freq_interval&2 = 2 THEN 'Monday, ' ELSE '' END
                + CASE WHEN freq_interval&4 = 4 THEN 'Tuesday, ' ELSE '' END
                + CASE WHEN freq_interval&8 = 8 THEN 'Wednesday, ' ELSE '' END
                + CASE WHEN freq_interval&16 = 16 THEN 'Thursday, ' ELSE '' END
                + CASE WHEN freq_interval&32 = 32 THEN 'Friday, ' ELSE '' END
                + CASE WHEN freq_interval&64 = 64 THEN 'Saturday, ' ELSE '' END
                + '|', ', |', ' ') /* get rid of trailing comma */

                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 16 THEN 'Occurs every ' + CAST(freq_recurrence_factor as varchar(10))
                + ' month(s) on '
                + 'day ' + CAST(freq_interval as varchar(10)) + ' of that month ' 
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 32 THEN 'Occurs ' 
                + CASE freq_relative_interval
                    WHEN 1 THEN 'every first '
                    WHEN 2 THEN 'every second '
                    WHEN 4 THEN 'every third '
                    WHEN 8 THEN 'every fourth '
                    WHEN 16 THEN 'on the last '
                    END
                + CASE freq_interval 
                    WHEN 1 THEN 'Sunday'
                    WHEN 2 THEN 'Monday'
                    WHEN 3 THEN 'Tuesday'
                    WHEN 4 THEN 'Wednesday'
                    WHEN 5 THEN 'Thursday'
                    WHEN 6 THEN 'Friday'
                    WHEN 7 THEN 'Saturday'
                    WHEN 8 THEN 'day'
                    WHEN 9 THEN 'weekday'
                    WHEN 10 THEN 'weekend'
                    END
                + ' of every ' + CAST(freq_recurrence_factor as varchar(10)) + ' month(s) '
                + CASE freq_subday_type
                    WHEN 1 THEN 'at '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    WHEN 2 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' second(s)'
                    WHEN 4 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' minute(s)'
                    WHEN 8 THEN 'every ' + CAST(freq_subday_interval as varchar(10)) + ' hour(s)'
                    ELSE '' 
                    END
                + CASE 
                    WHEN freq_subday_type in (2,4,8) /* repeat seconds/mins/hours */
                        THEN ' between '+ LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_start_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                        + ' and '
                        + LTRIM(REPLACE( RIGHT(CONVERT(varchar(30), CAST(convert(varchar(8), STUFF(STUFF(RIGHT('000000' + CAST(active_end_time as varchar(10)), 6), 3, 0, ':' ), 6, 0, ':' ), 8) as datetime), 9), 14), ':000', ' '))
                    ELSE ''
                    END
            WHEN 64 THEN 'Runs when the SQL Server Agent service starts'
            WHEN 128 THEN 'Runs when the computer is idle'
            END 
        AS [Description]
        FROM msdb.dbo.sysjobs SJ
            INNER JOIN msdb.dbo.sysjobschedules SJS ON SJ.job_id = SJS.job_id
            INNER JOIN msdb.dbo.sysschedules SS ON SJS.schedule_id = SS.schedule_id
            INNER JOIN msdb.dbo.syscategories SC ON SJ.category_id = SC.category_id
        --WHERE SC.name = 'Name from query below'
    ) S
    INNER JOIN msdb.dbo.sysjobhistory H ON s.job_id = H.job_id AND H.step_id = 0
WHERE H.run_date >= /* 7 days ago */ CAST(CONVERT(VarChar(8), DateAdd(d, -7, GetDate()), 112) AS INT) --format GetDate once as yyyymmdd (run_date's format) to compare against multiple run_dates
GROUP BY /*S.job_id,*/ S.job_name, S.is_job_enabled, S.is_schedule_enabled, S.schedule_name, S.Description
ORDER BY S.job_name

--SELECT * FROM msdb..syscategories --If you want to uncomment the line "WHERE SC.name = ...", copy/paste from this

Roll Logs – Backup Recovery


Rolling log files for a day, especially with 15 or even 5 minute log backups, is a pain at best. Here's a slightly better way to do it. Set the two variables at the top to the directory where the log backups are and the database you're looking to restore. Assuming you named your log backups as DatabaseName_Log*, it will list all the log backups in that folder in chronological order as an output script, which is best read when doing results to text (Query/Results To/Results To Text).

Here’s what this won’t do. It won’t require SQL Server to remember taking the backups, so you can run it from any server that has access to this folder. It won’t restore anything for you, it will just give you the text to copy/paste and run yourself. It won’t leave xp_cmdshell on, although it does require it to be turned on for a bit if it was off.

Since this does nothing more than display the code you’ll probably run next, feel free to run it even if you’re just playing around or only want to roll half the logs.

DECLARE @LogBackupDirectory VarChar(1000)
DECLARE @DatabaseName VarChar(256)

SELECT @LogBackupDirectory = '\\BUServer\BUShare\Folder\logs\'
    , @DatabaseName = 'model'

---Don't change anything below this point unless upgrading the script---
---Also, slight change if backup and restore database names are different---

SET NOCOUNT ON

DECLARE @XP_CmdShell_Enabled INT
    , @XP_CmdShell_Command VarChar(4000)

IF Object_ID('TempDB..#temp') IS NOT NULL BEGIN
    DROP TABLE #temp
END

create table #temp
(
      Dir VARCHAR(MAX)
)

SELECT @XP_CmdShell_Enabled = CONVERT(INT, ISNULL(value, value_in_use))
FROM master.sys.configurations
WHERE name = 'xp_cmdshell'

IF @XP_CmdShell_Enabled = 0 BEGIN
    EXEC sp_configure 'xp_cmdshell', 1
    RECONFIGURE WITH OVERRIDE 
END 

SELECT @XP_CmdShell_Command = 'dir  /od /b ' + @LogBackupDirectory + @DatabaseName + '_log*'

INSERT INTO #temp
EXECUTE xp_cmdshell @XP_CmdShell_Command

IF @XP_CmdShell_Enabled = 0 BEGIN
    EXEC sp_configure 'xp_cmdshell', 0
    RECONFIGURE WITH OVERRIDE 
END 

DELETE #temp
WHERE Dir IS NULL

SELECT commands = 'Restore Log [' + @DatabaseName + '] from DISK=N''' + @LogBackupDirectory + Dir + ''' with NoRecovery' 
FROM #temp

SELECT 'RESTORE DATABASE [' + @DatabaseName + '] WITH RECOVERY'

Table Sizes


There’s no simple way in SQL Server to see the sizes of all the tables and their indexes.  Even seeing a single table’s size through SSMS can take a while.  The solution below is grabbed from statistics, so there’s no guarantee you’re getting exact numbers.  However, exact numbers don’t tend to matter, as you care more that the table is about 1 GB, not that it’s exactly 998 MB.

There are variations of this code floating all over the place, and I tweaked the results to be in the easiest to read format possible.

The only issue I have with this script is that you have to scroll to the next-to-the-last line to filter which tables it’s pulling. Yes, I’m picky, and I don’t like to scroll to see things that change.

SELECT Database_ID = DB_ID()
    , Database_Name = DB_NAME()
    , Schema_Name = a3.name
    , TableName = a2.name
    , TableSize_MB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , RowCounts = a1.rows
    , DataSize_MB = a1.data / 128
    , IndexSize_MB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
    , Free_MB = (CASE WHEN (a1.reserved + ISNULL(a4.reserved,0)) > a1.used 
                        THEN (a1.reserved + ISNULL(a4.reserved,0)) - a1.used 
                        ELSE 0 
                    END) / 128
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> 'S' and a2.type <> 'IT'
    --AND a2.name IN ('spt_values')
ORDER BY a1.reserved desc

Index Columns and Names


Don’t trust the name of an index to tell you what it is. The name is wrong, or at least that’s what you have to assume.

I'm a huge fan of index naming conventions. My favorite starts out by saying it's an index (IDX), followed by the table name, then the key columns, the letters INCL if there are included columns, each included column listed (if reasonable; just do all or nothing for each index), a U or N denoting whether it's unique, then a C or N denoting whether it's clustered. However, there are too many times an index naming convention has gotten me in trouble, where the index IDX_TableName_Key1_Key2_Key3_INCL_Incl1_Incl2_U_N didn't actually have the column Key1 in it, and it wasn't unique either.

My answer for that is to query the system tables to tell me what each index is. This is exactly the information I would like to see in a naming convention and pulled from a source that’s never wrong.

SELECT TableName = t.name
    , IndexName = i.name
    , i.is_unique
    , IndexType = i.type_desc
    , FileGroupName = d.name
    , KeyColumnList = substring((SELECT (', ' + c.name)
                            FROM sys.index_columns ic
                                INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                            WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                AND ic.is_included_column = 0
                            ORDER BY ic.key_ordinal
                            FOR XML PATH ('')
                            ), 3, 2000)
    , IncludedColumnList = substring((SELECT (', ' + c.name)
                            FROM sys.index_columns ic
                                INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                            WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                AND ic.is_included_column = 1
                            ORDER BY ic.key_ordinal
                            FOR XML PATH ('')
                            ), 3, 2000)
FROM sys.tables t
    INNER JOIN sys.indexes i ON t.object_id = i.object_id
    INNER JOIN sys.data_spaces d ON i.data_space_id = d.data_space_id
WHERE t.name LIKE '%'
    --AND (i.is_unique = 1 OR i.type_desc = 'Clustered')
ORDER BY 1, 2, 4, 3

You may say that's a temporary one-off inquiry that doesn't fix anything, and it is. However, the permanent fix is very invasive, will void your support contracts, can cause damage, would cause pieces of future upgrades to fail, may not work if it generates a name that's too long, and other minor details. Assuming you have a home-grown database and absolutely no query hints specifying an index anywhere in your code, have a dev environment recently refreshed from prod, and have looked into every other issue that I never even considered, do I have some code for you!!!

SELECT Command = '--DON''T RUN THIS WITHOUT FIRST LOOKING INTO THE CONSEQUENCES AND UPDATING YOUR RESUME'

UNION 
SELECT Command = 'EXEC sp_rename ''' + TableName + '.' + IndexName + ''', ''' + 'IX_' + TableName + ColumnList + case when len(IncludeList) > 3 then '_INCL' else '' end + ISNULL(IncludeList, '') + '_' + case is_unique when 1 then 'U' else 'N' end + '_' + left(IndexType COLLATE SQL_Latin1_General_CP1_CS_AS, 1) + ''' , ''INDEX'''
FROM (
        SELECT TableName = t.name
            , IndexName = i.name
            , i.is_unique
            , IndexType = i.type_desc
            , FileGroupName = d.name
            , ColumnList = substring((SELECT ('_' + c.name)
                                    FROM sys.index_columns ic
                                        INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                                    WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                        AND ic.is_included_column = 0
                                    ORDER BY ic.key_ordinal
                                    FOR XML PATH ('')
                                    ), 1, 2000)
            , IncludeList = substring((SELECT ('_' + c.name)
                                    FROM sys.index_columns ic
                                        INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
                                    WHERE i.object_id = ic.object_id AND i.index_id = ic.index_id
                                        AND ic.is_included_column = 1
                                    ORDER BY ic.key_ordinal
                                    FOR XML PATH ('')
                                    ), 1, 2000)
        FROM sys.tables t
            INNER JOIN sys.indexes i ON t.object_id = i.object_id
            INNER JOIN sys.data_spaces d ON i.data_space_id = d.data_space_id
        WHERE t.name IN (
                  'Table_1'
                , 'Table_2')
    )x
ORDER BY 1

Running Processes


Sp_who2 is not used by this DBA. That info is amazingly good to have, but I want more. Don't get me wrong, sp_who2 is kinda hidden in this one as it's really just a proc that does little more than “SELECT * FROM SysProcesses”, and I use SysProcesses where I couldn't find the same info in the DMVs that were supposed to replace it. Don't go thinking that SysProcesses is just a deprecated dinosaur; even MVPs still use it: http://sqlblog.com/blogs/kalen_delaney/archive/2008/06/29/why-i-still-need-sysprocesses.aspx

Threads and RunningThreads are new for me here.  It’s good to see when a query is split into 100 threads and only 1 is doing anything.  It’s also easier on the eyes to see the number 100 instead of seeing 100 separate rows for one SPID.

The percent complete and estimated completion time on this script are typically blank, with the major exceptions being backups and restores where these columns are the best thing ever! 

BlockedBy is something overlooked by many people. If your query seems to be running forever when it normally runs quickly, there's about a 75% chance you're just waiting for another process to complete. It's also why you'll see the notorious (nolock) in almost everything I write for myself, but that's another conversation. The HeadBlocker attempts to find the start of a blocking chain, although I can say that I know my logic here isn't bulletproof.

My wait types used to be perfect, where if the wait time was 0 then the wait type was lowercase, otherwise it was all caps. It seems I broke it slightly, but it's almost perfect. My OCD will eventually get the best of me here, but it has to compete for time with the rest of my life.

The rest is pretty self-explanatory with the exception of query plans. They are the last line of the SELECT and FROM clauses that are commented out. It's somewhat rare to need the plan, and it's not too uncommon for this script to run noticeably slower when looking for plans. Because of that, I uncomment that column and the cross apply that makes it possible only when needed.

SELECT SPID = er.session_id
    , ot.Threads
    , RunningThreads = coalesce(rsp.RunningThreads,0)
    , Pct_Comp = er.percent_complete
    , Est_Comp_Time = CASE er.estimated_completion_time WHEN 0 THEN NULL ELSE dateadd(ms, er.estimated_completion_time, getdate()) END 
    , er.status
    , er.command
    , database_name = sd.name
    , BlockedBy = wt.blocking_session_id
    , HeadBlocker = coalesce(hb5.session_id, hb4.session_id, hb3.session_id, hb2.session_id, hb1.session_id)
    , wait_type = coalesce(CASE er.wait_type WHEN 'CXPACKET' THEN 'CXPACKET - ' + sp.lastwaittype1 ELSE sp.lastwaittype1 END, lower(er.last_wait_type)) --Lowercase denotes it's not currently waiting, also noted by a wait time of 0.
    , Wait_Time_Sec = Cast(er.wait_time/1000.0 as DEC(20,3))
    , er.wait_resource
    , Duration_Sec = Cast(DATEDIFF(s, er.start_time, GETDATE()) as DEC(20,0))
    , CPU_Sec = Cast(er.cpu_time/1000.0 as DEC(20,3))
    , Reads_K = Cast(er.reads/1000.0 as DEC(20,3))
    , Writes_K = Cast(er.writes/1000.0 as DEC(20,3))
    , [Statement] = SUBSTRING (st.text, er.statement_start_offset/2,
        abs(CASE WHEN er.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), st.text)) * 2 
            ELSE er.statement_end_offset END - er.statement_start_offset)/2)
    , st.text as Query
    , es.login_time
    , es.host_name
    , program_name = CASE LEFT(es.program_name, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(es.program_name,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(es.program_name, 67, len(es.program_name)-67)
                    ELSE es.program_name
                    END  
    , es.client_interface_name
    , es.login_name
    , es.status
    , es.total_scheduled_time
    , es.total_elapsed_time
    , er.start_time
    , es.last_request_start_time
    , es.last_request_end_time
    , er.database_id  
    --, qp.query_plan 
FROM sys.dm_exec_requests er
    INNER JOIN sys.dm_exec_Sessions es on er.session_id=es.session_id
    LEFT JOIN sys.databases sd on er.database_id=sd.database_id
    INNER JOIN (SELECT session_id, count(1) Threads FROM sys.dm_os_tasks GROUP BY session_id) ot on er.session_id=ot.session_id
    LEFT JOIN (SELECT spid, LastWaitType1 = MIN(lastwaittype), LastWaitType2 = MAX(lastwaittype) FROM sysprocesses sp WHERE waittime > 0 AND lastwaittype <> 'cxpacket' GROUP BY spid) sp ON er.session_id = sp.spid
    LEFT JOIN (SELECT spid, RunningThreads = COUNT(1) FROM sysprocesses sp WHERE waittime = 0 GROUP BY spid) rsp ON er.session_id = rsp.spid
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt WHERE wt.blocking_session_id <> wt.session_id GROUP BY session_id) wt ON er.session_id = wt.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb1 ON wt.blocking_session_id = hb1.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb2 ON hb1.blocking_session_id = hb2.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb3 ON hb2.blocking_session_id = hb3.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb4 ON hb3.blocking_session_id = hb4.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb5 ON hb4.blocking_session_id = hb5.session_id 
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st  
    --CROSS APPLY sys.dm_exec_query_plan(er.plan_handle) qp
WHERE er.session_id <> @@SPID
    --AND es.host_name like '%%'
    --AND er.session_id = 2702
ORDER BY er.percent_complete DESC, er.cpu_time DESC, er.session_id

--Use the below command to get the last input of an open session id
--dbcc inputbuffer(61)

Backup History


Can you tell me the drive we were backing up server #58 to exactly 8 months ago, and, if possible, the exact file name so I can pull the backup from tape for you? Or, last night's backup didn't finish until after the tapes started; did those backups always take that long, and is it taking longer because the databases are actually growing that much?

The answer to those questions tends to be “Ok, no problem.”

SELECT TOP 100 bs.database_name
    , bs.backup_start_date
    , bs.backup_finish_date
    , backup_duration_minutes = (DateDiff(second, backup_start_date, backup_finish_date)+30)/60
    , backup_duration_seconds = DateDiff(second, backup_start_date, backup_finish_date)
    , backup_size_mb = Cast(bs.backup_size / 1024 / 1024 as Int)
    --, compressed_backup_size_mb = Cast(bs.compressed_backup_size / 1024 / 1024 as Int) --SQL 2008+ only
    , backup_type = CASE bs.type 
                        WHEN 'D' THEN 'FULL' 
                        WHEN 'I' THEN 'Diff' 
                        WHEN 'L' THEN 'Log' 
                        ELSE 'UNKNOWN' 
                    END
    , bmf.physical_device_name
FROM msdb..backupset bs
    INNER JOIN msdb..backupmediafamily bmf ON bs.media_set_id = bmf.media_set_id
--WHERE bs.database_name like 'msdb'
    --AND bs.type <> 'L'
    --AND backup_start_date BETWEEN '2013-05-13 00:00' AND '2013-05-13 08:00'
ORDER BY Backup_Start_Date DESC

It’s common in my scripts for me to put 10 things in the where clause and comment them out. My problem is that I don’t want to look around for field names to filter my scripts, but I don’t always want to filter my scripts the same way. It works for me, and you’ll either get used to it or rewrite it for your own use.

Cluster Info


This script is so simple you’ll start off by asking why you’d ever use it.  Then you’ll use it and wonder why no one gave this to you as part of your new-hire orientation at your first DBA job.  The truth is that in larger environments there will always be times when you’re asked what the active node is, are two instances running on the same node, what nodes are in that cluster, what drives belong to what instance, etc…  Here’s your answer.

SELECT VirtualServerName = SERVERPROPERTY('ServerName')
    , ActiveNode = SERVERPROPERTY('ComputerNamePhysicalNetBIOS') 
    , PassiveNode = CASE (SELECT COUNT(*) FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS'))
        WHEN 0 THEN 'Stand-Alone'
        WHEN 1 THEN (SELECT NodeName FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS'))
        ELSE Cast((SELECT COUNT(*) FROM sys.dm_os_cluster_nodes WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS')) as VarChar(2)) + ' Passive Nodes'
        END

SELECT PassiveNodes = NodeName 
FROM sys.dm_os_cluster_nodes 
WHERE NodeName <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS') 

SELECT ClusterDrives = DriveName 
FROM sys.dm_io_cluster_shared_drives
ORDER BY DriveName

File Sizes


There are two scripts I like to use to look at the file sizes on servers. The first one is everything I could hope for on servers where there is only a single file per filegroup. All of the vital information is in one place. I know, DBAs tend to be good at math and there was no need for me to make all of these columns. However, any room I leave to demonstrate my mathematical powers in a professional environment…. Ok, look, CPU time is cheap these days, I'll go with that for my reasoning.

IF Object_ID('TempDB..##FileSize') IS NOT NULL BEGIN
    DROP TABLE ##FileSize
END

CREATE TABLE ##FileSize
(      DB VarChar(128)
    , FileLogicalName VarChar(128)
    , FilePhysicalName VarChar(512)
    , FileGroup VarChar(128)
    , Used_MB VarChar(128)
    , Free_MB VarChar(128)
    , Size_MB VarChar(128)
    , GrowthRate VarChar(128)
    , MaxSize VarChar(128)
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO ##FileSize
SELECT DB = db_name()
    , FileLogicalName = f.name
    , FilePhysicalName = f.physical_name
    , FileGroup = ISNULL(g.name, f.Type_Desc) 
    , Used_MB = REPLACE(CONVERT(varchar(100), (CAST((FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
    , Free_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size - FileProperty(f.name, ''SpaceUsed'')) / 128 AS money)), 1), ''.00'', '''')
    , Size_MB = REPLACE(CONVERT(varchar(100), (CAST((f.size) / 128 AS money)), 1), ''.00'', '''')
    , GrowthRate = Case Is_Percent_Growth
                    WHEN 1 THEN '''' + Cast(Growth as VarChar(100)) + ''%''
                    ELSE REPLACE(CONVERT(varchar(100), (CAST(Growth/128 AS money)), 1), ''.00'', '''') + '' MB''
                    END 
    , MaxSize = Case Max_Size
                    WHEN -1 THEN ''---''
                    WHEN 268435456 THEN ''---''
                    ELSE REPLACE(CONVERT(varchar(100), (CAST(Max_Size/128 AS money)), 1), ''.00'', '''') + '' MB''
                    END
FROM sys.database_files f
    LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
'

SELECT *
    , PctUsed = Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
    --, PctFree = 100 - Cast(100 * replace(Used_MB, ',', '') / (Cast(replace(Size_MB,',', '') as Dec(20,2)) + .01) as Dec(20,2))
FROM ##FileSize
--WHERE DB = 'msdb'
ORDER BY FilePhysicalName

DROP TABLE ##FileSize

The problem with that script is that there are times it makes sense to have multiple files in a filegroup. Be it disk limitations, best practices on a SAN saying not to extend a LUN, or a predecessor who tossed logic to the wind, there are many reasons you could have multiple files, where the previous script leaves you doing math to figure out how big the filegroup is or how much room for growth you have. You know what I said about doing math…CPU time is cheap. Here's what I use when I need to see the filegroup sizes.

IF Object_ID('TempDB..##FileGroupSize') IS NOT NULL BEGIN
    DROP TABLE ##FileGroupSize
END

CREATE TABLE ##FileGroupSize
(      DB VarChar(128)
    , FileGroup VarChar(128)
    , Used_MB Int
    , Free_MB Int
    , Size_MB Int
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO ##FileGroupSize
SELECT DB = db_name()
    , FileGroup = ISNULL(g.name, f.Type_Desc) 
    , Used_MB = SUM(FileProperty(f.name, ''SpaceUsed'')) / 128
    , Free_MB = SUM(f.size - FileProperty(f.name, ''SpaceUsed'')) / 128
    , Size_MB = SUM(f.size) / 128 
FROM sys.database_files f
    LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
GROUP BY f.Type_Desc, g.name
'

SELECT *
    , PctUsed = Cast(100 * Used_MB / (Cast(Size_MB as Dec(20,2)) + .01) as Dec(20,2))
    , PctFree = 100 - Cast(100 * Used_MB / (Cast(Size_MB as Dec(20,2)) + .01) as Dec(20,2))
FROM ##FileGroupSize
ORDER BY DB, FileGroup

DROP TABLE ##FileGroupSize

It should be noted that I have different rules for different situations.  If a script is part of a stored proc or view then * isn’t even a consideration.  If an external program, even an SSRS report, will consume the code then I would never try to format it using SQL.  However, these are saved in my scripts folder as FileSize.sql and FilegroupSizes.sql, and my personal .sql files get to break a rule or two.

Copy / Paste Code


Copy / paste of formatted text is a nightmare in WordPress. I'm singling out WordPress because I use it, and, from what I can see, every blog site has this issue. If you have a SQL script (again, singled out because that's what I use) and copy / paste from SSMS to WordPress, it loses all color. If you route it through Microsoft Word in the middle, it doesn't look good either. If you put it in Word, save it as HTML, view the source, then paste that source into the text (not visual) editor when writing a post, you'll end up with extra line breaks and gaps between each line. What once took 4 lines in SSMS now takes 12. Who cares about how long it is; no one will want to read it.

The best solution so far is a program called vs2html.exe, which I found here: http://www.jeffblankenburg.com/2008/10/07/tutorial-6-visual-studio-code-highlighting-to-html/

Copy what you want straight out of SQL Server Management Studio, run this program, and paste into the text editor in WordPress. To get to the text editor you need to save and edit your post, then click on “Text” instead of the default “Visual”. When you get there you just paste it where you want it.

Database Assessment


This script looks much more intimidating than the results it produces, so I recommend running it before reading through it. I'll run this when I first get on an unfamiliar server as a basic assessment of what's there, and it finds some hard-to-find issues with no effort.

The issues you’ll find include:

  • Backups – Where are they, when did they last occur, how big, etc.
    • I always look for databases (excluding model) that are in full or bulk logged recovery with no log backups.  This is a common issue, especially with vendor databases, that will fill your drives with data you didn’t care to have.
      • To know how critical it is when you find this, and you probably will, I include the log size and percent full.
    • There are variables in this script for a cutoff date which will ignore old backups. If you set @use_cutoffdate to 1 then any database not backed up since the value of @cutoffdate will show up as never being backed up. I love this feature because there’s no chance that you’ll misread a year-old backup as being taken last night.
    • Warning, the size of the backup is the size of the data, not the backup file.  To keep this compatible with SQL 2005, I’m not grabbing the compressed_backup_size field.
    • Differential backup results are commented out for a reason…most people don’t use them.  There is absolutely no reason to comment this out other than limiting the number of columns in my results.
  • Compatibility Level – If you upgrade a server it’s easy to leave a database behind.  I’ve seen SQL 2000 compatibility on SQL 2008 R2 for no other reason than “oops”.
  • Collation name – If this isn’t consistent you could run into issues that are next to impossible to debug.  The exceptions, ironically, are Microsoft databases.  Report Server and SharePoint databases specifically will be a different collation than anything else, but they were written with that in mind and handle it gracefully and shouldn’t be changed.
  • Auto Close – Ok, I admit it, I actually have this one commented out in my personal copy of this code.  If this is turned on it will typically blow up your error logs and make them half-way unreadable, and I’ll see this when I attempt to read them.  However, it’s good to know if this is ever turned on as it will kill performance.  I should uncomment this in my version with the mentality that too much information will slightly annoy you, too little will kill you.
Especially with most of us running more vendor databases than home-grown, don’t trust that this stuff was done right.
DECLARE @cutoffdate datetime
DECLARE @use_cutoffdate bit
SET @cutoffdate = DateAdd(Month, -3, GetDate())
SET @use_cutoffdate = 0

IF Object_ID('TempDB..##FileGroupSize') IS NOT NULL BEGIN
    DROP TABLE ##FileGroupSize
END

CREATE TABLE ##FileGroupSize
(      DB VarChar(128)
    , FileGroup VarChar(128)
    , Used_MB Int
    , Free_MB Int
    , Size_MB Int
)

exec sp_MSforeachdb N'use [?]; 
INSERT INTO ##FileGroupSize
SELECT DB = db_name()
    , FileGroup = ISNULL(g.name, f.Type_Desc) 
    , Used_MB = SUM(FileProperty(f.name, ''SpaceUsed'')) / 128
    , Free_MB = SUM(f.size - FileProperty(f.name, ''SpaceUsed'')) / 128
    , Size_MB = SUM(f.size) / 128 
FROM sys.database_files f
    LEFT JOIN sys.filegroups g on f.data_space_id = g.data_space_id
WHERE f.Type_Desc = ''LOG''
GROUP BY f.Type_Desc, g.name
'
SELECT database_name = d.name 
    , d.create_date
    , d.compatibility_level
    , LastFullBU = BUFull.last_backup  
    , LastDiffBU = BUDiff.last_backup  
    , LastLogBU = BULog.last_backup  
    , Recovery_Model = d.recovery_model_desc
    , Log_Size_MB = LogSize.Size_MB
    , Log_Pct_Used = Cast(100 * Used_MB / (Cast(Size_MB as Dec(20,2)) + .01) as Dec(20,2))
    , d.collation_name
    , LastFullLoc = BUFullLoc.physical_device_name
    , LastFullSize_MB = Cast(BUFullLoc.backup_size / 1024 / 1024 as bigint)
    --, LastDiffLoc = BUDiffLoc.physical_device_name
    --, LastDiffSize_MB = Cast(BUDiffLoc.backup_size / 1024 / 1024 as bigint)
    , LastLogLoc = BULogLoc.physical_device_name
    , LastLogSize_MB = Cast(BULogLoc.backup_size / 1024 / 1024 as bigint)
    , d.is_auto_close_on
FROM master.sys.databases d with (nolock)
    LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
        FROM msdb..backupset b with (nolock)
        WHERE b.type = 'D' 
            AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
        GROUP BY b.database_name
        ) BUFull ON d.name = BUFull.database_name
LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
        FROM msdb..backupset b with (nolock) 
        WHERE b.type = 'I' 
            AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
        GROUP BY b.database_name
        ) BUDiff ON d.name = BUDiff.database_name 
    LEFT JOIN (SELECT b.database_name, max(b.backup_finish_date) last_backup
        FROM msdb..backupset b with (nolock) 
        WHERE b.type = 'L' 
            AND (b.backup_finish_date > @cutoffdate or @use_cutoffdate = 0)
        GROUP BY b.database_name
        ) BULog ON d.name = BULog.database_name 
    LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
        FROM msdb..backupset b with (nolock)
            INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id
        WHERE b.type = 'D') BUFullLoc ON BUFull.database_name = BUFullLoc.database_name AND BUFull.last_backup = BUFullLoc.backup_finish_date
    LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
        FROM msdb..backupset b with (nolock)
            INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id
        WHERE b.type = 'I') BUDiffLoc ON BUDiff.database_name = BUDiffLoc.database_name AND BUDiff.last_backup = BUDiffLoc.backup_finish_date
    LEFT JOIN (SELECT f.physical_device_name, b.database_name, b.backup_finish_date, b.backup_size
        FROM msdb..backupset b with (nolock)
            INNER JOIN msdb..backupmediafamily f ON b.media_set_id = f.media_set_id 
        WHERE b.type = 'L') BULogLoc ON BULog.database_name = BULogLoc.database_name AND BULog.last_backup = BULogLoc.backup_finish_date
    LEFT JOIN ##FileGroupSize LogSize ON d.name = LogSize.DB
WHERE d.name <> 'tempdb'
    AND d.state = 0 --Online

DROP TABLE ##FileGroupSize

Intro


Like many SQL Server DBAs, I run into a problem at work or get an idea in my head, then scour the internet to see if someone else had the same thing so I don't have to reinvent the wheel. Some DBAs stop at that point; I don't. I grab a script that serves my basic needs, test it out, rewrite it to be as simple as possible while granting me every wish and desire I ever had for the process, then I save it off.

Beyond that, my biggest thing on SQL Server is base-lining and trending. A common issue you'll run into is that in the middle of a performance incident you'll find a stat that seems off on your biggest, most critical server. The other servers don't look like that, but they don't have the hardware or the load, so you don't know whether to trust it or not. I've read before (I think from Brad McGehee?) that the only thing consistent between all database servers is making sure the power is turned on, and that's right. You're lost, and the only solution you have should have started over a week ago so you'd know what “normal” is for that server. Many of my scripts rely on info being gathered every 1, 5, or 15 minutes and will, by default, compare the current snapshot to the same time window yesterday and the same day last week.
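
As a rough sketch of what that comparison can look like, here's a query against the DMVIndex_Usage_Stats table from the index section above. The one-hour window is arbitrary, and the columns you compare are whatever you happen to be trending.

DECLARE @WindowEnd SmallDateTime
DECLARE @WindowStart SmallDateTime

SET @WindowEnd = GetDate()
SET @WindowStart = DateAdd(Hour, -1, @WindowEnd)

--Compare the last hour to the same hour yesterday and the same hour a week ago
SELECT u.database_id
    , u.object_id
    , u.index_id
    , Seeks_LastHour = SUM(CASE WHEN u.EndTime BETWEEN @WindowStart AND @WindowEnd THEN u.user_seeks ELSE 0 END)
    , Seeks_Yesterday = SUM(CASE WHEN u.EndTime BETWEEN DateAdd(Day, -1, @WindowStart) AND DateAdd(Day, -1, @WindowEnd) THEN u.user_seeks ELSE 0 END)
    , Seeks_LastWeek = SUM(CASE WHEN u.EndTime BETWEEN DateAdd(Day, -7, @WindowStart) AND DateAdd(Day, -7, @WindowEnd) THEN u.user_seeks ELSE 0 END)
    , Updates_LastHour = SUM(CASE WHEN u.EndTime BETWEEN @WindowStart AND @WindowEnd THEN u.user_updates ELSE 0 END)
    , Updates_Yesterday = SUM(CASE WHEN u.EndTime BETWEEN DateAdd(Day, -1, @WindowStart) AND DateAdd(Day, -1, @WindowEnd) THEN u.user_updates ELSE 0 END)
    , Updates_LastWeek = SUM(CASE WHEN u.EndTime BETWEEN DateAdd(Day, -7, @WindowStart) AND DateAdd(Day, -7, @WindowEnd) THEN u.user_updates ELSE 0 END)
FROM Perf..DMVIndex_Usage_Stats u
WHERE u.EndTime BETWEEN DateAdd(Day, -8, @WindowStart) AND @WindowEnd
GROUP BY u.database_id
    , u.object_id
    , u.index_id
ORDER BY Updates_LastHour DESC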

Finally, my personal approach to it all.  I don’t take anything too seriously as it stifles creativity and sets you up for arguments on things not worth arguing about.  Some people will tell you there’s a right way and a wrong way…they’re wrong.  There are thousands of right ways and millions of wrong ways.  There are times I’m wrong, and always another “right” way to handle the way I do everything I do.  Don’t keep these things to yourself, tell me, yell at me, compliment me, suggest other ways.  The purpose of this blog is to have a previous coworker stop calling me a leech while at the same time leech off of your critiques of my work.

Ask for anything.  I may have a script written for it, I may write queries for it ad-hoc (rare event…if I wrote it, I saved it), and it may be a new concept I never considered before.  In any case, if it’s worth me knowing then I’ll find a way to know it, and if I know it then there’s no reason not to tell the world.